Commit Graph

28269 Commits

Author SHA1 Message Date
Sinan Kaya
1b6266ebe3 watchdog: Reduce message verbosity
Code is emitting the following error message during boot on systems
without PMU hardware support while probing NMI capability.

 NMI watchdog: Perf event create on CPU 0 failed with -2

This error is emitted as the perf subsystem returns -ENOENT due to lack of
PMUs in the system.

It is followed by the warning that NMI watchdog is disabled:

  NMI watchdog: Perf NMI watchdog permanently disabled

While NMI disabled information is useful for ordinary users, seeing a PERF
event create failed with error code -2 is not.

Reduce the message severity to debug so that if debugging is still possible
in case the error code returned by perf is required for analysis.

Signed-off-by: Sinan Kaya <okaya@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=599368
Link: https://lkml.kernel.org/r/20180803060943.2643-1-okaya@kernel.org
2018-08-03 12:19:08 +02:00
Palmer Dabbelt
4f7799d96e genirq/irqchip: Remove MULTI_IRQ_HANDLER as it's now obselete
Now that every user of MULTI_IRQ_HANDLER has been convereted over to use
GENERIC_IRQ_MULTI_HANDLER remove the references to MULTI_IRQ_HANDLER.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux@armlinux.org.uk
Cc: catalin.marinas@arm.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: jonas@southpole.se
Cc: stefan.kristiansson@saunalahti.fi
Cc: shorne@gmail.com
Cc: jason@lakedaemon.net
Cc: marc.zyngier@arm.com
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: nicolas.pitre@linaro.org
Cc: vladimir.murzin@arm.com
Cc: keescook@chromium.org
Cc: jinb.park7@gmail.com
Cc: yamada.masahiro@socionext.com
Cc: alexandre.belloni@bootlin.com
Cc: pombredanne@nexb.com
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: kstewart@linuxfoundation.org
Cc: jhogan@kernel.org
Cc: mark.rutland@arm.com
Cc: ard.biesheuvel@linaro.org
Cc: james.morse@arm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: openrisc@lists.librecores.org
Link: https://lkml.kernel.org/r/20180622170126.6308-6-palmer@sifive.com
2018-08-03 12:14:10 +02:00
Roman Gushchin
cd33943176 bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
7b5dd2bde7 bpf: don't allow create maps of cgroup local storages
As there is one-to-one relation between a bpf program
and cgroup local storage map, there is no sense in
creating a map of cgroup local storage maps.

Forbid it explicitly to avoid possible side effects.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
3e6a4b3e02 bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way
that the access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL-check is required.

This patch introduces BPF_PTR_TO_MAP_VALUE return type,
which is required to cause the verifier accept programs,
which are not checking the map value pointer for being NULL.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
394e40a297 bpf: extend bpf_prog_array to store pointers to the cgroup storage
This patch converts bpf_prog_array from an array of prog pointers
to the array of struct bpf_prog_array_item elements.

This allows to save a cgroup storage pointer for each bpf program
efficiently attached to a cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
d7bf2c10af bpf: allocate cgroup storage entries on attaching bpf programs
If a bpf program is using cgroup local storage, allocate
a bpf_cgroup_storage structure automatically on attaching the program
to a cgroup and save the pointer into the corresponding bpf_prog_list
entry.
Analogically, release the cgroup local storage on detaching
of the bpf program.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
aa0ad5b039 bpf: pass a pointer to a cgroup storage using pcpu variable
This commit introduces the bpf_cgroup_storage_set() helper,
which will be used to pass a pointer to a cgroup storage
to the bpf helper.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
de9cbbaadb bpf: introduce cgroup storage maps
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.

>From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.

The only difference is that some operations are restricted:
  1) a user can't create new entries,
  2) a user can't remove existing entries.

The lookup from userspace is o(log(n)).

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
0a4c58f570 bpf: add ability to charge bpf maps memory dynamically
This commits extends existing bpf maps memory charging API
to support dynamic charging/uncharging.

This is required to account memory used by maps,
if all entries are created dynamically after
the map initialization.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:31 +02:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Masami Hiramatsu
6bc6c77cfc tracing/kprobes: Fix within_notrace_func() to check only notrace functions
Fix within_notrace_func() to check only notrace functions and to ignore the
kprobe-event which can not solve symbol addresses.

within_notrace_func() returns true if the given kprobe events probe point
seems to be out-of-range. But that is not the correct place to check for it,
it should be done in kprobes afterwards.

kprobe-events allow users to define a probe point on "currently unloaded
module" so that it can trace the event during module load. In this case, the
user will put a probe on a symbol which is not in kallsyms yet and it hits
the within_notrace_func().  As a result, kprobe-events always refuses if
user defines a probe on a "currenly unloaded module".

Fixes: commit 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Link: http://lkml.kernel.org/r/153319624799.29007.13604430345640129926.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-02 12:34:41 -04:00
zhong jiang
9be936f4b3 kernel/module: Use kmemdup to replace kmalloc+memcpy
we prefer to the kmemdup rather than kmalloc+memcpy. so just
replace them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2018-08-02 18:03:17 +02:00
Peter Zijlstra
b80a2bfce8 stop_machine: Reflow cpu_stop_queue_two_works()
The code flow in cpu_stop_queue_two_works() is a little arcane; fix this by
lifting the preempt_disable() to the top to create more natural nesting wrt
the spinlocks and make the wake_up_q() and preempt_enable() unconditional
at the end.

Furthermore, enable preemption in the -EDEADLK case, such that we spin-wait
with preemption enabled.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: isaacm@codeaurora.org
Cc: matt@codeblueprint.co.uk
Cc: psodagud@codeaurora.org
Cc: gregkh@linuxfoundation.org
Cc: pkondeti@codeaurora.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180730112140.GH2494@hirez.programming.kicks-ass.net
2018-08-02 15:25:20 +02:00
Sudeep Holla
fbfa926008 clockevents: Warn if cpu_all_mask is used as cpumask
Using cpu_all_mask in clockevents cpumask may result in issues while
comparing multiple clockevent devices to choose the preferred one.

On one of the platforms with 2 system (i.e. non per-CPU) timers with
different ratings, having cpu_all_mask for one of the device resulted in a
boot hang due to a endless loop in clockevents_notify_released() as both
were clocksources were selected as preferred.

In order to prevent such issues in the future, warn if any clockevent
driver sets cpu_all_mask as it's cpumask and just override it to use
cpu_possible_mask. All the existing occurrences of cpu_all_mask are already
replaced with cpu_possible_mask.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/1531308264-24220-3-git-send-email-sudeep.holla@arm.com
2018-08-02 14:55:53 +02:00
Sudeep Holla
234b3840d7 tick/broadcast-hrtimer: Use cpu_possible_mask for ce_broadcast_hrtimer
This is the last instance of cpu_all_mask usage in the core framework.

Replace it with cpu_possible_mask like all other instances in the
clockevent drivers. This makes it possible to add a warning in the core
clockevents_register_device on usage of cpu_all_mask from any clockevent
drivers in the future.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/1531308264-24220-2-git-send-email-sudeep.holla@arm.com
2018-08-02 14:55:52 +02:00
Gaurav Kohli
363e934d88 timers: Clear timer_base::must_forward_clk with timer_base::lock held
timer_base::must_forward_clock is indicating that the base clock might be
stale due to a long idle sleep.

The forwarding of the base clock takes place in the timer softirq or when a
timer is enqueued to a base which is idle. If the enqueue of timer to an
idle base happens from a remote CPU, then the following race can happen:

  CPU0					CPU1
  run_timer_softirq			mod_timer

					base = lock_timer_base(timer);
  base->must_forward_clk = false
					if (base->must_forward_clk)
				       	    forward(base); -> skipped

					enqueue_timer(base, timer, idx);
					-> idx is calculated high due to
					   stale base
					unlock_timer_base(timer);
  base = lock_timer_base(timer);
  forward(base);

The root cause is that timer_base::must_forward_clk is cleared outside the
timer_base::lock held region, so the remote queuing CPU observes it as
cleared, but the base clock is still stale. This can cause large
granularity values for timers, i.e. the accuracy of the expiry time
suffers.

Prevent this by clearing the flag with timer_base::lock held, so that the
forwarding takes place before the cleared flag is observable by a remote
CPU.

Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
2018-08-02 12:52:38 +02:00
Ingo Molnar
16e0e6a83b Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-02 09:59:20 +02:00
Gustavo A. R. Silva
44ec3ec01f ftrace: Use true and false for boolean values in ops_references_rec()
Return statements in functions returning bool should use true or false
instead of an integer value.

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20180802010056.GA31012@embeddedor.com

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 21:15:31 -04:00
Steven Rostedt (VMware)
d7224c0e12 ring-buffer: Make ring_buffer_record_is_set_on() return bool
The value of ring_buffer_record_is_set_on() is either true or false, so have
its return value be bool.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 21:09:50 -04:00
Steven Rostedt (VMware)
3ebea280d7 ring-buffer: Make ring_buffer_record_is_on() return bool
The value of ring_buffer_record_is_on() is either true or false, so have its
return value be bool.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 21:08:30 -04:00
Christoph Hellwig
87a4c37599 kconfig: include kernel/Kconfig.preempt from init/Kconfig
Almost all architectures include it.  Add a ARCH_NO_PREEMPT symbol to
disable preempt support for alpha, hexagon, non-coldfire m68k and
user mode Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-08-02 08:06:54 +09:00
Steven Rostedt (VMware)
ec57350883 tracing: Make tracer_tracing_is_on() return bool
There's code that expects tracer_tracing_is_on() to be either true or false,
not some random number. Currently, it should only return one or zero, but
just in case, change its return value to bool, to enforce it.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 16:08:57 -04:00
Steven Rostedt (VMware)
978defee11 tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists
The start function of the hwlat tracer should never be called when the hwlat
thread already exists. If it is called, do a WARN_ON().

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 16:06:02 -04:00
Erica Bugden
82fbc8c48a ftrace: Add missing check for existing hwlat thread
The hwlat tracer uses a kernel thread to measure latencies. The function
that creates this kernel thread, start_kthread(), can be called when the
tracer is initialized and when the tracer is explicitly enabled.
start_kthread() does not check if there is an existing hwlat kernel
thread and will create a new one each time it is called.

This causes the reference to the previous thread to be lost. Without the
thread reference, the old kernel thread becomes unstoppable and
continues to use CPU time even after the hwlat tracer has been disabled.
This problem can be observed when a system is booted with tracing
enabled and the hwlat tracer is configured like this:

	echo hwlat > current_tracer; echo 1 > tracing_on

Add the missing check for an existing kernel thread in start_kthread()
to prevent this problem. This function and the rest of the hwlat kernel
thread setup and teardown are already serialized because they are called
through the tracer core code with trace_type_lock held.

[
 Note, this only fixes the symptom. The real fix was not to call
 this function when tracing_on was already one. But this still makes
 the code more robust, so we'll add it.
]

Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de

Signed-off-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 16:04:24 -04:00
Steven Rostedt (VMware)
f143641bfe tracing: Do not call start/stop() functions when tracing_on does not change
Currently, when one echo's in 1 into tracing_on, the current tracer's
"start()" function is executed, even if tracing_on was already one. This can
lead to strange side effects. One being that if the hwlat tracer is enabled,
and someone does "echo 1 > tracing_on" into tracing_on, the hwlat tracer's
start() function is called again which will recreate another kernel thread,
and make it unable to remove the old one.

Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de

Cc: stable@vger.kernel.org
Fixes: 2df8f8a6a8 ("tracing: Fix regression with irqsoff tracer and tracing_on file")
Reported-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01 16:01:02 -04:00
Josef Bacik
2c323017e3 blk-cgroup: clear the throttle queue on fork
We were hitting a panic in production where we put too many times on the
request queue.  This is because we'd get the throttle_queue of the
parent if we fork()'ed while we needed to be throttled, but we didn't
have a reference on it.  Instead just clear these flags on fork so the
child doesn't pay for the sins of its father.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:04 -06:00
Linus Torvalds
37b71411b7 audit/stable-4.18 PR 20180731
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAltguv8UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQVeRaWujKfIqYOA/9GgMzBJYU+bVCbNSagcq6LluWFoYV
 ObZb9sfsf23wL0YgtKgkWaCefWAAYnWOr6bUvDa+5oMRLVR+bsP+YEkCVK45CJr0
 g44oe4VH9t5inX2F2JSkoVbkUDZIwwOxiTi/L4Emqhv8cT9zc89tcKRjYhqt50d1
 4Gm4++jZcTHQNKkzYUIIpKc0TZmKW5mRNmFaGogWPi72FWrhbDfjKLZZUvd+kIUC
 HSKnv6pKnwbxLPhd9i0p5NchuTM6kRCptGzN07UUzeww6UVvs8t62+DzHUM1o3Ft
 sraIx7BLenGC8OBCgi8aNkE+yseQE4h2OTym3paEkLVJsl/9qcsSyXL1dwO4Z96U
 HFq/TpDZoBieZihHDBk4ry7ox942mE5N51QTDUh+cygEWeNvqGwqpAUbI14J23oh
 3p7w7hgXAtdtuj4pzqUARemHvIR0Xbpn8ritH9cx1s1mDdycyyBDn9mFw3Ehigom
 XIpUrSJtdfJYFj+z6wA4vXssvXe4TITrJTUmPAM1Alk1p+LhRkTA8JxBjHmL3qjR
 mFIxA40t+ON5OtCqTtGsapaoJy2Jj97dPEp5i5Jg49BQclQoTG2rpYuIu/aKrixG
 EZwdezckD3DPQUQdQidru7dS1J/phIaDDvEauq291ERHPfNAxQuMllXHeczzyJkc
 eVRMkj0/E5lihlE=
 =MN3z
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20180731' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single small audit fix to guard against memory allocation failures
  when logging information about a kernel module load.

  It's small, easy to understand, and self-contained; while nothing is
  zero risk, this should be pretty low"

* tag 'audit-pr-20180731' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix potential null dereference 'context->module.name'
2018-07-31 13:17:46 -07:00
Arthur Fabre
fbeb1603bf bpf: verifier: MOV64 don't mark dst reg unbounded
When check_alu_op() handles a BPF_MOV64 between two registers,
it calls check_reg_arg(DST_OP) on the dst register, marking it
as unbounded. If the src and dst register are the same, this
marks the src as unbounded, which can lead to unexpected errors
for further checks that rely on bounds info. For example:

	BPF_MOV64_IMM(BPF_REG_2, 0),
	BPF_MOV64_REG(BPF_REG_2, BPF_REG_2),
	BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),

Results in:

	"math between ctx pointer and register with unbounded
	min value is not allowed"

check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs
that need to mark the dst register (MOVIMM, MOV32) do so.

Added a test case for MOV64 dst == src, and dst != src.

Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31 22:09:33 +02:00
Anna-Maria Gleixner
80d20d35af nohz: Fix local_timer_softirq_pending()
local_timer_softirq_pending() checks whether the timer softirq is
pending with: local_softirq_pending() & TIMER_SOFTIRQ.

This is wrong because TIMER_SOFTIRQ is the softirq number and not a
bitmask. So the test checks for the wrong bit.

Use BIT(TIMER_SOFTIRQ) instead.

Fixes: 5d62c183f9 ("nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: bigeasy@linutronix.de
Cc: peterz@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180731161358.29472-1-anna-maria@linutronix.de
2018-07-31 22:08:44 +02:00
Joel Fernandes (Google)
c3bc8fd637 tracing: Centralize preemptirq tracepoints and unify their usage
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.

Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.

* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
  3 users of the events exist:
  - Lockdep
  - irqsoff and preemptoff tracers
  - irqs and preempt trace events

The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.

In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:

 PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
       |                 |     \         |           |
       \    (selects)    /      \        \ (selects) /
      TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                      \                  /
                       \ (depends on)   /
                     PREEMPTIRQ_TRACEPOINTS

Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite. I verified that all tests cases are
passing.

I also injected issues by not registering lockdep probes onto the
tracepoints and I see failures to confirm that the probes are indeed
working.

This series + lockdep probes not registered (just to inject errors):
[    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
[    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
[    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
[    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
[    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |

With this series + lockdep probes registered, all locking tests pass:

[    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |

Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-31 11:32:27 -04:00
Michael O'Farrell
9d2dcc8fc6 arm64: perf: Add cap_user_time aarch64
It is useful to get the running time of a thread.  Doing so in an
efficient manner can be important for performance of user applications.
Avoiding system calls in `clock_gettime` when handling
CLOCK_THREAD_CPUTIME_ID is important.  Other clocks are handled in the
VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.

CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO since it would have
costs associated with maintaining updated user space accessible time
offsets.  These offsets have to be updated everytime the a thread is
scheduled/descheduled.  However, for programs regularly checking the
running time of a thread, this is a performance improvement.

This patch takes a middle ground, and adds support for cap_user_time an
optional feature of the perf_event API.  This way costs are only
incurred when the perf_event api is enabled.  This is done the same way
as it is in x86.

Ultimately this allows calculating the thread running time in userspace
on aarch64 as follows (adapted from perf_event_open manpage):

u32 seq, time_mult, time_shift;
u64 running, count, time_offset, quot, rem, delta;
struct perf_event_mmap_page *pc;
pc = buf;  // buf is the perf event mmaped page as documented in the API.

if (pc->cap_usr_time) {
    do {
        seq = pc->lock;
        barrier();
        running = pc->time_running;

        count = readCNTVCT_EL0();  // Read ARM hardware clock.
        time_offset = pc->time_offset;
        time_mult   = pc->time_mult;
        time_shift  = pc->time_shift;

        barrier();
    } while (pc->lock != seq);

    quot = (count >> time_shift);
    rem = count & (((u64)1 << time_shift) - 1);
    delta = time_offset + quot * time_mult +
            ((rem * time_mult) >> time_shift);

    running += delta;
    // running now has the current nanosecond level thread time.
}

Summary of changes in the patch:

For aarch64 systems, make arch_perf_update_userpage update the timing
information stored in the perf_event page.  Requiring the following
calculations:
  - Calculate the appropriate time_mult, and time_shift factors to convert
    ticks to nano seconds for the current clock frequency.
  - Adjust the mult and shift factors to avoid shift factors of 32 bits.
    (possibly unnecessary)
  - The time_offset userspace should apply when doing calculations:
    negative the current sched time (now), because time_running and
    time_enabled fields of the perf_event page have just been updated.
Toggle bits to appropriate values:
  - Enable cap_user_time

Signed-off-by: Michael O'Farrell <micpof@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-31 10:14:00 +01:00
Linus Torvalds
f67077deb4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several smallish fixes, I don't think any of this requires another -rc
  but I'll leave that up to you:

   1) Don't leak uninitialzed bytes to userspace in xfrm_user, from Eric
      Dumazet.

   2) Route leak in xfrm_lookup_route(), from Tommi Rantala.

   3) Premature poll() returns in AF_XDP, from Björn Töpel.

   4) devlink leak in netdevsim, from Jakub Kicinski.

   5) Don't BUG_ON in fib_compute_spec_dst, the condition can
      legitimately happen. From Lorenzo Bianconi.

   6) Fix some spectre v1 gadgets in generic socket code, from Jeremy
      Cline.

   7) Don't allow user to bind to out of range multicast groups, from
      Dmitry Safonov with a follow-up by Dmitry Safonov.

   8) Fix metrics leak in fib6_drop_pcpu_from(), from Sabrina Dubroca"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  netlink: Don't shift with UB on nlk->ngroups
  net/ipv6: fix metrics leak
  xen-netfront: wait xenbus state change when load module manually
  can: ems_usb: Fix memory leak on ems_usb_disconnect()
  openvswitch: meter: Fix setting meter id for new entries
  netlink: Do not subscribe to non-existent groups
  NET: stmmac: align DMA stuff to largest cache line length
  tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
  net: socket: Fix potential spectre v1 gadget in sock_is_registered
  net: socket: fix potential spectre v1 gadget in socketcall
  net: mdio-mux: bcm-iproc: fix wrong getter and setter pair
  ipv4: remove BUG_ON() from fib_compute_spec_dst
  enic: handle mtu change for vf properly
  net: lan78xx: fix rx handling before first packet is send
  nfp: flower: fix port metadata conversion bug
  bpf: use GFP_ATOMIC instead of GFP_KERNEL in bpf_parse_prog()
  bpf: fix bpf_skb_load_bytes_relative pkt length check
  perf build: Build error in libbpf missing initialization
  net: ena: Fix use of uninitialized DMA address bits field
  bpf: btf: Use exact btf value_size match in map_check_btf()
  ...
2018-07-30 21:40:37 -07:00
Joel Fernandes (Google)
e6753f23d9 tracepoint: Make rcuidle tracepoint callers use SRCU
In recent tests with IRQ on/off tracepoints, a large performance
overhead ~10% is noticed when running hackbench. This is root caused to
calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that srcu is a better alternative for use during rcu idle.
Although it does involve extra barriers, its lighter than the sched-rcu
version which has to do additional RCU calls to notify RCU idle about
entry into RCU sections.

In this patch, we change the underlying implementation of the
trace_*_rcuidle API to use SRCU. This has shown to improve performance
alot for the high frequency irq enable/disable tracepoints.

Test: Tested idle and preempt/irq tracepoints.

Here are some performance numbers:

With a run of the following 30 times on a single core x86 Qemu instance
with 1GB memory:
hackbench -g 4 -f 2 -l 3000

Completion times in seconds. CONFIG_PROVE_LOCKING=y.

No patches (without this series)
Mean: 3.048
Median: 3.025
Std Dev: 0.064

With Lockdep using irq tracepoints with RCU implementation:
Mean: 3.451   (-11.66 %)
Median: 3.447 (-12.22%)
Std Dev: 0.049

With Lockdep using irq tracepoints with SRCU implementation (this series):
Mean: 3.020   (I would consider the improvement against the "without
	       this series" case as just noise).
Median: 3.013
Std Dev: 0.033

[1] https://patchwork.kernel.org/patch/10344297/

[remove rcu_read_lock_sched_notrace as its the equivalent of
preempt_disable_notrace and is unnecessary to call in tracepoint code]
Link: http://lkml.kernel.org/r/20180730222423.196630-3-joel@joelfernandes.org

Cleaned-up-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ Simplified WARN_ON_ONCE() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-30 19:13:03 -04:00
Joel Fernandes (Google)
01f38497c6 lockdep: Use this_cpu_ptr instead of get_cpu_var stats
get_cpu_var disables preemption which has the potential to call into the
preemption disable trace points causing some complications. There's also
no need to disable preemption in uses of get_lock_stats anyway since
preempt is already disabled. So lets simplify the code.

Link: http://lkml.kernel.org/r/20180730222423.196630-2-joel@joelfernandes.org

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-30 19:06:54 -04:00
Francis Deslauriers
d899926f55 selftest/ftrace: Move kprobe selftest function to separate compile unit
Move selftest function to its own compile unit so it can be compiled
with the ftrace cflags (CC_FLAGS_FTRACE) allowing it to be probed
during the ftrace startup tests.

Link: http://lkml.kernel.org/r/153294604271.32740.16490677128630177030.stgit@devbox

Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-30 18:41:04 -04:00
Masami Hiramatsu
45408c4f92 tracing: kprobes: Prohibit probing on notrace function
Prohibit kprobe-events probing on notrace functions.  Since probing on a
notrace function can cause a recursive event call. In most cases those are just
skipped, but in some cases it falls into an infinite recursive call.

This protection can be disabled by the kconfig
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y, but it is highly recommended to keep it
"n" for normal kernel builds.  Note that this is only available if "kprobes on
ftrace" has been implemented on the target arch and CONFIG_KPROBES_ON_FTRACE=y.

Link: http://lkml.kernel.org/r/153294601436.32740.10557881188933661239.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Francis Deslauriers <francis.deslauriers@efficios.com>
[ Slight grammar and spelling fixes ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-30 18:28:52 -04:00
Yi Wang
b305f7ed0f audit: fix potential null dereference 'context->module.name'
The variable 'context->module.name' may be null pointer when
kmalloc return null, so it's better to check it before using
to avoid null dereference.
Another one more thing this patch does is using kstrdup instead
of (kmalloc + strcpy), and signal a lost record via audit_log_lost.

Cc: stable@vger.kernel.org # 4.11
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-07-30 18:09:37 -04:00
Mukesh Ojha
d018031f56 cpu/hotplug: Clarify CPU hotplug step name for timers
After commit 249d4a9b3246 ("timers: Reinitialize per cpu bases on hotplug")
i.e. the introduction of state CPUHP_TIMERS_PREPARE instead of
CPUHP_TIMERS_DEAD the step name "timers:dead" is not longer accurate.

Rename it to "timers:prepare".

[ tglx: Massaged changelog ]

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gkohli@codeaurora.org
Cc: neeraju@codeaurora.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Link: https://lkml.kernel.org/r/1532443668-26810-1-git-send-email-mojha@codeaurora.org
2018-07-30 21:30:52 +02:00
Linus Torvalds
ae3e10aba5 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes:

   - a deadline scheduler related bug fix which triggered a kernel
     warning

   - an RT_RUNTIME_SHARE fix

   - a stop_machine preemption fix

   - a potential NULL dereference fix in sched_domain_debug_one()"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE
  sched/deadline: Update rq_clock of later_rq when pushing a task
  stop_machine: Disable preemption after queueing stopper threads
  sched/topology: Check variable group before dereferencing it
2018-07-30 12:13:56 -07:00
Linus Torvalds
0634922a78 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes:

   - AMD IBS data corruptor fix (uncovered by UBSAN)

   - an Intel PEBS entry unwind error fix

   - a HW-tracing crash fix

   - a MAINTAINERS update"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix crash when using HW tracing kernel filters
  perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)
  MAINTAINERS: Add Naveen N. Rao as kprobes co-maintainer
  perf/x86/amd/ibs: Don't access non-started event
2018-07-30 11:45:30 -07:00
Linus Torvalds
fb20c03d37 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "A paravirt UP-patching fix, and an I2C MUX driver lockdep warning fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock/x86: Use LOCK_PREFIX in __pv_queued_spin_unlock() assembly code
  i2c/mux, locking/core: Annotate the nested rt_mutex usage
  locking/rtmutex: Allow specifying a subclass for nested locking
2018-07-30 11:37:16 -07:00
Pavel Tatashin
bd9f943e5d sched/clock: Disable interrupts when calling generic_sched_clock_init()
sched_clock_init() used be called early during boot when interrupts were
still disabled. After the recent changes to utilize sched clock early the
sched_clock_init() call happens when interrupts are already enabled, which
triggers the following warning:

WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180 sched_clock_register+0x44/0x278
[<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
[<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
[<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
[<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
[<c0519c18>] (start_kernel) from [<00000000>] (  (null))

Disable IRQs for the duration of generic_sched_clock_init().

Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Link: https://lkml.kernel.org/r/20180730135252.24599-1-pasha.tatashin@oracle.com
2018-07-30 19:33:35 +02:00
Pavel Tatashin
684ad537ab timekeeping: Prevent false warning when persistent clock is not available
On arches with no persistent clock a message like this is printed during
boot:

[    0.000000] Persistent clock returned invalid value

The value is not invalid: Zero means that no persistent clock is available
and the absence of persistent clock should be quietly accepted.

Fixes: 3eca993740 ("timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: sboyd@kernel.org
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/20180725200018.23722-1-pasha.tatashin@oracle.com
2018-07-30 19:32:29 +02:00
Rafael J. Wysocki
5a4c996764 Merge back cpufreq material for 4.19. 2018-07-30 11:27:01 +02:00
Dave Airlie
3fce461827 BackMerge v4.18-rc7 into drm-next
rmk requested this for armada and I think we've had a few
conflicts build up.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2018-07-30 10:39:22 +10:00
David S. Miller
958b4cd8fa Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-07-28

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) API fixes for libbpf's BTF mapping of map key/value types in order
   to make them compatible with iproute2's BPF_ANNOTATE_KV_PAIR()
   markings, from Martin.

2) Fix AF_XDP to not report POLLIN prematurely by using the non-cached
   consumer pointer of the RX queue, from Björn.

3) Fix __xdp_return() to check for NULL pointer after the rhashtable
   lookup that retrieves the allocator object, from Taehee.

4) Fix x86-32 JIT to adjust ebp register in prologue and epilogue
   by 4 bytes which got removed from overall stack usage, from Wang.

5) Fix bpf_skb_load_bytes_relative() length check to use actual
   packet length, from Daniel.

6) Fix uninitialized return code in libbpf bpf_perf_event_read_simple()
   handler, from Thomas.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-28 21:02:21 -07:00
kbuild test robot
518eeca05c tracing: preemptirq_delay_run() can be static
Automatically found by kbuild test robot.

Fixes: ffdc73a3b2ad ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-27 17:58:34 -04:00
Linus Torvalds
864af0d40c Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "11 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kvm, mm: account shadow page tables to kmemcg
  zswap: re-check zswap_is_full() after do zswap_shrink()
  include/linux/eventfd.h: include linux/errno.h
  mm: fix vma_is_anonymous() false-positives
  mm: use vma_init() to initialize VMAs on stack and data segments
  mm: introduce vma_init()
  mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
  ipc/sem.c: prevent queue.status tearing in semop
  mm: disallow mappings that conflict for devm_memremap_pages()
  kasan: only select SLUB_DEBUG with SYSFS=y
  delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
2018-07-27 10:30:47 -07:00
Robin Murphy
f07d141fe9 dma-mapping: Generalise dma_32bit_limit flag
Whilst the notion of an upstream DMA restriction is most commonly seen
in PCI host bridges saddled with a 32-bit native interface, a more
general version of the same issue can exist on complex SoCs where a bus
or point-to-point interconnect link from a device's DMA master interface
to another component along the path to memory (often an IOMMU) may carry
fewer address bits than the interfaces at both ends nominally support.
In order to properly deal with this, the first step is to expand the
dma_32bit_limit flag into an arbitrary mask.

To minimise the impact on existing code, we'll make sure to only
consider this new mask valid if set. That makes sense anyway, since a
mask of zero would represent DMA not being wired up at all, and that
would be better handled by not providing valid ops in the first place.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-27 19:01:04 +02:00
Linus Torvalds
3ebb6fb03d Various fixes to the tracing infrastructure:
- Fix double free when the reg() call fails in event_trigger_callback()
 
  - Fix anomoly of snapshot causing tracing_on flag to change
 
  - Add selftest to test snapshot and tracing_on affecting each other
 
  - Fix setting of tracepoint flag on error that prevents probes from
    being deleted.
 
  - Fix another possible double free that is similar to event_trigger_callback()
 
  - Quiet a gcc warning of a false positive unused variable
 
  - Fix crash of partial exposed task->comm to trace events
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW1pToBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qijEAQCzqQsnlO6YBCYajRBq2wFaM7J6tVnJ
 LxLZlVE8lJlHZQD/YpyGOPq98CB81BfQV7RA/CAVd4RZAhTjldDgGyfL/QI=
 =wU8I
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes to the tracing infrastructure:

   - Fix double free when the reg() call fails in
     event_trigger_callback()

   - Fix anomoly of snapshot causing tracing_on flag to change

   - Add selftest to test snapshot and tracing_on affecting each other

   - Fix setting of tracepoint flag on error that prevents probes from
     being deleted.

   - Fix another possible double free that is similar to
     event_trigger_callback()

   - Quiet a gcc warning of a false positive unused variable

   - Fix crash of partial exposed task->comm to trace events"

* tag 'trace-v4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  kthread, tracing: Don't expose half-written comm when creating kthreads
  tracing: Quiet gcc warning about maybe unused link variable
  tracing: Fix possible double free in event_enable_trigger_func()
  tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
  selftests/ftrace: Add snapshot and tracing_on test case
  ring_buffer: tracing: Inherit the tracing setting to next ring buffer
  tracing: Fix double free of event_trigger_data
2018-07-27 09:50:33 -07:00
Steven Rostedt (VMware)
87107a25a2 tracing/kprobes: Simplify the logic of enable_trace_kprobe()
The function enable_trace_kprobe() performs slightly differently if the file
parameter is passed in as NULL on non-NULL. Instead of checking file twice,
move the code between the two tests into a static inline helper function to
make the code easier to follow.

Link: http://lkml.kernel.org/r/20180725224728.7b1d5db2@vmware.local.home
Link: http://lkml.kernel.org/r/20180726121152.4dd54330@gandalf.local.home

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-27 09:36:20 -04:00
Kirill A. Shutemov
027232da7c mm: introduce vma_init()
Not all VMAs allocated with vm_area_alloc().  Some of them allocated on
stack or in data segment.

The new helper can be use to initialize VMA properly regardless where it
was allocated.

Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Dan Williams
31c5bda3a6 mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
Commit e763848843 ("mm: introduce MEMORY_DEVICE_FS_DAX and
CONFIG_DEV_PAGEMAP_OPS") added two EXPORT_SYMBOL_GPL() symbols, but
these symbols are required by the inlined put_page(), thus accidentally
making put_page() a GPL export only.  This breaks OpenAFS (at least).

Mark them EXPORT_SYMBOL() instead.

Link: http://lkml.kernel.org/r/153128611970.2928.11310692420711601254.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: e763848843 ("mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Joe Gorse <jhgorse@gmail.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Joe Gorse <jhgorse@gmail.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mark Vitale <mvitale@sinenomine.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Dave Jiang
15d36fecd0 mm: disallow mappings that conflict for devm_memremap_pages()
When pmem namespaces created are smaller than section size, this can
cause an issue during removal and gpf was observed:

  general protection fault: 0000 1 SMP PTI
  CPU: 36 PID: 3941 Comm: ndctl Tainted: G W 4.14.28-1.el7uek.x86_64 #2
  task: ffff88acda150000 task.stack: ffffc900233a4000
  RIP: 0010:__put_page+0x56/0x79
  Call Trace:
    devm_memremap_pages_release+0x155/0x23a
    release_nodes+0x21e/0x260
    devres_release_all+0x3c/0x48
    device_release_driver_internal+0x15c/0x207
    device_release_driver+0x12/0x14
    unbind_store+0xba/0xd8
    drv_attr_store+0x27/0x31
    sysfs_kf_write+0x3f/0x46
    kernfs_fop_write+0x10f/0x18b
    __vfs_write+0x3a/0x16d
    vfs_write+0xb2/0x1a1
    SyS_write+0x55/0xb9
    do_syscall_64+0x79/0x1ae
    entry_SYSCALL_64_after_hwframe+0x3d/0x0

Add code to check whether we have a mapping already in the same section
and prevent additional mappings from being created if that is the case.

Link: http://lkml.kernel.org/r/152909478401.50143.312364396244072931.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Martin KaFai Lau
5f300e8004 bpf: btf: Use exact btf value_size match in map_check_btf()
The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
'> map->value_size' to ensure map_seq_show_elem() will not
access things beyond an array element.

Yonghong suggested that using '!=' is a more correct
check.  The 8 bytes round_up on value_size is stored
in array->elem_size.  Hence, using '!=' on map->value_size
is a proper check.

This patch also adds new tests to check the btf array
key type and value type.  Two of these new tests verify
the btf's value_size (the change in this patch).

It also fixes two existing tests that wrongly encoded
a btf's type size (pprint_test) and the value_type_id (in one
of the raw_tests[]).  However, that do not affect these two
BTF verification tests before or after this test changes.
These two tests mainly failed at array creation time after
this patch.

Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Suggested-by: Yonghong Song <yhs@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-27 03:45:49 +02:00
Masami Hiramatsu
1fee4f7752 doc: tracing: Fix a typo of trace_stat
The name of the directory for per-cpu function statistics
is trace_stat, not trace_stats.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-07-26 15:48:10 -06:00
Masami Hiramatsu
72809cbf67 tracing: Remove orphaned function ftrace_nr_registered_ops()
Remove ftrace_nr_registered_ops() because it is no longer used.

ftrace_nr_registered_ops() has been introduced by commit ea701f11da
("ftrace: Add selftest to test function trace recursion protection"), but
its caller has been removed by commit 05cbbf643b ("tracing: Fix selftest
function recursion accounting"). So it is not called anymore.

Link: http://lkml.kernel.org/r/153260907227.12474.5234899025934963683.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 10:58:43 -04:00
Masami Hiramatsu
7b144b6c79 tracing: Remove orphaned function using_ftrace_ops_list_func().
Remove using_ftrace_ops_list_func() since it is no longer used.

Using ftrace_ops_list_func() has been introduced by commit 7eea4fce02
("tracing/stack_trace: Skip 4 instead of 3 when using ftrace_ops_list_func")
as a helper function, but its caller has been removed by commit 72ac426a5b
("tracing: Clean up stack tracing and fix fentry updates").  So it is not
called anymore.

Link: http://lkml.kernel.org/r/153260904427.12474.9952096317439329851.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 10:53:05 -04:00
Steven Rostedt (VMware)
f6b7425cfb tracing: Make unregister_trigger() static
Nothing uses unregister_trigger() outside of trace_events_trigger.c file,
thus it should be static. Not sure why this was ever converted, because
its counter part, register_trigger(), was always static.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 10:50:18 -04:00
Joel Fernandes (Google)
f96e8577da lib: Add module for testing preemptoff/irqsoff latency tracers
Here we introduce a test module for introducing a long preempt or irq
disable delay in the kernel which the preemptoff or irqsoff tracers can
detect. This module is to be used only for test purposes and is default
disabled.

Following is the expected output (only briefly shown) that can be parsed
to verify that the tracers are working correctly. We will use this from
the kselftests in future patches.

For the preemptoff tracer:

echo preemptoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=preempt delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
preempt -1066    2...2    0us@: preemptirq_delay_run <-preemptirq_delay_run
preempt -1066    2...2 500002us : preemptirq_delay_run <-preemptirq_delay_run
preempt -1066    2...2 500004us : tracer_preempt_on <-preemptirq_delay_run
preempt -1066    2...2 500012us : <stack trace>
 => kthread
 => ret_from_fork

For the irqsoff tracer:

echo irqsoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=irq delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
irq dis -1069    1d..1    0us@: preemptirq_delay_run
irq dis -1069    1d..1 500001us : preemptirq_delay_run
irq dis -1069    1d..1 500002us : tracer_hardirqs_on <-preemptirq_delay_run
irq dis -1069    1d..1 500005us : <stack trace>
 => ret_from_fork

Link: http://lkml.kernel.org/r/20180712213611.GA8743@joelaf.mtv.corp.google.com

Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Julia Cartwright <julia@ni.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Glexiner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ Erick is a co-developer of this commit ]
Signed-off-by: Erick Reyes <erickreyes@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 10:50:17 -04:00
Joel Fernandes (Google)
2b27ece6c5 tracing/irqsoff: Split reset into separate functions
Split reset functions into seperate functions in preparation
of future patches that need to do tracer specific reset.

Link: http://lkml.kernel.org/r/20180628182149.226164-4-joel@joelfernandes.org

Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 10:50:17 -04:00
Snild Dolkow
3e536e222f kthread, tracing: Don't expose half-written comm when creating kthreads
There is a window for racing when printing directly to task->comm,
allowing other threads to see a non-terminated string. The vsnprintf
function fills the buffer, counts the truncated chars, then finally
writes the \0 at the end.

	creator                     other
	vsnprintf:
	  fill (not terminated)
	  count the rest            trace_sched_waking(p):
	  ...                         memcpy(comm, p->comm, TASK_COMM_LEN)
	  write \0

The consequences depend on how 'other' uses the string. In our case,
it was copied into the tracing system's saved cmdlines, a buffer of
adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):

	crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
	0xffffffd5b3818640:     "irq/497-pwr_evenkworker/u16:12"

...and a strcpy out of there would cause stack corruption:

	[224761.522292] Kernel panic - not syncing: stack-protector:
	    Kernel stack is corrupted in: ffffff9bf9783c78

	crash-arm64> kbt | grep 'comm\|trace_print_context'
	#6  0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
	      comm (char [16]) =  "irq/497-pwr_even"

	crash-arm64> rd 0xffffffd4d0e17d14 8
	ffffffd4d0e17d14:  2f71726900000000 5f7277702d373934   ....irq/497-pwr_
	ffffffd4d0e17d24:  726f776b6e657665 3a3631752f72656b   evenkworker/u16:
	ffffffd4d0e17d34:  f9780248ff003231 cede60e0ffffff9b   12..H.x......`..
	ffffffd4d0e17d44:  cede60c8ffffffd4 00000fffffffffd4   .....`..........

The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was
likely needed because of this same bug.

Solved by vsnprintf:ing to a local buffer, then using set_task_comm().
This way, there won't be a window where comm is not terminated.

Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com

Cc: stable@vger.kernel.org
Fixes: bc0c38d139 ("ftrace: latency tracer infrastructure")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-26 09:59:33 -04:00
Waiman Long
6f4ceee930 cpu/hotplug: Add a cpus_read_trylock() function
There are use cases where it can be useful to have a cpus_read_trylock()
function to work around circular lock dependency problem involving
the cpu_hotplug_lock.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-07-26 10:37:36 +02:00
Steven Rostedt (VMware)
2519c1bbe3 tracing: Quiet gcc warning about maybe unused link variable
Commit 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on
enable_trace_kprobe() failure") added an if statement that depends on another
if statement that gcc doesn't see will initialize the "link" variable and
gives the warning:

 "warning: 'link' may be used uninitialized in this function"

It is really a false positive, but to quiet the warning, and also to make
sure that it never actually is used uninitialized, initialize the "link"
variable to NULL and add an if (!WARN_ON_ONCE(!link)) where the compiler
thinks it could be used uninitialized.

Cc: stable@vger.kernel.org
Fixes: 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25 22:33:50 -04:00
Steven Rostedt (VMware)
15cc78644d tracing: Fix possible double free in event_enable_trigger_func()
There was a case that triggered a double free in event_trigger_callback()
due to the called reg() function freeing the trigger_data and then it
getting freed again by the error return by the caller. The solution there
was to up the trigger_data ref count.

Code inspection found that event_enable_trigger_func() has the same issue,
but is not as easy to trigger (requires harder to trigger failures). It
needs to be solved slightly different as it needs more to clean up when the
reg() function fails.

Link: http://lkml.kernel.org/r/20180725124008.7008e586@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 7862ad1846 ("tracing: Add 'enable_event' and 'disable_event' event trigger commands")
Reivewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25 21:25:16 -04:00
Artem Savkov
57ea2a34ad tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
If enable_trace_kprobe fails to enable the probe in enable_k(ret)probe
it returns an error, but does not unset the tp flags it set previously.
This results in a probe being considered enabled and failures like being
unable to remove the probe through kprobe_events file since probes_open()
expects every probe to be disabled.

Link: http://lkml.kernel.org/r/20180725102826.8300-1-asavkov@redhat.com
Link: http://lkml.kernel.org/r/20180725142038.4765-1-asavkov@redhat.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 41a7dd420c ("tracing/kprobes: Support ftrace_event_file base multibuffer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25 11:41:08 -04:00
Masami Hiramatsu
73c8d89455 ring_buffer: tracing: Inherit the tracing setting to next ring buffer
Maintain the tracing on/off setting of the ring_buffer when switching
to the trace buffer snapshot.

Taking a snapshot is done by swapping the backup ring buffer
(max_tr_buffer). But since the tracing on/off setting is defined
by the ring buffer, when swapping it, the tracing on/off setting
can also be changed. This causes a strange result like below:

  /sys/kernel/debug/tracing # cat tracing_on
  1
  /sys/kernel/debug/tracing # echo 0 > tracing_on
  /sys/kernel/debug/tracing # cat tracing_on
  0
  /sys/kernel/debug/tracing # echo 1 > snapshot
  /sys/kernel/debug/tracing # cat tracing_on
  1
  /sys/kernel/debug/tracing # echo 1 > snapshot
  /sys/kernel/debug/tracing # cat tracing_on
  0

We don't touch tracing_on, but snapshot changes tracing_on
setting each time. This is an anomaly, because user doesn't know
that each "ring_buffer" stores its own tracing-enable state and
the snapshot is done by swapping ring buffers.

Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
Cc: stable@vger.kernel.org
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Updated commit log and comment in the code ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25 10:29:41 -04:00
Steven Rostedt (VMware)
1863c38725 tracing: Fix double free of event_trigger_data
Running the following:

 # cd /sys/kernel/debug/tracing
 # echo 500000 > buffer_size_kb
[ Or some other number that takes up most of memory ]
 # echo snapshot > events/sched/sched_switch/trigger

Triggers the following bug:

 ------------[ cut here ]------------
 kernel BUG at mm/slub.c:296!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
 CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:kfree+0x16c/0x180
 Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f
 RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246
 RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80
 RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500
 RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be
 R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be
 R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00
 FS:  00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0
 Call Trace:
  event_trigger_callback+0xee/0x1d0
  event_trigger_write+0xfc/0x1a0
  __vfs_write+0x33/0x190
  ? handle_mm_fault+0x115/0x230
  ? _cond_resched+0x16/0x40
  vfs_write+0xb0/0x190
  ksys_write+0x52/0xc0
  do_syscall_64+0x5a/0x160
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7f363e16ab50
 Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24
 RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50
 RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001
 RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700
 R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009
 R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0
 Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper
86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e
 ---[ end trace d301afa879ddfa25 ]---

The cause is because the register_snapshot_trigger() call failed to
allocate the snapshot buffer, and then called unregister_trigger()
which freed the data that was passed to it. Then on return to the
function that called register_snapshot_trigger(), as it sees it
failed to register, it frees the trigger_data again and causes
a double free.

By calling event_trigger_init() on the trigger_data (which only ups
the reference counter for it), and then event_trigger_free() afterward,
the trigger_data would not get freed by the registering trigger function
as it would only up and lower the ref count for it. If the register
trigger function fails, then the event_trigger_free() called after it
will free the trigger data normally.

Link: http://lkml.kernel.org/r/20180724191331.738eb819@gandalf.local.home

Cc: stable@vger.kerne.org
Fixes: 93e31ffbf4 ("tracing: Add 'snapshot' event trigger command")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25 10:29:24 -04:00
Kees Cook
7d63fb3af8 swiotlb: clean up reporting
This removes needless use of '%p', and refactors the printk calls to
use pr_*() helpers instead.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-25 13:33:05 +02:00
Ingo Molnar
93081caaae Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:47:02 +02:00
Mathieu Poirier
7f635ff187 perf/core: Fix crash when using HW tracing kernel filters
In function perf_event_parse_addr_filter(), the path::dentry of each struct
perf_addr_filter is left unassigned (as it should be) when the pattern
being parsed is related to kernel space.  But in function
perf_addr_filter_match() the same dentries are given to d_inode() where
the value is not expected to be NULL, resulting in the following splat:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
  pc : perf_event_mmap+0x2fc/0x5a0
  lr : perf_event_mmap+0x2c8/0x5a0
  Process uname (pid: 2860, stack limit = 0x000000001cbcca37)
  Call trace:
   perf_event_mmap+0x2fc/0x5a0
   mmap_region+0x124/0x570
   do_mmap+0x344/0x4f8
   vm_mmap_pgoff+0xe4/0x110
   vm_mmap+0x2c/0x40
   elf_map+0x60/0x108
   load_elf_binary+0x450/0x12c4
   search_binary_handler+0x90/0x290
   __do_execve_file.isra.13+0x6e4/0x858
   sys_execve+0x3c/0x50
   el0_svc_naked+0x30/0x34

This patch is fixing the problem by introducing a new check in function
perf_addr_filter_match() to see if the filter's dentry is NULL.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: miklos@szeredi.hu
Cc: namhyung@kernel.org
Cc: songliubraving@fb.com
Fixes: 9511bce9fe ("perf/core: Fix bad use of igrab()")
Link: http://lkml.kernel.org/r/1531782831-1186-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:46:22 +02:00
Peter Zijlstra
6cbc304f2f perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)
Vince reported the perf_fuzzer giving various unwinder warnings and
Josh reported:

> Deja vu.  Most of these are related to perf PEBS, similar to the
> following issue:
>
>   b8000586c9 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
>
> This is basically the ORC version of that.  setup_pebs_sample_data() is
> assembling a franken-pt_regs which ORC isn't happy about.  RIP is
> inconsistent with some of the other registers (like RSP and RBP).

And where the previous unwinder only needed BP,SP ORC also requires
IP. But we cannot spoof IP because then the sample will get displaced,
entirely negating the point of PEBS.

So cure the whole thing differently by doing the unwind early; this
does however require a means to communicate we did the unwind early.
We (ab)use an unused sample_type bit for this, which we set on events
that fill out the data->callchain before the normal
perf_prepare_sample().

Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:46:21 +02:00
Srikar Dronamraju
b6a60cf36d sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
numa_migrate_preferred() is called periodically or when task preferred
node changes. Preferred node evaluations happen once per scan sequence.

If the scan completion happens just after the periodic NUMA migration,
then we try to migrate to the preferred node and the preferred node might
change, needing another node migration.

Avoid this by checking for scan sequence completion only when checking
for periodic migration.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25862.6     26158.1     1.14258
1     74357       72725       -2.19482

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     117019      113992      -2.58
1     179095      174947      -2.31

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      449.46      770.77      615.22      101.70
numa01.sh       Sys:      132.72      208.17      170.46       24.96
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
numa02.sh      Real:       60.85       61.79       61.28        0.37
numa02.sh       Sys:       15.34       24.71       21.08        3.61
numa02.sh      User:     5204.41     5249.85     5231.21       17.60
numa03.sh      Real:      785.50      916.97      840.77       44.98
numa03.sh       Sys:      108.08      133.60      119.43        8.82
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
numa04.sh      Real:      429.57      587.37      480.80       57.40
numa04.sh       Sys:      240.61      321.97      290.84       33.58
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
numa05.sh      Real:      392.09      431.25      414.65       13.82
numa05.sh       Sys:      229.41      372.48      297.54       53.14
numa05.sh      User:    33390.86    34697.49    34222.43      556.42

Testcase       Time:         Min         Max         Avg      StdDev 	%Change
numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:08 +02:00
Srikar Dronamraju
f35678b6a1 sched/numa: Use group_weights to identify if migration degrades locality
On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
consolidated to the closest group of nodes. In such a case, relying on
group_fault metric may not always help to consolidate. There can always
be a case where a node closer to the preferred node may have lesser
faults than a node further away from the preferred node. In such a case,
moving to node with more faults might avoid numa consolidation.

Using group_weight would help to consolidate task/memory around the
preferred_node.

While here, to be on the conservative side, don't override migrate thread
degrades locality logic for CPU_NEWLY_IDLE load balancing.

Note: Similar problems exist with should_numa_migrate_memory and will be
dealt separately.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25645.4     25960       1.22
1     72142       73550       1.95

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     110199      120071      8.958
1     176303      176249      -0.03

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      490.04      774.86      596.26       96.46
numa01.sh       Sys:      151.52      242.88      184.82       31.71
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
numa02.sh      Real:       60.14       62.94       60.98        1.00
numa02.sh       Sys:       16.11       30.77       21.20        5.28
numa02.sh      User:     5184.33     5311.09     5228.50       44.24
numa03.sh      Real:      790.95      856.35      826.41       24.11
numa03.sh       Sys:      114.93      118.85      117.05        1.63
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
numa04.sh      Real:      434.37      597.92      504.87       59.70
numa04.sh       Sys:      237.63      397.40      289.74       55.98
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
numa05.sh      Real:      386.77      448.90      417.22       22.79
numa05.sh       Sys:      149.23      379.95      303.04       79.55
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:08 +02:00
Srikar Dronamraju
30619c89b1 sched/numa: Update the scan period without holding the numa_group lock
The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move this update outside the lock.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25355.9     25645.4     1.141
1     72812       72142       -0.92

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:08 +02:00
Srikar Dronamraju
2d4056fafa sched/numa: Remove numa_has_capacity()
task_numa_find_cpu() helps to find the CPU to swap/move the task to.
It's guarded by numa_has_capacity(). However node not having capacity
shouldn't deter a task swapping if it helps NUMA placement.

Further load_too_imbalanced(), which evaluates possibilities of move/swap,
provides similar checks as numa_has_capacity.

Hence remove numa_has_capacity() to enhance possibilities of task
swapping even if load is imbalanced.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25657.9     25804.1     0.569
1     74435       73413       -1.37

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:08 +02:00
Srikar Dronamraju
0ad4e3dfe6 sched/numa: Modify migrate_swap() to accept additional parameters
There are checks in migrate_swap_stop() that check if the task/CPU
combination is as per migrate_swap_arg before migrating.

However atleast one of the two tasks to be swapped by migrate_swap() could
have migrated to a completely different CPU before updating the
migrate_swap_arg. The new CPU where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node.  Instead of achieving node
consolidation, it may end up spreading the load across nodes.

To avoid that pass the CPUs as additional parameters.

While here, place migrate_swap under CONFIG_NUMA_BALANCING.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25377.3     25226.6     -0.59
1     72287       73326       1.437

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:07 +02:00
Srikar Dronamraju
10864a9e22 sched/numa: Remove unused task_capacity from 'struct numa_stats'
The task_capacity field in 'struct numa_stats' is redundant.
Also move nr_running for better packing within the struct.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25308.6     25377.3     0.271
1     72964       72287       -0.92

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:07 +02:00
Srikar Dronamraju
0ee7e74dc0 sched/numa: Skip nodes that are at 'hoplimit'
When comparing two nodes at a distance of 'hoplimit', we should consider
nodes only up to 'hoplimit'. Currently we also consider nodes at 'oplimit'
distance too. Hence two nodes at a distance of 'hoplimit' will have same
groupweight. Fix this by skipping nodes at hoplimit.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25375.3     25308.6     -0.26
1     72617       72964       0.477

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113372      108750      -4.07684
1     177403      183115      3.21979

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      478.45      565.90      515.11       30.87
numa01.sh       Sys:      207.79      271.04      232.94       21.33
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
numa02.sh      Real:       60.00       61.46       60.78        0.49
numa02.sh       Sys:       15.71       25.31       20.69        3.42
numa02.sh      User:     5175.92     5265.86     5235.97       32.82
numa03.sh      Real:      776.42      834.85      806.01       23.22
numa03.sh       Sys:      114.43      128.75      121.65        5.49
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
numa04.sh      Real:      456.93      511.95      482.91       20.88
numa04.sh       Sys:      178.09      460.89      356.86       94.58
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
numa05.sh      Real:      393.98      493.48      436.61       35.59
numa05.sh       Sys:      164.49      329.15      265.87       61.78
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%

While there is a regression with this change, this change is needed from a
correctness perspective. Also it helps consolidation as seen from perf bench
output.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:07 +02:00
Srikar Dronamraju
67d9f6c256 sched/debug: Reverse the order of printing faults
Fix the order in which the private and shared numa faults are getting
printed.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25215.7     25375.3     0.63
1     72107       72617       0.70

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:07 +02:00
Srikar Dronamraju
f03bb6760b sched/numa: Use task faults only if numa_group is not yet set up
When numa_group faults are available, task_numa_placement only uses
numa_group faults to evaluate preferred node. However it still accounts
task faults and even evaluates the preferred node just based on task
faults just to discard it in favour of preferred node chosen on the
basis of numa_group.

Instead use task faults only if numa_group is not set.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25549.6     25215.7     -1.30
1     73190       72107       -1.47

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113437      113372      -0.05
1     196130      177403      -9.54

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      506.35      794.46      599.06      104.26
numa01.sh       Sys:      150.37      223.56      195.99       24.94
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
numa02.sh      Real:       60.33       62.40       61.31        0.90
numa02.sh       Sys:       18.12       31.66       24.28        5.89
numa02.sh      User:     5203.91     5325.32     5260.29       49.98
numa03.sh      Real:      696.47      853.62      745.80       57.28
numa03.sh       Sys:       85.68      123.71       97.89       13.48
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
numa04.sh      Real:      444.05      514.83      497.06       26.85
numa04.sh       Sys:      230.39      375.79      316.23       48.58
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
numa05.sh      Real:      423.09      460.41      439.57       13.92
numa05.sh       Sys:      287.38      480.15      369.37       68.52
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:06 +02:00
Srikar Dronamraju
8cd45eee43 sched/numa: Set preferred_node based on best_cpu
Currently preferred node is set to dst_nid which is the last node in the
iteration whose group weight or task weight is greater than the current
node. However it doesn't guarantee that dst_nid has the numa capacity
to move. It also doesn't guarantee that dst_nid has the best_cpu which
is the CPU/node ideal for node migration.

Lets consider faults on a 4 node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on 3 and 0 is its preferred node but its capacity is full.
Consider nodes 1, 2 and 3 have capacity. Then the task should be
migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
points to the last node whose faults were greater than current node.

Modify to set the preferred node based of best_cpu. Earlier setting
preferred node was skipped if nr_active_nodes is 1. This could result in
the task being moved out of the preferred node to a random node during
regular load balancing.

Also while modifying task_numa_migrate(), use sched_setnuma to set
preferred node. This ensures out numa accounting is correct.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25122.9     25549.6     1.698
1     73850       73190       -0.89

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     105930      113437      7.08676
1     178624      196130      9.80047

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.78      653.81      534.58       83.20
numa01.sh       Sys:      121.93      187.18      145.90       23.47
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
numa02.sh      Real:       60.64       61.63       61.19        0.40
numa02.sh       Sys:       14.72       25.68       19.06        4.03
numa02.sh      User:     5210.95     5266.69     5233.30       20.82
numa03.sh      Real:      746.51      808.24      780.36       23.88
numa03.sh       Sys:       97.26      108.48      105.07        4.28
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
numa04.sh      Real:      465.97      519.27      484.81       19.62
numa04.sh       Sys:      304.43      359.08      334.68       20.64
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
numa05.sh      Real:      411.57      457.20      433.29       16.58
numa05.sh       Sys:      230.05      435.48      339.95       67.58
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:06 +02:00
Srikar Dronamraju
5f95ba7a43 sched/numa: Simplify load_too_imbalanced()
Currently load_too_imbalance() cares about the slope of imbalance.
It doesn't care of the direction of the imbalance.

However this may not work if nodes that are being compared have
dissimilar capacities. Few nodes might have more cores than other nodes
in the system. Also unlike traditional load balance at a NUMA sched
domain, multiple requests to migrate from the same source node to same
destination node may run in parallel. This can cause huge load
imbalance. This is specially true on a larger machines with either large
cores per node or more number of nodes in the system. Hence allow
move/swap only if the imbalance is going to reduce.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25058.2     25122.9     0.25
1     72950       73850       1.23

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      516.14      892.41      739.84      151.32
numa01.sh       Sys:      153.16      192.99      177.70       14.58
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
numa02.sh      Real:       60.91       62.35       61.58        0.63
numa02.sh       Sys:       16.47       26.16       21.20        3.85
numa02.sh      User:     5227.58     5309.61     5265.17       31.04
numa03.sh      Real:      739.07      917.73      795.75       64.45
numa03.sh       Sys:       94.46      136.08      109.48       14.58
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
numa04.sh      Real:      442.61      715.43      530.31       96.12
numa04.sh       Sys:      224.90      348.63      285.61       48.83
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
numa05.sh      Real:      386.13      489.17      434.94       43.59
numa05.sh       Sys:      144.29      438.56      278.80      105.78
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:06 +02:00
Srikar Dronamraju
305c1fac32 sched/numa: Evaluate move once per node
task_numa_compare() helps choose the best CPU to move or swap the
selected task. To achieve this task_numa_compare() is called for every
CPU in the node. Currently it evaluates if the task can be moved/swapped
for each of the CPUs. However the move evaluation is mostly independent
of the CPU. Evaluating the move logic once per node, provides scope for
simplifying task_numa_compare().

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25705.2     25058.2     -2.51
1     74433       72950       -1.99

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     96589.6     105930      9.670
1     181830      178624      -1.76

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:06 +02:00
Yun Wang
3d6c50c27b sched/debug: Show the sum wait time of a task group
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition is between
these groups on CPU resources.

Monitoring the wait time or sched_debug of each process could be
very expensive, and there is no good way to accurately represent the
conflict with these info, we need the wait time on group dimension.

Thus we introduce group's wait_sum to represent the resource conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.

The 'cpu.stat' is modified to show the statistic, like:

   nr_periods 0
   nr_throttled 0
   throttled_time 0
   wait_sum 2035098795584

Now we can monitor the changes of wait_sum to tell how much a
a task group is suffering in the fight of CPU resources.

For example:

   (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%

means the task group paid X percentage of period on waiting
for the CPU.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:05 +02:00
Vincent Guittot
2e62c4743a sched/fair: Remove #ifdefs from scale_rt_capacity()
Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.

But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):

	free *= (max - irq);
	free /= max;

when irq is fixed to 0

Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.

Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:41:05 +02:00
Ingo Molnar
4765096f4f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:29:58 +02:00
Hailong Liu
f3d133ee0a sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE
NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
runtime with a spin-rt-task.

However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
enough rt_runtime at the beginning, rt_runtime can't be restored to
its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.

E.g. on my PC with 4 cores, procedure to reproduce:
1) Make sure  RT_RUNTIME_SHARE is enabled
 cat /sys/kernel/debug/sched_features
  GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
  CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
  LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
  NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
  ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
2) Start a spin-rt-task
 ./loop_rr &
3) set affinity to the last cpu
 taskset -p 8 $pid_of_loop_rr
4) Observe that last cpu have borrowed enough runtime.
 cat /proc/sched_debug | grep rt_runtime
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 900.000000
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 1000.000000
5) Disable RT_RUNTIME_SHARE
 echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
6) Observe that rt_runtime can not been restored
 cat /proc/sched_debug | grep rt_runtime
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 900.000000
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 1000.000000

This patch help to restore rt_runtime after we disable
RT_RUNTIME_SHARE.

Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:29:08 +02:00
Daniel Bristot de Oliveira
840d719604 sched/deadline: Update rq_clock of later_rq when pushing a task
Daniel Casini got this warn while running a DL task here at RetisLab:

  [  461.137582] ------------[ cut here ]------------
  [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
  [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
      [a ton of modules]
  [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
  [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
  [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
  [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
  [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
  [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
  [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
  [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
  [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
  [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
  [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
  [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
  [  461.137680] Call Trace:
  [  461.137684]  push_dl_task.part.46+0x3bc/0x460
  [  461.137686]  task_woken_dl+0x60/0x80
  [  461.137689]  ttwu_do_wakeup+0x4f/0x150
  [  461.137690]  ttwu_do_activate+0x77/0x80
  [  461.137692]  try_to_wake_up+0x1d6/0x4c0
  [  461.137693]  wake_up_q+0x32/0x70
  [  461.137696]  do_futex+0x7e7/0xb50
  [  461.137698]  __x64_sys_futex+0x8b/0x180
  [  461.137701]  do_syscall_64+0x5a/0x110
  [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [  461.137705] RIP: 0033:0x7efe4918ca26
  [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
  [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
  [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
  [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
  [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
  [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
  [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
  [  461.137743] ---[ end trace 187df4cad2bf7649 ]---

This warning happened in the push_dl_task(), because
__add_running_bw()->cpufreq_update_util() is getting the rq_clock of
the later_rq before its update, which takes place at activate_task().
The fix then is to update the rq_clock before calling add_running_bw().

To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to
activate_task().

Reported-by: Daniel Casini <daniel.casini@santannapisa.it>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
Fixes: e0367b1267 sched/deadline: Move CPU frequency selection triggering points
Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:29:08 +02:00
Isaac J. Manjarres
2610e88946 stop_machine: Disable preemption after queueing stopper threads
This commit:

  9fb8d5dc4b ("stop_machine, Disable preemption when waking two stopper threads")

does not fully address the race condition that can occur
as follows:

On one CPU, call it CPU 3, thread 1 invokes
cpu_stop_queue_two_works(2, 3,...), and the execution is such
that thread 1 queues the works for migration/2 and migration/3,
and is preempted after releasing the locks for migration/2 and
migration/3, but before waking the threads.

Then, On CPU 2, a kworker, call it thread 2, is running,
and it invokes cpu_stop_queue_two_works(1, 2,...), such that
thread 2 queues the works for migration/1 and migration/2.
Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
migration/2 and migration/3. This means that when CPU 2
releases the locks for migration/1 and migration/2, but before
it wakes those threads, it can be preempted by migration/2.

If thread 2 is preempted by migration/2, then migration/2 will
execute the first work item successfully, since migration/3
was woken up by CPU 3, but when it goes to execute the second
work item, it disables preemption, calls multi_cpu_stop(),
and thus, CPU 2 will wait forever for migration/1, which should
have been woken up by thread 2. However migration/1 cannot be
woken up by thread 2, since it is a kworker, so it is affine to
CPU 2, but CPU 2 is running migration/2 with preemption
disabled, so thread 2 will never run.

Disable preemption after queueing works for stopper threads
to ensure that the operation of queueing the works and waking
the stopper threads is atomic.

Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: matt@codeblueprint.co.uk
Fixes: 9fb8d5dc4b ("stop_machine, Disable preemption when waking two stopper threads")
Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:25:08 +02:00
Yi Wang
6cd0c583b0 sched/topology: Check variable group before dereferencing it
The 'group' variable in sched_domain_debug_one() is not checked
when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
but it might be NULL (it is checked later in the following while loop)
and may cause NULL pointer dereference.

We need to check it before using to avoid NULL dereference.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:25:07 +02:00
Peter Rosin
62cedf3e60 locking/rtmutex: Allow specifying a subclass for nested locking
Needed for annotating rt_mutex locks.

Tested-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Peter Rosin <peda@axentia.se>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Deepa Dinamani <deepadinamani@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Chang <dpf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wolfram Sang <wsa@the-dreams.de>
Link: http://lkml.kernel.org/r/20180720083914.1950-2-peda@axentia.se
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:22:19 +02:00
David S. Miller
19725496da Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
Linus Torvalds
0723090656 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle stations tied to AP_VLANs properly during mac80211 hw
    reconfig. From Manikanta Pubbisetty.

 2) Fix jump stack depth validation in nf_tables, from Taehee Yoo.

 3) Fix quota handling in aRFS flow expiration of mlx5 driver, from Eran
    Ben Elisha.

 4) Exit path handling fix in powerpc64 BPF JIT, from Daniel Borkmann.

 5) Use ptr_ring_consume_bh() in page pool code, from Tariq Toukan.

 6) Fix cached netdev name leak in nf_tables, from Florian Westphal.

 7) Fix memory leaks on chain rename, also from Florian Westphal.

 8) Several fixes to DCTCP congestion control ACK handling, from Yuchunk
    Cheng.

 9) Missing rcu_read_unlock() in CAIF protocol code, from Yue Haibing.

10) Fix link local address handling with VRF, from David Ahern.

11) Don't clobber 'err' on a successful call to __skb_linearize() in
    skb_segment(). From Eric Dumazet.

12) Fix vxlan fdb notification races, from Roopa Prabhu.

13) Hash UDP fragments consistently, from Paolo Abeni.

14) If TCP receives lots of out of order tiny packets, we do really
    silly stuff. Make the out-of-order queue ending more robust to this
    kind of behavior, from Eric Dumazet.

15) Don't leak netlink dump state in nf_tables, from Florian Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
  net: axienet: Fix double deregister of mdio
  qmi_wwan: fix interface number for DW5821e production firmware
  ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
  bnx2x: Fix invalid memory access in rss hash config path.
  net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
  r8169: restore previous behavior to accept BIOS WoL settings
  cfg80211: never ignore user regulatory hint
  sock: fix sg page frag coalescing in sk_alloc_sg
  netfilter: nf_tables: move dumper state allocation into ->start
  tcp: add tcp_ooo_try_coalesce() helper
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: free batches of packets in tcp_prune_ofo_queue()
  ip: hash fragments consistently
  ipv6: use fib6_info_hold_safe() when necessary
  can: xilinx_can: fix power management handling
  can: xilinx_can: fix incorrect clear of non-processed interrupts
  can: xilinx_can: fix RX overflow interrupt not being enabled
  can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
  ...
2018-07-24 17:31:47 -07:00
Josh Poimboeuf
73d5e2b472 cpu/hotplug: detect SMT disabled by BIOS
If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.

Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case.  Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.

Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Fixes: f048c399e0 ("x86/topology: Provide topology_smt_supported()")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2018-07-24 18:17:40 +02:00
Martin KaFai Lau
6283fa38dc bpf: btf: Ensure the member->offset is in the right order
This patch ensures the member->offset of a struct
is in the correct order (i.e the later member's offset cannot
go backward).

The current "pahole -J" BTF encoder does not generate something
like this.  However, checking this can ensure future encoder
will not violate this.

Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-24 01:20:44 +02:00
Kamalesh Babulal
6e9df95b76 livepatch: Validate module/old func name length
livepatch module author can pass module name/old function name with more
than the defined character limit. With obj->name length greater than
MODULE_NAME_LEN, the livepatch module gets loaded but waits forever on
the module specified by obj->name to be loaded. It also populates a /sys
directory with an untruncated object name.

In the case of funcs->old_name length greater then KSYM_NAME_LEN, it
would not match against any of the symbol table entries. Instead loop
through the symbol table comparing them against a nonexisting function,
which can be avoided.

The same issues apply, to misspelled/incorrect names. At least gatekeep
the modules with over the limit string length, by checking for their
length during livepatch module registration.

Cc: stable@vger.kernel.org
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-07-23 12:12:00 +02:00
Linus Torvalds
48b1db7c7a Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix switched_from_dl() warning
  stop_machine: Disable preemption when waking two stopper threads
2018-07-21 17:21:34 -07:00
Linus Torvalds
490fc05386 mm: make vm_area_alloc() initialize core fields
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 15:24:03 -07:00
Linus Torvalds
95faf6992d mm: make vm_area_dup() actually copy the old vma data
.. and re-initialize th eanon_vma_chain head.

This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 14:48:45 -07:00
Linus Torvalds
3928d4f5ee mm: use helper functions for allocating and freeing vm_area structs
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 13:48:51 -07:00
David S. Miller
eae249b27f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-20

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add sharing of BPF objects within one ASIC: this allows for reuse of
   the same program on multiple ports of a device, and therefore gains
   better code store utilization. On top of that, this now also enables
   sharing of maps between programs attached to different ports of a
   device, from Jakub.

2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
   detections and unused variable exports, also from Jakub.

3) First batch of RCU annotation fixes in prog array handling, i.e.
   there are several __rcu markers which are not correct as well as
   some of the RCU handling, from Roman.

4) Two fixes in BPF sample files related to checking of the prog_cnt
   upper limit from sample loader, from Dan.

5) Minor cleanup in sockmap to remove a set but not used variable,
   from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:58:30 -07:00
Dmitry Torokhov
488dee96bb kernfs: allow creating kernfs objects with arbitrary uid/gid
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:44:35 -07:00
Peter Zijlstra
9407f5a7ee sched/clock: Close a hole in sched_clock_init()
All data required for the 'unstable' sched_clock must be set-up _before_
enabling it -- setting sched_clock_running. This includes the
__gtod_offset but also a recent scd stamp.

Make the gtod-offset update also set the csd stamp -- it requires the
same two clock reads _anyway_. This doesn't hurt in the
sched_clock_tick_stable() case and ensures sched_clock_init() gets
everything set-up before use.

Also switch to unconditional IRQ-disable/enable because the static key
stuff already requires this is not ran with IRQs disabled.

Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180720080911.GM2494@hirez.programming.kicks-ass.net
2018-07-20 11:58:00 +02:00
Martin KaFai Lau
36fc3c8c28 bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h
This patch shrinks the BTF_INT_BITS() mask.  The current
btf_int_check_meta() ensures the nr_bits of an integer
cannot exceed 64.  Hence, it is mostly an uapi cleanup.

The actual btf usage (i.e. seq_show()) is also modified
to use u8 instead of u16.  The verification (e.g. btf_int_check_meta())
path stays as is to deal with invalid BTF situation.

Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-20 10:25:48 +02:00
Thomas Gleixner
e5af5ff34c Merge tag 'fortglx/4.19/time-part2' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull the second set of timekeeping things for 4.19 from John Stultz

  * NTP argument clenaups and constification from Ondrej Mosnacek
  * Fix to avoid RTC injecting sleeptime when suspend fails from
    Mukesh Ojha
  * Broading suspsend-timing to include non-stop clocksources that
    aren't currently used for timekeeping from Baolin Wang
2018-07-20 06:43:05 +02:00
Baolin Wang
39232ed5a1 time: Introduce one suspend clocksource to compensate the suspend time
On some hardware with multiple clocksources, we have coarse grained
clocksources that support the CLOCK_SOURCE_SUSPEND_NONSTOP flag, but
which are less than ideal for timekeeping whereas other clocksources
can be better candidates but halt on suspend.

Currently, the timekeeping core only supports timing suspend using
CLOCK_SOURCE_SUSPEND_NONSTOP clocksources if that clocksource is the
current clocksource for timekeeping.

As a result, some architectures try to implement read_persistent_clock64()
using those non-stop clocksources, but isn't really ideal, which will
introduce more duplicate code. To fix this, provide logic to allow a
registered SUSPEND_NONSTOP clocksource, which isn't the current
clocksource, to be used to calculate the suspend time.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
[jstultz: minor tweaks to merge with previous resume changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 17:08:52 -07:00
Mukesh Ojha
f473e5f467 time: Fix extra sleeptime injection when suspend fails
Currently, there exists a corner case assuming when there is
only one clocksource e.g RTC, and system failed to go to
suspend mode. While resume rtc_resume() injects the sleeptime
as timekeeping_rtc_skipresume() returned 'false' (default value
of sleeptime_injected) due to which we can see mismatch in
timestamps.

This issue can also come in a system where more than one
clocksource are present and very first suspend fails.

Success case:
------------
                                        {sleeptime_injected=false}
rtc_suspend() => timekeeping_suspend() => timekeeping_resume() =>

(sleeptime injected)
 rtc_resume()

Failure case:
------------
         {failure in sleep path} {sleeptime_injected=false}
rtc_suspend()     =>          rtc_resume()

{sleeptime injected again which was not required as the suspend failed}

Fix this by handling the boolean logic properly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 17:08:51 -07:00
Ondrej Mosnacek
985e695074 timekeeping/ntp: Constify some function arguments
Add 'const' to some function arguments and variables to make it easier
to read the code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
[jstultz: Also fixup pre-existing checkpatch warnings for
 prototype arguments with no variable name]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 17:08:05 -07:00
Pavel Tatashin
46457ea464 sched/clock: Use static key for sched_clock_running
sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot, and never changes
again, therefore it is better to make it a static key.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com
2018-07-20 00:02:43 +02:00
Pavel Tatashin
857baa87b6 sched/clock: Enable sched clock early
Allow sched_clock() to be used before schec_clock_init() is called.  This
provides a way to get early boot timestamps on machines with unstable
clocks.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-24-pasha.tatashin@oracle.com
2018-07-20 00:02:43 +02:00
Pavel Tatashin
5d2a4e91a5 sched/clock: Move sched clock initialization and merge with generic clock
sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit to generic_clock_inti() and call it from
sched_clock_init(). Move the call for sched_clock_init() until after
time_init().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-23-pasha.tatashin@oracle.com
2018-07-20 00:02:43 +02:00
Pavel Tatashin
4b1b7f8054 timekeeping: Default boot time offset to local_clock()
read_persistent_wall_and_boot_offset() is called during boot to read
both the persistent clock and also return the offset between the boot time
and the value of persistent clock.

Change the default boot_offset from zero to local_clock() so architectures,
that do not have a dedicated boot_clock but have early sched_clock(), such
as SPARCv9, x86, and possibly more will benefit from this change by getting
a better and more consistent estimate of the boot time without need for an
arch specific implementation.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-17-pasha.tatashin@oracle.com
2018-07-20 00:02:41 +02:00
Pavel Tatashin
3eca993740 timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
If architecture does not support exact boot time, it is challenging to
estimate boot time without having a reference to the current persistent
clock value. Yet, it cannot read the persistent clock time again, because
this may lead to math discrepancies with the caller of read_boot_clock64()
who have read the persistent clock at a different time.

This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

Where wall_time is returned by read_persistent_clock() And boot_offset is
wall_time - boot time, which defaults to 0.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-16-pasha.tatashin@oracle.com
2018-07-20 00:02:40 +02:00
Ondrej Mosnacek
86b2dcd4f0 ntp: Use kstrtos64 for s64 variable
...instead of kstrtol with a dirty cast.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 14:58:37 -07:00
Ondrej Mosnacek
0f9987b63d ntp: Remove redundant arguments
The 'ts' argument of process_adj_status() and process_adjtimex_modes()
is unused and can be safely removed.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 14:58:29 -07:00
Yi Wang
3058758925 timer: Fix coding style
The call to wake_up_nohz_cpu() is incorrectly indented. Remove the surplus TAB.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: zhong.weidong@zte.com.cn
CC: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/1531721337-30284-1-git-send-email-wang.yi59@zte.com.cn
2018-07-19 16:52:40 +02:00
Linus Torvalds
024ddc0ce1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, here goes:

   1) NULL deref in qtnfmac, from Gustavo A. R. Silva.

   2) Kernel oops when fw download fails in rtlwifi, from Ping-Ke Shih.

   3) Lost completion messages in AF_XDP, from Magnus Karlsson.

   4) Correct bogus self-assignment in rhashtable, from Rishabh
      Bhatnagar.

   5) Fix regression in ipv6 route append handling, from David Ahern.

   6) Fix masking in __set_phy_supported(), from Heiner Kallweit.

   7) Missing module owner set in x_tables icmp, from Florian Westphal.

   8) liquidio's timeouts are HZ dependent, fix from Nicholas Mc Guire.

   9) Link setting fixes for sh_eth and ravb, from Vladimir Zapolskiy.

  10) Fix NULL deref when using chains in act_csum, from Davide Caratti.

  11) XDP_REDIRECT needs to check if the interface is up and whether the
      MTU is sufficient. From Toshiaki Makita.

  12) Net diag can do a double free when killing TCP_NEW_SYN_RECV
      connections, from Lorenzo Colitti.

  13) nf_defrag in ipv6 can unnecessarily hold onto dst entries for a
      full minute, delaying device unregister. From Eric Dumazet.

  14) Update MAC entries in the correct order in ixgbe, from Alexander
      Duyck.

  15) Don't leave partial mangles bpf program in jit_subprogs, from
      Daniel Borkmann.

  16) Fix pfmemalloc SKB state propagation, from Stefano Brivio.

  17) Fix ACK handling in DCTCP congestion control, from Yuchung Cheng.

  18) Use after free in tun XDP_TX, from Toshiaki Makita.

  19) Stale ipv6 header pointer in ipv6 gre code, from Prashant Bhole.

  20) Don't reuse remainder of RX page when XDP is set in mlx4, from
      Saeed Mahameed.

  21) Fix window probe handling of TCP rapair sockets, from Stefan
      Baranoff.

  22) Missing socket locking in smc_ioctl(), from Ursula Braun.

  23) IPV6_ILA needs DST_CACHE, from Arnd Bergmann.

  24) Spectre v1 fix in cxgb3, from Gustavo A. R. Silva.

  25) Two spots in ipv6 do a rol32() on a hash value but ignore the
      result. Fixes from Colin Ian King"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (176 commits)
  tcp: identify cryptic messages as TCP seq # bugs
  ptp: fix missing break in switch
  hv_netvsc: Fix napi reschedule while receive completion is busy
  MAINTAINERS: Drop inactive Vitaly Bordug's email
  net: cavium: Add fine-granular dependencies on PCI
  net: qca_spi: Fix log level if probe fails
  net: qca_spi: Make sure the QCA7000 reset is triggered
  net: qca_spi: Avoid packet drop during initial sync
  ipv6: fix useless rol32 call on hash
  ipv6: sr: fix useless rol32 call on hash
  net: sched: Using NULL instead of plain integer
  net: usb: asix: replace mii_nway_restart in resume path
  net: cxgb3_main: fix potential Spectre v1
  lib/rhashtable: consider param->min_size when setting initial table size
  net/smc: reset recv timeout after clc handshake
  net/smc: add error handling for get_user()
  net/smc: optimize consumer cursor updates
  net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL.
  ipv6: ila: select CONFIG_DST_CACHE
  net: usb: rtl8150: demote allmulti message to dev_dbg()
  ...
2018-07-18 19:32:54 -07:00
Ronny Chevalier
baa2a4fdd5 audit: fix use-after-free in audit_add_watch
audit_add_watch stores locally krule->watch without taking a reference
on watch. Then, it calls audit_add_to_parent, and uses the watch stored
locally.

Unfortunately, it is possible that audit_add_to_parent updates
krule->watch.
When it happens, it also drops a reference of watch which
could free the watch.

How to reproduce (with KASAN enabled):

    auditctl -w /etc/passwd -F success=0 -k test_passwd
    auditctl -w /etc/passwd -F success=1 -k test_passwd2

The second call to auditctl triggers the use-after-free, because
audit_to_parent updates krule->watch to use a previous existing watch
and drops the reference to the newly created watch.

To fix the issue, we grab a reference of watch and we release it at the
end of the function.

Signed-off-by: Ronny Chevalier <ronny.chevalier@hp.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-07-18 11:43:36 -04:00
Jakub Kicinski
fd4f227dea bpf: offload: allow program and map sharing per-ASIC
Allow programs and maps to be re-used across different netdevs,
as long as they belong to the same struct bpf_offload_dev.
Update the bpf_offload_prog_map_match() helper for the verifier
and export a new helper for the drivers to use when checking
programs at attachment time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:10:34 +02:00
Jakub Kicinski
602144c224 bpf: offload: keep the offload state per-ASIC
Create a higher-level entity to represent a device/ASIC to allow
programs and maps to be shared between device ports.  The extra
work is required to make sure we don't destroy BPF objects as
soon as the netdev for which they were loaded gets destroyed,
as other ports may still be using them.  When netdev goes away
all of its BPF objects will be moved to other netdevs of the
device, and only destroyed when last netdev is unregistered.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:10:34 +02:00
Jakub Kicinski
9fd7c55591 bpf: offload: aggregate offloads per-device
Currently we have two lists of offloaded objects - programs and maps.
Netdevice unregister notifier scans those lists to orphan objects
associated with device being unregistered.  This puts unnecessary
(even if negligible) burden on all netdev unregister calls in BPF-
-enabled kernel.  The lists of objects may potentially get long
making the linear scan even more problematic.  There haven't been
complaints about this mechanisms so far, but it is suboptimal.

Instead of relying on notifiers, make the few BPF-capable drivers
register explicitly for BPF offloads.  The programs and maps will
now be collected per-device not on a global list, and only scanned
for removal when driver unregisters from BPF offloads.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:10:34 +02:00
Jakub Kicinski
09728266b6 bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match()
A set of new API functions exported for the drivers will soon use
'bpf_offload_dev_' as a prefix.  Rename the bpf_offload_dev_match()
which is internal to the core (used by the verifier) to avoid any
confusion.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:10:34 +02:00
Colin Ian King
c23e014a4b bpf: sockmap: remove redundant pointer sg
Pointer sg is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'sg' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:06:24 +02:00
Roman Gushchin
3960f4fd65 bpf: fix rcu annotations in compute_effective_progs()
The progs local variable in compute_effective_progs() is marked
as __rcu, which is not correct. This is a local pointer, which
is initialized by bpf_prog_array_alloc(), which also now
returns a generic non-rcu pointer.

The real rcu-protected pointer is *array (array is a pointer
to an RCU-protected pointer), so the assignment should be performed
using rcu_assign_pointer().

Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:01:54 +02:00
Roman Gushchin
d29ab6e1fa bpf: bpf_prog_array_alloc() should return a generic non-rcu pointer
Currently the return type of the bpf_prog_array_alloc() is
struct bpf_prog_array __rcu *, which is not quite correct.
Obviously, the returned pointer is a generic pointer, which
is valid for an indefinite amount of time and it's not shared
with anyone else, so there is no sense in marking it as __rcu.

This change eliminate the following sparse warnings:
kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1544:31:    expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1544:31:    got void *
kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1548:17:    expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1548:17:    got struct bpf_prog_array *<noident>
kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1681:15:    expected struct bpf_prog_array *array
kernel/bpf/core.c:1681:15:    got struct bpf_prog_array [noderef] <asn:4>*

Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:01:20 +02:00
Paul Moore
290e44b7dd audit: use ktime_get_coarse_real_ts64() for timestamps
Commit c72051d577 ("audit: use ktime_get_coarse_ts64() for time
access") converted audit's use of current_kernel_time64() to the
new ktime_get_coarse_ts64() function.  Unfortunately this resulted
in incorrect timestamps, e.g. events stamped with the year 1969
despite it being 2018.  This patch corrects this by using
ktime_get_coarse_real_ts64() just like the current_kernel_time64()
wrapper.

Fixes: c72051d577 ("audit: use ktime_get_coarse_ts64() for time access")
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-07-17 14:45:08 -04:00
Linus Torvalds
3c53776e29 Mark HI and TASKLET softirq synchronous
Way back in 4.9, we committed 4cd13c21b2 ("softirq: Let ksoftirqd do
its job"), and ever since we've had small nagging issues with it.  For
example, we've had:

  1ff688209e ("watchdog: core: make sure the watchdog_worker is not deferred")
  8d5755b3f7 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
  217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")

all of which worked around some of the effects of that commit.

The DVB people have also complained that the commit causes excessive USB
URB latencies, which seems to be due to the USB code using tasklets to
schedule USB traffic.  This seems to be an issue mainly when already
living on the edge, but waiting for ksoftirqd to handle it really does
seem to cause excessive latencies.

Now Hanna Hawa reports that this issue isn't just limited to USB URB and
DVB, but also causes timeout problems for the Marvell SoC team:

 "I'm facing kernel panic issue while running raid 5 on sata disks
  connected to Macchiatobin (Marvell community board with Armada-8040
  SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
  and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
  uses a tasklet to clean the done descriptors from the queue"

The latency problem causes a panic:

  mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
  Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction

We've discussed simply just reverting the original commit entirely, and
also much more involved solutions (with per-softirq threads etc).  This
patch is intentionally stupid and fairly limited, because the issue
still remains, and the other solutions either got sidetracked or had
other issues.

We should probably also consider the timer softirqs to be synchronous
and not be delayed to ksoftirqd (since they were the issue with the
earlier watchdog problems), but that should be done as a separate patch.
This does only the tasklet cases.

Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-17 11:12:43 -07:00
Masahiro Yamada
c417fbce98 kbuild: move bin2c back to scripts/ from scripts/basic/
Commit 8370edea81 ("bin2c: move bin2c in scripts/basic") moved bin2c
to the scripts/basic/ directory, incorrectly stating "Kexec wants to
use bin2c and it wants to use it really early in the build process.
See arch/x86/purgatory/ code in later patches."

Commit bdab125c93 ("Revert "kexec/purgatory: Add clean-up for
purgatory directory"") and commit d6605b6bbe ("x86/build: Remove
unnecessary preparation for purgatory") removed the redundant
purgatory build magic entirely.

That means that the move of bin2c was unnecessary in the first place.

fixdep is the only host program that deserves to sit in the
scripts/basic/ directory.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-07-18 01:18:05 +09:00
RAGHU Halharvi
d91cfeb0aa genirq: Remove redundant NULL pointer check in __free_irq()
The NULL pointer check in __free_irq() triggers a 'dereference before NULL
pointer check' warning in static code analysis. It turns out that the check
is redundant because all callers have a NULL pointer check already.

Remove it.

Signed-off-by: RAGHU Halharvi <raghuhack78@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180717102009.7708-1-raghuhack78@gmail.com
2018-07-17 13:35:44 +02:00
Rik van Riel
c1a2f7f0c0 mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the CPU bitmap) and the size the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:35:30 +02:00
Andrea Parri
7696f9910a sched/Documentation: Update wake_up() & co. memory-barrier guarantees
Both the implementation and the users' expectation [1] for the various
wakeup primitives have evolved over time, but the documentation has not
kept up with these changes: brings it into 2018.

[1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net

Also applied feedback from Alan Stern.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: parri.andrea@gmail.com
Link: http://lkml.kernel.org/r/20180716180605.16115-12-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:30:34 +02:00
Andrea Parri
3d85b27037 locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
There are 11 interpretations of the requirements described in the header
comment for smp_mb__after_spinlock(): one for each LKMM maintainer, and
one currently encoded in the Cat file. Stick to the latter (until a more
satisfactory solution is available).

This also reworks some snippets related to the barrier to illustrate the
requirements and to link them to the idioms which are relied upon at its
call sites.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: akiyks@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Link: http://lkml.kernel.org/r/20180716180605.16115-11-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:30:33 +02:00
Andrea Parri
76e079fefc sched/core: Use smp_mb() in wake_woken_function()
wake_woken_function() synchronizes with wait_woken() as follows:

  [wait_woken]                       [wake_woken_function]

  entry->flags &= ~wq_flag_woken;    condition = true;
  smp_mb();                          smp_wmb();
  if (condition)                     wq_entry->flags |= wq_flag_woken;
     break;

This commit replaces the above smp_wmb() with an smp_mb() in order to
guarantee that either wait_woken() sees the wait condition being true
or the store to wq_entry->flags in woken_wake_function() follows the
store in wait_woken() in the coherence order (so that the former can
eventually be observed by wait_woken()).

The commit also fixes a comment associated to set_current_state() in
wait_woken(): the comment pairs the barrier in set_current_state() to
the above smp_wmb(), while the actual pairing involves the barrier in
set_current_state() and the barrier executed by the try_to_wake_up()
in wake_woken_function().

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:30:33 +02:00
Ingo Molnar
52b544bd38 Linux 4.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y
 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi
 f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e
 f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu
 VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK
 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI
 JjzNOkI=
 =ckcO
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:27:43 +02:00
Ingo Molnar
ea73a5c692 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- An optimization and a fix for RCU expedited grace periods, with
  the fix being from Boqun Feng.

- Miscellaneous fixes, including a lockdep-annotation fix from
  Boqun Feng.

- SRCU updates.

- Updates to rcutorture and associated scripting.

- Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt,
  and RCU-sched flavors, replacing the old ->gpnum and ->completed
  pair of fields.  This change allows lockless code to obtain the
  complete grace-period state with a single READ_ONCE(), which is
  needed to maintain tolerable lock contention during the upcoming
  consolidation of the three RCU flavors.  Note that grace-period
  sequence numbers are already used by rcu_barrier(), expedited
  RCU grace periods, and SRCU, and are thus already heavily used
  and well-tested.  Joel Fernandes contributed a number of excellent
  fixes and improvements.

- Clean up some grace-period-reporting loose ends, including
  improving the handling of quiescent states from offline CPUs
  and fixing some false-positive WARN_ON_ONCE() invocations.
  (Strictly speaking, the WARN_ON_ONCE() invocations were quite
  correct, but their invariants were (harmlessly) violated by the
  earlier sloppy handling of quiescent states from offline CPUs.)
  In addition, improve grace-period forward-progress guarantees so
  as to allow removal of fail-safe checks that required otherwise
  needless lock acquisitions.  Finally, add more diagnostics to
  help debug the upcoming consolidation of the RCU-bh, RCU-preempt,
  and RCU-sched flavors.

- Additional miscellaneous fixes, including those contributed by
  Byungchul Park, Mauro Carvalho Chehab, Joe Perches, Joel Fernandes,
  Steven Rostedt, Andrea Parri, and Neil Brown.

- Additional torture-test changes, including several contributed by
  Arnd Bergmann and Joel Fernandes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:16:02 +02:00
Mimi Zohar
c77b8cdf74 module: replace the existing LSM hook in init_module
Both the init_module and finit_module syscalls call either directly
or indirectly the security_kernel_read_file LSM hook.  This patch
replaces the direct call in init_module with a call to the new
security_kernel_load_data hook and makes the corresponding changes
in SELinux, LoadPin, and IMA.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
2018-07-16 12:31:57 -07:00
Mimi Zohar
a210fd32a4 kexec: add call to LSM hook in original kexec_load syscall
In order for LSMs and IMA-appraisal to differentiate between kexec_load
and kexec_file_load syscalls, both the original and new syscalls must
call an LSM hook.  This patch adds a call to security_kernel_load_data()
in the original kexec_load syscall.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
2018-07-16 12:31:57 -07:00
Kamalesh Babulal
1d98a69e5c livepatch: Remove reliable stacktrace check in klp_try_switch_task()
Support for immediate flag was removed by commit d0807da78e
("livepatch: Remove immediate feature").  We bail out during
patch registration for architectures, those don't support
reliable stack trace. Remove the check in klp_try_switch_task(),
as its not required.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-07-16 17:50:33 +02:00
Tobias Tefke
788faab70d perf, tools: Use correct articles in comments
Some of the comments in the perf events code use articles incorrectly,
using 'a' for words beginning with a vowel sound, where 'an' should be
used.

Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com
[ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:21:03 +02:00
Sebastian Andrzej Siewior
af0fffd930 sched/core: Remove get_cpu() from sched_fork()
get_cpu() disables preemption for the entire sched_fork() function.
This get_cpu() was introduced in commit:

  dd41f596cd ("sched: cfs core code")

... which also invoked sched_balance_self() and this function
required preemption do be off.

Today, sched_balance_self() seems to be moved to ->task_fork callback
which is invoked while the ->pi_lock is held.

set_load_weight() could invoke reweight_task() which then via $callchain
might end up in smp_processor_id() but since `update_load' is false
this won't happen.

I didn't find any this_cpu*() or similar usage during the initialisation
of the task_struct.

The `cpu' value (from get_cpu()) is only used later in __set_task_cpu()
while the ->pi_lock lock is held.

Based on this it is possible to remove get_cpu() and use
smp_processor_id() for the `cpu' variable without breaking anything.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Peter Zijlstra
45f5519ec5 sched/cpufreq: Clarify sugov_get_util()
Add a few comments to (hopefully) clarifying some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/20180705123617.GM2458@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
5fd778915a sched/sysctl: Remove unused sched_time_avg_ms sysctl
/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere,
remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
bbb62c0b02 sched/core: Remove the rt_avg code
rt_avg is not used anywhere anymore, so we can remove all related code.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
523e979d31 sched/core: Use PELT for scale_rt_capacity()
The utilization of the CPU by RT, DL and IRQs are now tracked with
PELT so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for CFS class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for CFS instead of a scaling factor because RT, DL and
IRQ provide now absolute utilization value.

The same formula as schedutil is used:

  IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg

but the implementation is different because it doesn't return the same value
and doesn't benefit of the same optimization.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:25 +02:00
Vincent Guittot
dfa444dc2f sched/cpufreq: Remove sugov_aggregate_util()
There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Rebased after adding irq tracking and fixed some compilation errors. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
9033ea1188 cpufreq/schedutil: Take time spent in interrupts into account
The time spent executing IRQ handlers can be significant but it is not reflected
in the utilization of CPU when deciding to choose an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.

RQS utilization don't see the time spend under interrupt context and report
their value in the normal context time window. We need to compensate this when
adding interrupt utilization

The CPU utilization is:

  IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg

A test with iperf on hikey (octo arm64) gives the following speedup:

 iperf -c server_address -r -t 5

 w/o patch		w/ patch
 Tx 276 Mbits/sec	304 Mbits/sec +10%
 Rx 299 Mbits/sec	328 Mbits/sec  +9%

 8 iterations
 stdev is lower than 1%

Only WFI idle state is enabled (shallowest idle state).

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
91c27493e7 sched/irq: Add IRQ utilization tracking
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.

That's also important to note that because:

  rq_clock == rq_clock_task + interrupt time

and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.

The CPU utilization is:

  avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
8cc90515a4 cpufreq/schedutil: Use DL utilization tracking
Now that we have both the DL class bandwidth requirement and the DL class
utilization, we can detect when CPU is fully used so we should run at max.
Otherwise, we keep using the DL bandwidth requirement to define the
utilization of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00