The tool perf is useful for the performance analysis on the Hygon Dhyana
platform. But right now there is no Hygon support for it to analyze the
KVM guest os data. So add Hygon Dhyana support to it by checking vendor
string to share the code path of AMD.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1542008451-31735-1-git-send-email-puwen@hygon.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This program benchmarks concurrent epoll_wait(2) for file descriptors
that are monitored with with EPOLLIN along various semantics, by a
single epoll instance. Such conditions can be found when using
single/combined or multiple queuing when load balancing.
Each thread has a number of private, nonblocking file descriptors,
referred to as fdmap. A writer thread will constantly be writing to the
fdmaps of all threads, minimizing each threads's chances of epoll_wait
not finding any ready read events and blocking as this is not what we
want to stress. Full details in the start of the C file.
Committer testing:
# perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]
# List of all available benchmark collections:
sched: Scheduler and IPC benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
all: All benchmarks
# perf bench epoll
# List of available benchmarks for collection 'epoll':
wait: Benchmark epoll concurrent epoll_waits
all: Run all futex benchmarks
# perf bench epoll wait
# Running 'epoll/wait' benchmark:
Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs.
[thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ]
[thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ]
[thread 2] fdmap: 0xdaabb0 ... 0xdaacac [ 381423 ops/sec ]
Averaged 353786 operations/sec (+- 4.35%), total secs = 8
#
Committer notes:
Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel
and others:
CC /tmp/build/perf/bench/epoll-wait.o
bench/epoll-wait.c: In function 'writerfn':
bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
printinfo("exiting writer-thread (total full-loops: %ld)\n", iter);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~
bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo'
do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0)
^~~
cc1: all warnings being treated as errors
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com> <jbaron@akamai.com>
Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net
Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5>
[ Applied above fixup as per Davidlohr's request ]
[ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ]
[ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
A new 'perf bench epoll' will use this, and to disable it for older
systems, add a feature test for this API.
This is just a simple program that if successfully compiled, means that
the feature is present, at least at the library level, in a build that
sets the output directory to /tmp/build/perf (using O=/tmp/build/perf),
we end up with:
$ ls -la /tmp/build/perf/feature/test-eventfd*
-rwxrwxr-x. 1 acme acme 8176 Nov 21 15:58 /tmp/build/perf/feature/test-eventfd.bin
-rw-rw-r--. 1 acme acme 588 Nov 21 15:58 /tmp/build/perf/feature/test-eventfd.d
-rw-rw-r--. 1 acme acme 0 Nov 21 15:58 /tmp/build/perf/feature/test-eventfd.make.output
$ ldd /tmp/build/perf/feature/test-eventfd.bin
linux-vdso.so.1 (0x00007fff3bf3f000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa984061000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa984417000)
$ grep eventfd -A 2 -B 2 /tmp/build/perf/FEATURE-DUMP
feature-dwarf=1
feature-dwarf_getlocations=1
feature-eventfd=1
feature-fortify-source=1
feature-sync-compare-and-swap=1
$
The main thing here is that in the end we'll have -DHAVE_EVENTFD in
CFLAGS, and then the 'perf bench' entry needing that API can be
selectively pruned.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-wkeldwob7dpx6jvtuzl8164k@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Both futex and epoll need this call, and can cause build failure on
systems that don't have it pthread_attr_setaffinity_np().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Link: http://lkml.kernel.org/r/20181109210719.pr7ohayuwqmfp2wl@linux-r8p5
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
While working on augmented syscalls I got into this error:
# trace -vv --filter-pids 2469,1663 -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
<SNIP>
libbpf: map 0 is "__augmented_syscalls__"
libbpf: map 1 is "__bpf_stdout__"
libbpf: map 2 is "pids_filtered"
libbpf: map 3 is "syscalls"
libbpf: collecting relocating info for: '.text'
libbpf: relo for 13 value 84 name 133
libbpf: relocation: insn_idx=3
libbpf: relocation: find map 3 (pids_filtered) for insn 3
libbpf: collecting relocating info for: 'raw_syscalls:sys_enter'
libbpf: relo for 8 value 0 name 0
libbpf: relocation: insn_idx=1
libbpf: relo for 8 value 0 name 0
libbpf: relocation: insn_idx=3
libbpf: relo for 9 value 28 name 178
libbpf: relocation: insn_idx=36
libbpf: relocation: find map 1 (__augmented_syscalls__) for insn 36
libbpf: collecting relocating info for: 'raw_syscalls:sys_exit'
libbpf: relo for 8 value 0 name 0
libbpf: relocation: insn_idx=0
libbpf: relo for 8 value 0 name 0
libbpf: relocation: insn_idx=2
bpf: config program 'raw_syscalls:sys_enter'
bpf: config program 'raw_syscalls:sys_exit'
libbpf: create map __bpf_stdout__: fd=3
libbpf: create map __augmented_syscalls__: fd=4
libbpf: create map syscalls: fd=5
libbpf: create map pids_filtered: fd=6
libbpf: added 13 insn from .text to prog raw_syscalls:sys_enter
libbpf: added 13 insn from .text to prog raw_syscalls:sys_exit
libbpf: load bpf program failed: Operation not permitted
libbpf: failed to load program 'raw_syscalls:sys_exit'
libbpf: failed to load object 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
bpf: load objects failed: err=-4009: (Incorrect kernel version)
event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
\___ Failed to load program for unknown reason
(add -v to see detail)
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
If I then try to use strace (perf trace'ing 'perf trace' needs some more work
before its possible) to get a bit more info I get:
# strace -e bpf trace --filter-pids 2469,1663 -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=4, map_flags=0, inner_map_fd=0, map_name="__bpf_stdout__", map_ifindex=0}, 72) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=4, map_flags=0, inner_map_fd=0, map_name="__augmented_sys", map_ifindex=0}, 72) = 4
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=1, max_entries=500, map_flags=0, inner_map_fd=0, map_name="syscalls", map_ifindex=0}, 72) = 5
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=1, max_entries=512, map_flags=0, inner_map_fd=0, map_name="pids_filtered", map_ifindex=0}, 72) = 6
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_TRACEPOINT, insn_cnt=57, insns=0x1223f50, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(4, 18, 10), prog_flags=0, prog_name="sys_enter", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = 7
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_TRACEPOINT, insn_cnt=18, insns=0x1224120, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(4, 18, 10), prog_flags=0, prog_name="sys_exit", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = -1 EPERM (Operation not permitted)
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_TRACEPOINT, insn_cnt=18, insns=0x1224120, license="GPL", log_level=1, log_size=262144, log_buf="", kern_version=KERNEL_VERSION(4, 18, 10), prog_flags=0, prog_name="sys_exit", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = -1 EPERM (Operation not permitted)
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x1224120, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(4, 18, 10), prog_flags=0, prog_name="sys_exit", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = -1 EPERM (Operation not permitted)
event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
\___ Failed to load program for unknown reason
<SNIP similar output as without 'strace'>
#
I managed to create the maps, etc, but then installing the "sys_exit" hook into
the "raw_syscalls:sys_exit" tracepoint somehow gets -EPERMed...
I then go and try reducing the size of this new table:
+++ b/tools/perf/examples/bpf/augmented_raw_syscalls.c
@@ -47,6 +47,17 @@ struct augmented_filename {
#define SYS_OPEN 2
#define SYS_OPENAT 257
+struct syscall {
+ bool filtered;
+};
+
+struct bpf_map SEC("maps") syscalls = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(struct syscall),
+ .max_entries = 500,
+};
And after reducing that .max_entries a tad, it works. So yeah, the "unknown
reason" should be related to the number of bytes all this is taking, reduce the
default for pid_map()s so that we can have a "syscalls" map with enough slots
for all syscalls in most arches. And take notes about this error message,
improve it :-)
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lkml.kernel.org/n/tip-yjzhak8asumz9e9hts2dgplp@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now that we have the "filtered_pids" logic in place, no need to do this
rough filter to avoid the feedback loop from 'perf trace's own syscalls,
revert it.
This reverts commit 7ed71f124284359676b6496ae7db724fee9da753.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-88vh02cnkam0vv5f9vp02o3h@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This makes the augmented_syscalls support the --filter-pids and
auto-filtered feedback loop pids just like when working without BPF,
i.e. with just raw_syscalls:sys_{enter,exit} and tracepoint filters.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-zc5n453sxxm0tz1zfwwelyti@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Lookup for the first map named "filtered_pids" and, if augmenting
syscalls, i.e. if a BPF event is present and the
"__augmented_syscalls__" is present, then fill in that map with the pids
to filter, be it feedback loop ones (perf trace's pid, its father if it
is "sshd", more auto-filtered in the future) or the ones explicitely
stated in the tool command line via --filter-pids.
The code to actually fill in the map comes next.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-rhzytmw7qpe6lqyjxi1ded9t@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
As we'll need that name for a new function to set filters for both
tracepoints and BPF maps for filtering pids.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-mdkck6hf3fnd21rz2766280q@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To better reflect that this is a tracepoint filter, as opposed, for
instance to map based BPF filters.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-9138svli6ddcphrr3ymy9oy3@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Just to test filtering a bunch of pids, now its time to go and get that
hooked up in 'perf trace', right after we load the bpf program, if we
find a "pids_filtered" map defined, we'll populate it with the filtered
pids.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-1i9s27wqqdhafk3fappow84x@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When testing system wide tracing without filtering the syscalls called
by 'perf trace' itself we get into a feedback loop, drop for now those
two syscalls, that are the ones that 'perf trace' does in its loop for
writing the syscalls it intercepts, to help with testing till we get
that filtering in place.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-rkbu536af66dbsfx51sr8yof@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Will be used in the augmented_raw_syscalls.c to implement 'perf trace
--filter-pids'.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-9sybmz4vchlbpqwx2am13h9e@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Starting with a helper for a basic pid_map(), a hash using a pid as a
key.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-gdwvq53wltvq6b3g5tdmh0cw@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Leftover from when we started augmented_raw_syscalls.c from
tools/perf/examples/bpf/augmented_syscalls.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: e58a0322dbac ("perf examples bpf: Start augmenting raw_syscalls:sys_{start,exit}")
Link: https://lkml.kernel.org/n/tip-pmts9ls2skh8n3zisb4txudd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Just to show where we'll hook pid based filters, and what we use to
obtain the current pid, using a BPF getpid() equivalent.
Now we need to remove that hardcoded PID with a BPF hash map, so that we
start by filtering 'perf trace's own PID, implement the --filter-pid
functionality, etc.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-oshrcgcekiyhd0whwisxfvtv@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Start with a getpid() function wrapping BPF_FUNC_get_current_pid_tgid,
idea is to mimic the system headers.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-zo8hv22onidep7tm785dzxfk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- Update kernel ABI headers, one of them lead to a small change in
the ioctl 'cmd' beautifier in 'perf trace' to support the new ISO7816
commands. (Arnaldo Carvalho de Melo)
- Restore proper cwd on return from mnt namespace (Jiri Olsa)
- Add feature check for the get_current_dir_name() function used in the
namespace fix from Jiri, that is not available in systems such as
Alpine Linux, which uses the musl libc (Arnaldo Carvalho de Melo)
- Fix crash in 'perf record' when synthesizing the unit for events such
as 'cpu-clock' (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCW/Va0AAKCRCyPKLppCJ+
J0g4AP9dJ2c0YLzJjSzq7nr1L69sYITdYrACF39h/A9yiD2OTgEA5ISY12u1nkfL
uLNbCLwdbEQ+6oUdUlID0MyFjynjywc=
=eLir
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo-4.20-20181121' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes:
- Update kernel ABI headers, one of them lead to a small change in
the ioctl 'cmd' beautifier in 'perf trace' to support the new ISO7816
commands. (Arnaldo Carvalho de Melo)
- Restore proper cwd on return from mnt namespace (Jiri Olsa)
- Add feature check for the get_current_dir_name() function used in the
namespace fix from Jiri, that is not available in systems such as
Alpine Linux, which uses the musl libc (Arnaldo Carvalho de Melo)
- Fix crash in 'perf record' when synthesizing the unit for events such
as 'cpu-clock' (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kyle Huey reported that 'rr', a replay debugger, broke due to the following commit:
af3bdb991a ("perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler")
Rework the 'disable_counter_freezing' __setup() parameter such that we
can explicitly enable/disable it and switch to default disabled.
To this purpose, rename the parameter to "perf_v4_pmi=" which is a much
better description and allows requiring a bool argument.
[ mingo: Improved the changelog some more. ]
Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert O'Callahan <robert@ocallahan.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/20181120170842.GZ2131@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduced in:
ad8c0eaa0a ("tty/serial_core: add ISO7816 infrastructure")
Now 'perf trace' will be able to pretty-print the 'cmd' ioctl arg when
used in capable systems with software emitting those commands.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-7bds48dhckfnleie08mit314@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes in:
ad8c0eaa0a ("tty/serial_core: add ISO7816 infrastructure")
That is a change that imply a change to be made in tools/perf/trace/beauty/ioctl.c to
make 'perf trace' ioctl syscall argument beautifier to support these new
commands: TIOCGISO7816 and TIOCSISO7816.
This is not yet done automatically by a script like is done for some
other headers, for instance:
$ tools/perf/trace/beauty/drm_ioctl.sh | head
#ifndef DRM_COMMAND_BASE
#define DRM_COMMAND_BASE 0x40
#endif
static const char *drm_ioctl_cmds[] = {
[0x00] = "VERSION",
[0x01] = "GET_UNIQUE",
[0x02] = "GET_MAGIC",
[0x03] = "IRQ_BUSID",
[0x04] = "GET_MAP",
[0x05] = "GET_CLIENT",
$
So we will need to change tools/perf/trace/beauty/ioctl.c in a follow up
patch until we switch to a generator script.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-zin76fe6iykqsilvo6u47f9o@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To get the changes in the following csets:
ace6485a03 ("x86/cpufeatures: Enumerate MOVDIR64B instruction")
33823f4d63 ("x86/cpufeatures: Enumerate MOVDIRI instruction")
No tools were affected, copy it to silence this perf tool build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lkml.kernel.org/n/tip-83kcyqa1qkxkhm1s7q3hbpel@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes in:
900ccf30f9 ("drm/i915: Only force GGTT coherency w/a on required chipsets")
No changes are required in tools/ nor does anything gets automatically
generated to be used in the 'perf trace' syscall arg beautifiers.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-t2vor2wegv41gt5n49095kly@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When reporting on 'record' server we try to retrieve/use the mnt
namespace of the profiled tasks. We use following API with cookie to
hold the return namespace, roughly:
nsinfo__mountns_enter(struct nsinfo *nsi, struct nscookie *nc)
setns(newns, 0);
...
new ns related open..
...
nsinfo__mountns_exit(struct nscookie *nc)
setns(nc->oldns)
Once finished we setns to old namespace, which also sets the current
working directory (cwd) to "/", trashing the cwd we had.
This is mostly fine, because we use absolute paths almost everywhere,
but it screws up 'perf diff':
# perf diff
failed to open perf.data: No such file or directory (try 'perf record' first)
...
Adding the current working directory to be part of the cookie and
restoring it in the nsinfo__mountns_exit call.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 843ff37bb5 ("perf symbols: Find symbols in different mount namespace")
Link: http://lkml.kernel.org/r/20181101170001.30019-1-jolsa@kernel.org
[ No need to check for NULL args for free(), use zfree() for struct members ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
As the namespace support code will use this, which is not available in
some non _GNU_SOURCE libraries such as Android's bionic used in my
container build tests (r12b and r15c at the moment).
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-x56ypm940pwclwu45d7jfj47@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error injection
and platform-BIOS support enabled this corner case to be exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions triggers a
lockdep report. Revert and try again at the next merge window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb8MdUAAoJEB7SkWpmfYgCifYP/A+OQ19HybqcY2nfvqXUdQum
Q5x3qcKmGmEbKbnCUMOHZJpEjW4c/Cpm6OKhuFDQJ4tijn1XG3/ATSi7PXrZxs/o
CRK8MIg5Wz/mMvYRvypIkCHHHr9+Y1NjqmQynM4LLzNG24GMXaeHHuZUTnrZCDmu
0+jBTylNgVYdykoIxgHDYDB+cd6w4NtAP5OD9D46pdsmzX9ac+OQyZMyNB3glUhd
/ZFAoywVNfvfJVWEci9RoHiKttWxgVoCuNbSlCs2Y6ymepA44ApR9AgLHtaC9pFO
DrPkfCzPSmf4PVSxLJd79+/sw9YOcBD7LZ5IxzozxRMuRn5pIofdZIsBg9PlwT5B
NL9jQK87XPiG0vNxhJu3wzP+FlyCXxGxkWfApp7w4rlWBV7RgugOZHyH051rdKzQ
44JAPzLLCfA5Mj4o2tIbSx42f2JNX93XDEX8fkUB+qs3GzyOcMtlcmz9UjmnrT0R
o9KHKhDn81Vivxh33Ts2G0iHktO83XSUBDWApSd6erjEUXMsCLY0D8y+nDGTOMUh
kVcY8q93sgZGLVbcxt0eGc8Q7osZYawQGRGucflTETFcxNwMyLL4F9lWgPirGeYF
i1JDWeTrhcImYufNj8o78LsbT5xh6YjbZZ8Q1obIgPXpDtxHNIXO6COId49Zp2cK
obftWyVp+7kYe79NWzmD
=sfNx
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.
The overflow continuation fix addresses something that has been broken
for several releases. Arguably it could wait even longer, but it's a
one line fix and this finishes the last of the known address range
scrub bug reports. The revert addresses a lockdep regression. The unit
tests are not critical to fix, but no reason to hold this fix back.
Summary:
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error
injection and platform-BIOS support enabled this corner case to be
exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions
triggers a lockdep report. Revert and try again at the next merge
window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)"
* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
Revert "acpi, nfit: Further restrict userspace ARS start requests"
acpi, nfit: Fix ARS overflow continuation
tools/testing/nvdimm: Fix the array size for dimm devices.
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
mm, page_alloc: check for max order in hot path
scripts/spdxcheck.py: make python3 compliant
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
mm/vmstat.c: fix NUMA statistics updates
mm/gup.c: fix follow_page_mask() kerneldoc comment
ocfs2: free up write context when direct IO failed
scripts/faddr2line: fix location of start_kernel in comment
mm: don't reclaim inodes with many attached pages
mm, memory_hotplug: check zone_movable in has_unmovable_pages
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
MAINTAINERS: update OMAP MMC entry
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
kernel/sched/psi.c: simplify cgroup_move_task()
z3fold: fix possible reclaim races
Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
caused by incorrectly calculating load and migrating tasks on exec()
when they shouldn't be"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix cpu_util_wake() for 'execl' type workloads
Pull perf fixes from Ingo Molnar:
"Fix uncore PMU enumeration for CofeeLake CPUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
Pull EFI fixes from Ingo Molnar:
"Misc fixes: two warning splat fixes, a leak fix and persistent memory
allocation fixes for ARM"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Permit calling efi_mem_reserve_persistent() from atomic context
efi/arm: Defer persistent reservations until after paging_init()
efi/arm/libstub: Pack FDT after populating it
efi/arm: Revert deferred unmap of early memmap mapping
efi: Fix debugobjects warning on 'efi_rts_work'
Pull ARM spectre updates from Russell King:
"These are the currently known final bits that resolve the Spectre
issues. big.Little systems used to be sufficiently identical in that
there were no differences between individual CPUs in the system that
mattered to the kernel. With the advent of the Spectre problem, the
CPUs now have differences in how the workaround is applied.
As a result of previous Spectre patches, these systems ended up
reporting quite a lot of:
"CPUx: Spectre v2: incorrect context switching function, system vulnerable"
messages due to the action of the big.Little switcher causing the CPUs
to be re-initialised regularly. This series resolves that issue by
making the CPU vtable unique to each CPU.
However, since this is used very early, before per-cpu is setup,
per-cpu can't be used. We also have a problem that two of the methods
are not called from preempt-safe paths, but thankfully these remain
identical between all CPUs in the system. To make sure, we validate
that these are identical during boot"
* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: spectre-v2: per-CPU vtables to work around big.Little systems
ARM: add PROC_VTABLE and PROC_TABLE macros
ARM: clean up per-processor check_bugs method call
ARM: split out processor lookup
ARM: make lookup_processor_type() non-__init
Konstantin has noticed that kvmalloc might trigger the following
warning:
WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report
UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.
Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without this change the following happens when using Python3 (3.6.6):
$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 253, in <module>
parser.parse_lines(sys.stdin, args.maxlines, '-')
File "scripts/spdxcheck.py", line 171, in parse_lines
line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'
So as the line is already a string, there is no need to decode it and
the line can be dropped.
/usr/bin/python on Arch is Python 3. So this would indeed be worth
going into 4.19.
Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.
man 2 lseek says
: EINVAL whence is not valid. Or: the resulting file offset would be
: negative, or beyond the end of a seekable device.
:
: ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
: the end of the file.
Make tmpfs return ENXIO under these circumstances as well. After this,
tmpfs also passes xfstests's generic/448.
[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-8 complains about the prototype for this function:
lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]
This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210
Workaround this by removing the noreturn attribute.
[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897c ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df06b37ffe ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.
Update the description to make the code and comments agree again.
While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.
Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state shows
has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).
Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit 15c30bc090 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.
Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc090 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a2468cc9bf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jarkko's e-mail address hasn't worked for a long time. We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so change myself as a maintainer
with "Odd Fixes" status.
Link: http://lkml.kernel.org/r/20181106222750.12939-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This bug has been experienced several times by the Oracle DB team. The
BUG is in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and
alignment such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork
process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
tables. Copying page tables for hugetlb mappings is done in the
routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the
above test fails. The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page. This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>