When I added the missing memory outputs, I failed to update the
index of the first argument (ebx) on 32-bit builds, which broke the
fallbacks. Somehow I must have screwed up my testing or gotten
lucky.
Add another test to cover gettimeofday() as well.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 715bd9d12f ("x86/vdso: Fix asm constraints on vDSO syscall fallbacks")
Link: http://lkml.kernel.org/r/21bd45ab04b6d838278fa5bebfa9163eceffa13c.1538608971.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In resctrl filesystem, mount options exist to enable L3/L2 CDP and MBA
Software Controller features if the platform supports them:
mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
But currently only "cdp" option is displayed in /proc/mounts. "cdpl2" and
"mba_MBps" options are not shown even when they are active.
Before:
# mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl
# grep resctrl /proc/mounts
/sys/fs/resctrl /sys/fs/resctrl resctrl rw,relatime,cdp 0 0
After:
# mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl
# grep resctrl /proc/mounts
/sys/fs/resctrl /sys/fs/resctrl resctrl rw,relatime,cdp,mba_MBps 0 0
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1536796118-60135-1-git-send-email-fenghua.yu@intel.com
Switch to bitmap_zalloc() to show clearly what is allocated. Besides that
it returns a pointer of bitmap type instead of opaque void *.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180830115039.63430-1-andriy.shevchenko@linux.intel.com
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members. The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update the in_suspend variable to reflect the actual hibernation
status. Back-port from 64bit system.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Set up the temporary text mapping for the final jump address
so that the system could jump to the right address after all
the pages have been copied back to their original address -
otherwise the final mapping for the jump address is invalid.
Analogous changes were made for 64-bit in commit 65c0554b73
(x86/power/64: Fix kernel text mapping corruption during image
restoration).
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On 64bit system, code should be executed in a safe page
during page restoring, as the page where instruction is
running during resume might be scribbled and causes issues.
Although on 32 bit, we only suspend resuming by same kernel
that did the suspend, we'd like to remove that restriction
in the future.
Porting corresponding code from
64bit system: Allocate a safe page, and copy the restore
code to it, then jump to the safe page to run the code.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After all the pages are restored to previous address, the page
table switches back to current swapper_pg_dir. However the
swapper_pg_dir currently in used might not be consistent with
previous page table, which might cause issue after resume.
Fix this issue by switching to original page table after resume,
and the address of the original page table is saved in the hibernation
image header.
Move the manipulation of restore_cr3 into common code blocks.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Convert the hard code into PAGE_SIZE for better scalability.
No functional change.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is to reuse the temp_pgt for both 32bit and 64bit
system.
No intentional behavior change.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As 32bit system is not using 4-level page, rename it
to temp_pgt so that it can be reused for both 32bit
and 64bit hibernation.
No functional change.
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Enable CONFIG_ARCH_HIBERNATION_HEADER for 32bit system so that
1. arch_hibernation_header_save/restore() are invoked across
hibernation on 32bit system.
2. The checksum handling as well as 'magic' number checking
for 32bit system are enabled.
Controlled by CONFIG_X86_64 in hibernate.c
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reduce the hibernation code duplication between x86-32 and x86-64
by extracting the common code into hibernate.c.
Currently only pfn_is_nosave() is the activated common
function in hibernate.c
No functional change.
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
swsusp_arch_suspend() is callable non-leaf function which doesn't
honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
Also it's not annotated as ELF callable function which can confuse tooling.
Create a stack frame for it when CONFIG_FRAME_POINTER is enabled and
give it proper ELF function annotation.
Also in this patch introduces the restore_registers() symbol and
gives it ELF function annotation, thus to prepare for later register
restore.
Analogous changes were made for 64bit before in commit ef0f3ed5a4
(x86/asm/power: Create stack frames in hibernate_asm_64.S) and
commit 4ce827b4cc (x86/power/64: Fix hibernation return address
corruption).
Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently if get_e820_md5() fails, then it will hibernate nevertheless.
Actually the error code should be propagated to upper caller so that
the hibernation could be aware of the result and terminates the process
if md5 digest fails.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Clang warns when multiple pairs of parentheses are used for a single
conditional statement.
arch/x86/kernel/cpu/amd.c:925:14: warning: equality comparison with
extraneous parentheses [-Wparentheses-equality]
if ((c->x86 == 6)) {
~~~~~~~^~~~
arch/x86/kernel/cpu/amd.c:925:14: note: remove extraneous parentheses
around the comparison to silence this warning
if ((c->x86 == 6)) {
~ ^ ~
arch/x86/kernel/cpu/amd.c:925:14: note: use '=' to turn this equality
comparison into an assignment
if ((c->x86 == 6)) {
^~
=
1 warning generated.
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181002224511.14929-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/187
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I fixed the vDSO build to use inline retpolines, I messed up
the Makefile logic and made it unconditional. It should have
depended on CONFIG_RETPOLINE and on the availability of compiler
support. This broke the build on some older compilers.
Reported-by: nikola.ciprich@linuxbox.cz
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Rickard <matt@softrans.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jason.vas.dias@gmail.com
Cc: stable@vger.kernel.org
Fixes: 2e549b2ee0 ("x86/vdso: Fix vDSO build if a retpoline is emitted")
Link: http://lkml.kernel.org/r/08a1f29f2c238dd1f493945e702a521f8a5aa3ae.1538540801.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A root port Device ID changed between simulation and production. Rather
than match Device IDs which may not be future-proof if left unmaintained,
match all root ports which exist in a VMD domain.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The recent rework of the TSC calibration code introduced a regression on UV
systems as it added a call to tsc_early_init() which initializes the TSC
ADJUST values before acpi_boot_table_init(). In the case of UV systems,
that is a necessary step that calls uv_system_init(). This informs
tsc_sanitize_first_cpu() that the kernel runs on a platform with async TSC
resets as documented in commit 341102c3ef ("x86/tsc: Add option that TSC
on Socket 0 being non-zero is valid")
Fix it by skipping the early tsc initialization on UV systems and let TSC
init tests take place later in tsc_init().
Fixes: cf7a63ef4e ("x86/tsc: Calibrate tsc only once")
Suggested-by: Hedi Berriche <hedi.berriche@hpe.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Russ Anderson <rja@hpe.com>
Reviewed-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Xiaoming Gao <gxm.linux.kernel@gmail.com>
Cc: Rajvi Jingar <rajvi.jingar@intel.com>
Link: https://lkml.kernel.org/r/20181002180144.923579706@stormcage.americas.sgi.com
Introduce is_early_uv_system() which uses efi.uv_systab to decide early
in the boot process whether the kernel runs on a UV system.
This is needed to skip other early setup/init code that might break
the UV platform if done too early such as before necessary ACPI tables
parsing takes place.
Suggested-by: Hedi Berriche <hedi.berriche@hpe.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Russ Anderson <rja@hpe.com>
Reviewed-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Xiaoming Gao <gxm.linux.kernel@gmail.com>
Cc: Rajvi Jingar <rajvi.jingar@intel.com>
Link: https://lkml.kernel.org/r/20181002180144.801700401@stormcage.americas.sgi.com
The "pciserial" earlyprintk variant helps much on many modern x86
platforms, but unfortunately there are still some platforms with PCI
UART devices which have the wrong PCI class code. In that case, the
current class code check does not allow for them to be used for logging.
Add a sub-option "force" which overrides the class code check and thus
the use of such device can be enforced.
[ bp: massage formulations. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Stuart R . Anderson" <stuart.r.anderson@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H Peter Anvin <hpa@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thymo van Beers <thymovanbeers@gmail.com>
Cc: alan@linux.intel.com
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/20181002164921.25833-1-feng.tang@intel.com
A ucode patch is needed for Goldmont Plus while counter freezing feature
is enabled. Otherwise, there will be some issues, e.g. PMI flood with
some events.
Add a quirk to check microcode version. If the system starts with the
wrong ucode, leave the counter-freezing feature permanently disabled.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1533712328-2834-3-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Going primarily by:
https://en.wikipedia.org/wiki/List_of_Intel_Atom_microprocessors
with additional information gleaned from other related pages; notably:
- Bonnell shrink was called Saltwell
- Moorefield is the Merriefield refresh which makes it Airmont
The general naming scheme is: FAM6_ATOM_UARCH_SOCTYPE
for i in `git grep -l FAM6_ATOM` ; do
sed -i -e 's/ATOM_PINEVIEW/ATOM_BONNELL/g' \
-e 's/ATOM_LINCROFT/ATOM_BONNELL_MID/' \
-e 's/ATOM_PENWELL/ATOM_SALTWELL_MID/g' \
-e 's/ATOM_CLOVERVIEW/ATOM_SALTWELL_TABLET/g' \
-e 's/ATOM_CEDARVIEW/ATOM_SALTWELL/g' \
-e 's/ATOM_SILVERMONT1/ATOM_SILVERMONT/g' \
-e 's/ATOM_SILVERMONT2/ATOM_SILVERMONT_X/g' \
-e 's/ATOM_MERRIFIELD/ATOM_SILVERMONT_MID/g' \
-e 's/ATOM_MOOREFIELD/ATOM_AIRMONT_MID/g' \
-e 's/ATOM_DENVERTON/ATOM_GOLDMONT_X/g' \
-e 's/ATOM_GEMINI_LAKE/ATOM_GOLDMONT_PLUS/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: dave.hansen@linux.intel.com
Cc: len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implements counter freezing for Arch Perfmon v4 (Skylake and
newer). This allows to speed up the PMI handler by avoiding
unnecessary MSR writes and make it more accurate.
The Arch Perfmon v4 PMI handler is substantially different than
the older PMI handler.
Differences to the old handler:
- It relies on counter freezing, which eliminates several MSR
writes from the PMI handler and lowers the overhead significantly.
It makes the PMI handler more accurate, as all counters get
frozen atomically as soon as any counter overflows. So there is
much less counting of the PMI handler itself.
With the freezing we don't need to disable or enable counters or
PEBS. Only BTS which does not support auto-freezing still needs to
be explicitly managed.
- The PMU acking is done at the end, not the beginning.
This makes it possible to avoid manual enabling/disabling
of the PMU, instead we just rely on the freezing/acking.
- The APIC is acked before reenabling the PMU, which avoids
problems with LBRs occasionally not getting unfreezed on Skylake.
- Looping is only needed to workaround a corner case which several PMIs
are very close to each other. For common cases, the counters are freezed
during PMI handler. It doesn't need to do re-check.
This patch:
- Adds code to enable v4 counter freezing
- Fork <=v3 and >=v4 PMI handlers into separate functions.
- Add kernel parameter to disable counter freezing. It took some time to
debug counter freezing, so in case there are new problems we added an
option to turn it off. Would not expect this to be used until there
are new bugs.
- Only for big core. The patch for small core will be posted later
separately.
Performance:
When profiling a kernel build on Kabylake with different perf options,
measuring the length of all NMI handlers using the nmi handler
trace point:
V3 is without counter freezing.
V4 is with counter freezing.
The value is the average cost of the PMI handler.
(lower is better)
perf options ` V3(ns) V4(ns) delta
-c 100000 1088 894 -18%
-g -c 100000 1862 1646 -12%
--call-graph lbr -c 100000 3649 3367 -8%
--c.g. dwarf -c 100000 2248 1982 -12%
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Arch Perfmon v4 PMI handler is substantially different than
the older PMI handler. Instead of adding more and more ifs cleanly
fork the new handler into a new function, with the main common
code factored out into a common function.
Fix complaint from checkpatch.pl by removing "false" from "static bool
warned".
No functional change.
Based-on-code-from: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1533712328-2834-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In Family 17h, some L3 Cache Performance events require the ThreadMask
and SliceMask to be set. For other events, these fields do not affect
the count either way.
Set ThreadMask and SliceMask to 0xFF and 0xF respectively.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee <Suravee.Suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/Message-ID:
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The counters on M3UPI Link 0 and Link 3 don't count properly, and writing
0 to these counters may causes system crash on some machines.
The PCI BDF addresses of the M3UPI in the current code are incorrect.
The correct addresses should be:
D18:F1 0x204D
D18:F2 0x204E
D18:F5 0x204D
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: cd34cd97b7 ("perf/x86/intel/uncore: Add Skylake server uncore support")
Link: http://lkml.kernel.org/r/1537538826-55489-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The syscall fallbacks in the vDSO have incorrect asm constraints.
They are not marked as writing to their outputs -- instead, they are
marked as clobbering "memory", which is useless. In particular, gcc
is smart enough to know that the timespec parameter hasn't escaped,
so a memory clobber doesn't clobber it. And passing a pointer as an
asm *input* does not tell gcc that the pointed-to value is changed.
Add in the fact that the asm instructions weren't volatile, and gcc
was free to omit them entirely unless their sole output (the return
value) is used. Which it is (phew!), but that stops happening with
some upcoming patches.
As a trivial example, the following code:
void test_fallback(struct timespec *ts)
{
vdso_fallback_gettime(CLOCK_MONOTONIC, ts);
}
compiles to:
00000000000000c0 <test_fallback>:
c0: c3 retq
To add insult to injury, the RCX and R11 clobbers on 64-bit
builds were missing.
The "memory" clobber is also unnecessary -- no ordering with respect to
other memory operations is needed, but that's going to be fixed in a
separate not-for-stable patch.
Fixes: 2aae950b21 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/2c0231690551989d2fafa60ed0e7b5cc8b403908.1538422295.git.luto@kernel.org
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
=dMl4
-----END PGP SIGNATURE-----
Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One defense against L1TF in KVM is to always set the upper five bits
of the *legal* physical address in the SPTEs for non-present and
reserved SPTEs, e.g. MMIO SPTEs. In the MMIO case, the GFN of the
MMIO SPTE may overlap with the upper five bits that are being usurped
to defend against L1TF. To preserve the GFN, the bits of the GFN that
overlap with the repurposed bits are shifted left into the reserved
bits, i.e. the GFN in the SPTE will be split into high and low parts.
When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
access, get_mmio_spte_gfn() unshifts the affected bits and restores
the original GFN for comparison. Unfortunately, get_mmio_spte_gfn()
neglects to mask off the reserved bits in the SPTE that were used to
store the upper chunk of the GFN. As a result, KVM fails to detect
MMIO accesses whose GPA overlaps the repurprosed bits, which in turn
causes guest panics and hangs.
Fix the bug by generating a mask that covers the lower chunk of the
GFN, i.e. the bits that aren't shifted by the L1TF mitigation. The
alternative approach would be to explicitly zero the five reserved
bits that are used to store the upper chunk of the GFN, but that
requires additional run-time computation and makes an already-ugly
bit of code even more inscrutable.
I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
warn if GENMASK_ULL() generated a nonsensical value, but that seemed
silly since that would mean a system that supports VMX has less than
18 bits of physical address space...
Reported-by: Sakari Ailus <sakari.ailus@iki.fi>
Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Cc: Junaid Shahid <junaids@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only
when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls.
Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which
is L1 IA32_BNDCFGS.
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit a87036add0 ("KVM: x86: disable MPX if host did not enable
MPX XSAVE features") introduced kvm_mpx_supported() to return true
iff MPX is enabled in the host.
However, that commit seems to have missed replacing some calls to
kvm_x86_ops->mpx_supported() to kvm_mpx_supported().
Complete original commit by replacing remaining calls to
kvm_mpx_supported().
Fixes: a87036add0 ("KVM: x86: disable MPX if host did not enable
MPX XSAVE features")
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before this commit, KVM exposes MPX VMX controls to L1 guest only based
on if KVM and host processor supports MPX virtualization.
However, these controls should be exposed to guest only in case guest
vCPU supports MPX.
Without this change, a L1 guest running with kernel which don't have
commit 691bd4340b ("kvm: vmx: allow host to access guest
MSR_IA32_BNDCFGS") asserts in QEMU on the following:
qemu-kvm: error: failed to set MSR 0xd90 to 0x0
qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs:
Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed'
This is because L1 KVM kvm_init_msr_list() will see that
vmx_mpx_supported() (As it only checks MPX VMX controls support) and
therefore KVM_GET_MSR_INDEX_LIST IOCTL will include MSR_IA32_BNDCFGS.
However, later when L1 will attempt to set this MSR via KVM_SET_MSRS
IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu).
Therefore, fix the issue by exposing MPX VMX controls to L1 guest only
when vCPU supports MPX.
Fixes: 36be0b9deb ("KVM: x86: Add nested virtualization support for MPX")
Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CONFIG_AS_CRC32 is not used anywhere. Its last user was removed by
0cb6c969ed ("net, lib: kill arch_fast_hash library bits")
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1538389443-28514-1-git-send-email-yamada.masahiro@socionext.com
Replace open-coded use of the SETcc instruction with CC_SET()/CC_OUT()
in __cmpxchg_double().
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/CAFULd4YdvwwhXWHqqPsGk5+TLG71ozgSscTZNsqmrm+Jzg941w@mail.gmail.com
Thomas writes:
"A single fix for the AMD memory encryption boot code so it does not
read random garbage instead of the cached encryption bit when a kexec
kernel is allocated above the 32bit address limit."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix kexec booting failure in the SEV bit detection code
The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.
Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.
The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.
As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents additional cache
misses incurred during the extra testing code used to decide if a
L2 and/or L3 measurement should be made.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: peterz@infradead.org
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/fc24e728b446404f42c78573c506e98cd0599873.1537468643.git.reinette.chatre@intel.com
A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.
In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, what specifies what needs to be measured, is what
will be different between the events used. This configuration needs to
be done with X86_CONFIG that cannot be used as part of the designated
initializers used here, this will be introduced later.
Although they do look identical at this time the attribute structures
needs to be maintained separately since a perf_event will maintain a
pointer to its unique attributes.
In support of patch testing the new structs are given the unused attribute
until their use in later patches.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/1822f6164e221a497648d108913d056ab675d5d0.1537377064.git.reinette.chatre@intel.com
Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.
While local register variables do accomplish the goal they do require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.
Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.
Make use of local variables in the critical code but do so with
READ_ONCE() to prevent the compiler from merging or refetching reads.
Ensure these variables are initialized before the measurement starts,
and ensure it is only the local variables that are accessed during
the measurement.
With the removal of the local register variables and using READ_ONCE()
there is no longer a motivation for using a direct wrmsr call (that
avoids the additional tracing code that may clobber the local register
variables).
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/f430f57347414e0691765d92b144758ab93d8407.1537377064.git.reinette.chatre@intel.com
perf_event_read_local() is the safest way to obtain measurements
associated with performance events. In some cases the overhead
introduced by perf_event_read_local() affects the measurements and the
use of rdpmcl() is needed. rdpmcl() requires the index
of the performance counter used so a helper is introduced to determine
the index used by a provided performance event.
The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check the use of this counter
needs to be done with care to ensure it is queried and used within the
same disabled interrupts section.
This change introduces a new checkpatch warning:
CHECK: extern prototypes should be avoided in .h files
+extern int x86_perf_rdpmc_index(struct perf_event *event);
This warning was discussed and designated as a false positive in
http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/b277ffa78a51254f5414f7b1bc1923826874566e.1537377064.git.reinette.chatre@intel.com
Use the for_each_of_cpu_node iterator to iterate over cpu nodes. This
has the side effect of defaulting to iterating using "cpu" node names in
preference to the deprecated (for FDT) device_type == "cpu".
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rob Herring <robh@kernel.org>
In the quest to remove all stack VLA usage from the kernel[1], this
replaces struct crypto_skcipher and SKCIPHER_REQUEST_ON_STACK() usage
with struct crypto_sync_skcipher and SYNC_SKCIPHER_REQUEST_ON_STACK(),
which uses a fixed stack size.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Cc: x86@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
A Generation-2 Linux VM on Hyper-V doesn't have the legacy PCI bus, and
users always see the scary warning, which is actually harmless.
Suppress it.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
Cc: Olaf Aepfle <olaf@aepfle.de>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Marcelo Cerri <marcelo.cerri@canonical.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Link: https://lkml.kernel.org/r/ <KU1P153MB0166D977DC930996C4BF538ABF1D0@KU1P153MB0166.APCP153.PROD.OUTLOOK.COM
If we IPI for WBINDV, then we might as well kill the entire TLB too.
But if we don't have to invalidate cache, there is no reason not to
use a range TLB flush.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.195633798@infradead.org
The start of cpa_flush_range() and cpa_flush_array() is the same, use
a common function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.138859183@infradead.org
Rather than guarding cpa_flush_array() users with a CLFLUSH test, put
it inside.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.087848187@infradead.org
Rather than guarding all cpa_flush_range() uses with a CLFLUSH test,
put it inside.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.036195503@infradead.org
Both cpa_flush_range() and cpa_flush_array() have a well specified
range, use that to do a range based TLB invalidate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.985193217@infradead.org
CAT has happened, WBINDV is bad (even before CAT blowing away the
entire cache on a multi-core platform wasn't nice), try not to use it
ever.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.933674526@infradead.org
There is an atom errata, where we do a local TLB invalidate right
before we return and then do a global TLB invalidate.
Move the global invalidate up a little bit and avoid the local
invalidate entirely.
This does put the global invalidate under pgd_lock, but that shouldn't
matter.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.882287392@infradead.org
Instead of open-coding it..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.831102058@infradead.org
The extra loop which tries hard to preserve large pages in case of conflicts
with static protection regions turns out to be not preserving anything, at
least not in the experiments which have been conducted.
There might be corner cases in which the code would be able to preserve a
large page oaccsionally, but it's really not worth the extra code and the
cycles wasted in the common case.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 541
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 514
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.589642503@linutronix.de
To avoid excessive 4k wise checks in the common case do a quick check first
whether the requested new page protections conflict with a static
protection area in the large page. If there is no conflict then the
decision whether to preserve or to split the page can be made immediately.
If the requested range covers the full large page, preserve it. Otherwise
split it up. No point in doing a slow crawl in 4k steps.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 560642
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 541
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 514
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.507259989@linutronix.de
When the existing mapping is correct and the new requested page protections
are the same as the existing ones, then further checks can be omitted and the
large page can be preserved. The slow path 4k wise check will not come up with
a different result.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 540
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 800709
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 560642
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.424477581@linutronix.de
With the range check it is possible to do a quick verification that the
current mapping is correct vs. the static protection areas.
In case a incorrect mapping is detected a warning is emitted and the large
page is split up. If the large page is a 2M page, then the split code is
forced to check the static protections for the PTE entries to fix up the
incorrectness. For 1G pages this can't be done easily because that would
require to either find the offending 2M areas before the split or
afterwards. For now just warn about that case and revisit it when reported.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.331408643@linutronix.de
The large page preservation mechanism is just magic and provides no
information at all. Add optional statistic output in debugfs so the magic can
be evaluated. Defaults is off.
Output:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 540
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 800770
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.160867778@linutronix.de
The whole static protection magic is silently fixing up anything which is
handed in. That's just wrong. The offending call sites need to be fixed.
Add a debug mechanism which emits a warning if a requested mapping needs to be
fixed up. The DETECT debug mechanism is really not meant to be enabled except
for developers, so limit the output hard to the protection fixups.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.078998733@linutronix.de
Checking static protections only page by page is slow especially for huge
pages. To allow quick checks over a complete range, add the ability to do
that.
Make the checks inclusive so the ranges can be directly used for debug output
later.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.995734490@linutronix.de
static_protections() is pretty unreadable. Split it up into separate checks
for each protection area.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.913005317@linutronix.de
Avoid the extra variable and gotos by splitting the function into the
actual algorithm and a callable function which contains the lock
protection.
Rename it to should_split_large_page() while at it so the return values make
actually sense.
Clean up the code flow, comments and general whitespace damage while at it. No
functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.830507216@linutronix.de
The sequence of marking text and rodata read-only in 32bit init is:
set_ro(text);
kernel_set_to_readonly = 1;
set_ro(rodata);
When kernel_set_to_readonly is 1 it enables the protection mechanism in CPA
for the read only regions. With the upcoming checks for existing mappings
this consequently triggers the warning about an existing mapping being
incorrect vs. static protections because rodata has not been converted yet.
There is no technical reason to split the two, so just combine the RO
protection to convert text and rodata in one go.
Convert the printks to pr_info while at it.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.731701535@linutronix.de
Commit
1958b5fc40 ("x86/boot: Add early boot support when running with SEV active")
can occasionally cause system resets when kexec-ing a second kernel even
if SEV is not active.
That's because get_sev_encryption_bit() uses 32-bit rIP-relative
addressing to read the value of enc_bit - a variable which caches a
previously detected encryption bit position - but kexec may allocate
the early boot code to a higher location, beyond the 32-bit addressing
limit.
In this case, garbage will be read and get_sev_encryption_bit() will
return the wrong value, leading to accessing memory with the wrong
encryption setting.
Therefore, remove enc_bit, and thus get rid of the need to do 32-bit
rIP-relative addressing in the first place.
[ bp: massage commit message heavily. ]
Fixes: 1958b5fc40 ("x86/boot: Add early boot support when running with SEV active")
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-kernel@vger.kernel.org
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: brijesh.singh@amd.com
Cc: kexec@lists.infradead.org
Cc: dyoung@redhat.com
Cc: bhe@redhat.com
Cc: ghook@redhat.com
Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com
The Hygon Dhyana CPU has a topology extensions bit in CPUID. With
this bit, the kernel can get the cache information. So add support in
cpuid4_cache_lookup_regs() to get the correct cache size.
The Hygon Dhyana CPU also discovers num_cache_leaves via CPUID leaf
0x8000001d, so add support to it in find_num_cache_leaves().
Also add cacheinfo_hygon_init_llc_id() and init_hygon_cacheinfo()
functions to initialize Dhyana cache info. Setup cache cpumap in the
same way as AMD does.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: bp@alien8.de
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: thomas.lendacky@amd.com
Link: https://lkml.kernel.org/r/2a686b2ac0e2f5a1f2f5f101124d9dd44f949731.1537533369.git.puwen@hygon.cn
Similar to the arm64 case, 64-bit x86 can benefit from using relative
references rather than absolute ones when emitting struct jump_entry
instances. Not only does this reduce the memory footprint of the entries
themselves by 33%, it also removes the need for carrying relocation
metadata on relocatable builds (i.e., for KASLR) which saves a fair
chunk of .init space as well (although the savings are not as dramatic
as on arm64)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-7-ard.biesheuvel@linaro.org
In preparation of switching x86 to use place-relative references for
the code, target and key members of struct jump_entry, replace direct
references to the struct members with invocations of the new accessors.
This will allow us to make the switch by modifying the accessors only.
This incorporates a cleanup of __jump_label_transform() proposed by
Peter.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-6-ard.biesheuvel@linaro.org
Add support for R_X86_64_PC64 relocations, which operate on 64-bit
quantities holding a relative symbol reference. Also remove the
definition of R_X86_64_NUM: given that it is currently unused, it
is unclear what the new value should be.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-5-ard.biesheuvel@linaro.org
- Add support for enlisting the help of the EFI firmware to create memory
reservations that persist across kexec.
- Add page fault handling to the runtime services support code on x86 so
we can gracefully recover from buggy EFI firmware.
- Fix command line handling on x86 for the boot path that omits the stub's
PE/COFF entry point.
- Other assorted fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAlurXR8ACgkQwjcgfpV0
+n2CGwf/V4exixXjTDwkqE6gY5bq0Y3AL8tp89wdbJzjgGOIJLKh3CrGr8xEFHrv
oYObcvB3SfNEIyGeBjc/8ZMw1P/j98s6ucsMm0u+V52k7xxu/xJoIPw3bX2R8LLc
QhedUmKWLFQXxottaqzRFi1m0rP9TlAlc2n2pjIPCywjTPzeT/jBTtnRGRRdpDkN
uxwv59eXc6MXuwJGhM9lGIBCu8ra54SiSByJSKoMwNYXQRCLtiBUg5iibWkKigHp
9rQiimQnDOuPiZ6JGFx6pwSu7cqv3d8LYk5EnU3zYfzxAvHRfxuf40joSeZzySby
vZ4zRog79DxkSnuvaQ0+phQHiq+yQg==
=HZGk
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI updates for v4.20 from Ard Biesheuvel:
- Add support for enlisting the help of the EFI firmware to create memory
reservations that persist across kexec.
- Add page fault handling to the runtime services support code on x86 so
we can gracefully recover from buggy EFI firmware.
- Fix command line handling on x86 for the boot path that omits the stub's
PE/COFF entry point.
- Other assorted fixes.
Add x86 architecture support for a new processor: Hygon Dhyana Family
18h. Carve out initialization code needed by Dhyana into a separate
compilation unit.
To identify Hygon Dhyana CPU, add a new vendor type X86_VENDOR_HYGON.
Since Dhyana uses AMD functionality to a large degree, select
CPU_SUP_AMD which provides that functionality.
[ bp: drop explicit license statement as it has an SPDX tag already. ]
Signed-off-by: Pu Wen <puwen@hygon.cn>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: thomas.lendacky@amd.com
Link: https://lkml.kernel.org/r/1a882065223bacbde5726f3beaa70cebd8dcd814.1537533369.git.puwen@hygon.cn
Nothing Xen specific in these headers, which get included from a lot
of code in the kernel. So prune the includes and move them to the
Xen-specific files that actually use them instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Take the Xen check into the core code instead of delegating it to
the architectures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having multiple externs in arch headers is not a good way to provide
a common interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If spectrev2 mitigation has been enabled, RSB is filled on context switch
in order to protect from various classes of spectrev2 attacks.
If this mitigation is enabled, say so in sysfs for spectrev2.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438580.15880@cbobk.fhfr.pm
STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
(once enabled) prevents cross-hyperthread control of decisions made by
indirect branch predictors.
Enable this feature if
- the CPU is vulnerable to spectre v2
- the CPU supports SMT and has SMT siblings online
- spectre_v2 mitigation autoselection is enabled (default)
After some previous discussion, this leaves STIBP on all the time, as wrmsr
on crossing kernel boundary is a no-no. This could perhaps later be a bit
more optimized (like disabling it in NOHZ, experiment with disabling it in
idle, etc) if needed.
Note that the synchronization of the mask manipulation via newly added
spec_ctrl_mutex is currently not strictly needed, as the only updater is
already being serialized by cpu_add_remove_lock, but let's make this a
little bit more future-proof.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
Currently, IBPB is only issued in cases when switching into a non-dumpable
process, the rationale being to protect such 'important and security
sensitive' processess (such as GPG) from data leaking into a different
userspace process via spectre v2.
This is however completely insufficient to provide proper userspace-to-userpace
spectrev2 protection, as any process can poison branch buffers before being
scheduled out, and the newly scheduled process immediately becomes spectrev2
victim.
In order to minimize the performance impact (for usecases that do require
spectrev2 protection), issue the barrier only in cases when switching between
processess where the victim can't be ptraced by the potential attacker (as in
such cases, the attacker doesn't have to bother with branch buffers at all).
[ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
fine-grained ]
Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
We currently align the end of the compressed image to a multiple of
16. However, the PE-COFF header included in the EFI stub says that
the file alignment is 32 bytes, and when adding an EFI signature to
the file it must first be padded to this alignment.
sbsigntool commands warn about this:
warning: file-aligned section .text extends beyond end of file
warning: checksum areas are greater than image size. Invalid section table?
Worse, pesign -at least when creating a detached signature- uses the
hash of the unpadded file, resulting in an invalid signature if
padding is required.
Avoid both these problems by increasing alignment to 32 bytes when
CONFIG_EFI_STUB is enabled.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Before this commit we were only calling efi_parse_options() from
make_boot_params(), but make_boot_params() only gets called if the
kernel gets booted directly as an EFI executable. So when booted through
e.g. grub we ended up not parsing the commandline in the boot code.
This makes the drivers/firmware/efi/libstub code ignore the "quiet"
commandline argument resulting in the following message being printed:
"EFI stub: UEFI Secure Boot is enabled."
Despite the quiet request. This commits adds an extra call to
efi_parse_options() to efi_main() to make sure that the options are
always processed. This fixes quiet not working.
This also fixes the libstub code ignoring nokaslr and efi=nochunk.
Reported-by: Peter Robinson <pbrobinson@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
EFI GOP uses 64-bit frame buffer address in some BIOS.
Add 64bit address support in efi earlyprintk.
Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
efi_switch_mm() is a wrapper around switch_mm() which saves current's
->active_mm, sets the requests mm as ->active_mm and invokes
switch_mm().
I don't think that task_lock() is required during that procedure. It
protects ->mm which isn't changed here.
It needs to be mentioned that during the whole procedure (switch to
EFI's mm and back) the preemption needs to be disabled. A context switch
at this point would reset the cr3 value based on current->mm. Also, this
function may not be invoked at the same time on a different CPU because
it would overwrite the efi_scratch.prev_mm information.
Remove task_lock() and also update the comment to reflect it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Memory accesses performed by UEFI runtime services should be limited to:
- reading/executing from EFI_RUNTIME_SERVICES_CODE memory regions
- reading/writing from/to EFI_RUNTIME_SERVICES_DATA memory regions
- reading/writing by-ref arguments
- reading/writing from/to the stack.
Accesses outside these regions may cause the kernel to hang because the
memory region requested by the firmware isn't mapped in efi_pgd, which
causes a page fault in ring 0 and the kernel fails to handle it, leading
to die(). To save kernel from hanging, add an EFI specific page fault
handler which recovers from such faults by
1. If the efi runtime service is efi_reset_system(), reboot the machine
through BIOS.
2. If the efi runtime service is _not_ efi_reset_system(), then freeze
efi_rts_wq and schedule a new process.
The EFI page fault handler offers us two advantages:
1. Avoid potential hangs caused by buggy firmware.
2. Shout loud that the firmware is buggy and hence is not a kernel bug.
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[ardb: clarify commit log]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To reuse the static functions and the struct declarations, move them to
corresponding header files and export the needed functions.
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Turn the macro into an inline, move it to blk.h and simplify the
arch hooks a bit.
Also rename the function to biovec_phys_mergeable as there is no need
to shout.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
KVM has an old optimization whereby accesses to the kernel GS base MSR
are trapped when the guest is in 32-bit and not when it is in 64-bit mode.
The idea is that swapgs is not available in 32-bit mode, thus the
guest has no reason to access the MSR unless in 64-bit mode and
32-bit applications need not pay the price of switching the kernel GS
base between the host and the guest values.
However, this optimization adds complexity to the code for little
benefit (these days most guests are going to be 64-bit anyway) and in fact
broke after commit 678e315e78 ("KVM: vmx: add dedicated utility to
access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base
can be corrupted across SMIs and UEFI Secure Boot is therefore broken
(a secure boot Linux guest, for example, fails to reach the login prompt
about half the time). This patch just removes the optimization; the
kernel GS base MSR is now never trapped by KVM, similarly to the FS and
GS base MSRs.
Fixes: 678e315e78
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
..so that they match their asm counterpart.
Add the missing ANNOTATE_NOSPEC_ALTERNATIVE in CALL_NOSPEC, while at it.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wang YanQing <udknight@gmail.com>
Cc: dhaval.giani@oracle.com
Cc: srinivas.eeda@oracle.com
Link: http://lkml.kernel.org/r/c3975665-173e-4d70-8dee-06c926ac26ee@default
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW6dGhgAKCRCAXGG7T9hj
vs1UAPwPSDmelfUus+P5ndRQZdK/iQWuRgQUe9Gd3RUVTfcQ7AEAljcN4/dSj7SB
hOgRlCp5WB1s5/vFF7z4jc2wtqvOPAk=
=8P9c
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Juergen writes:
"xen:
Two small fixes for xen drivers."
* tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: issue warning message when out of grant maptrack entries
xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code
Thomas writes:
"A set of fixes for x86:
- Resolve the kvmclock regression on AMD systems with memory
encryption enabled. The rework of the kvmclock memory allocation
during early boot results in encrypted storage, which is not
shareable with the hypervisor. Create a new section for this data
which is mapped unencrypted and take care that the later
allocations for shared kvmclock memory is unencrypted as well.
- Fix the build regression in the paravirt code introduced by the
recent spectre v2 updates.
- Ensure that the initial static page tables cover the fixmap space
correctly so early console always works. This worked so far by
chance, but recent modifications to the fixmap layout can -
depending on kernel configuration - move the relevant entries to a
different place which is not covered by the initial static page
tables.
- Address the regressions and issues which got introduced with the
recent extensions to the Intel Recource Director Technology code.
- Update maintainer entries to document reality"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Expand static page table for fixmap space
MAINTAINERS: Add X86 MM entry
x86/intel_rdt: Add Reinette as co-maintainer for RDT
MAINTAINERS: Add Borislav to the x86 maintainers
x86/paravirt: Fix some warning messages
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Fix exclusive mode handling of MBA resource
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Do not allow pseudo-locking of MBA resource
x86/intel_rdt: Fix unchecked MSR access
x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
x86/intel_rdt: Global closid helper to support future fixes
x86/intel_rdt: Fix size reporting of MBA resource
x86/intel_rdt: Fix data type in parsing callbacks
x86/kvm: Use __bss_decrypted attribute in shared variables
x86/mm: Add .bss..decrypted section to hold shared variables
Presently we check first if CPUID is enabled. If it is not already
enabled, then we next call identify_cpu_without_cpuid() and clear
X86_FEATURE_CPUID.
Unfortunately, identify_cpu_without_cpuid() is the function where CPUID
becomes _enabled_ on Cyrix 6x86/6x86L CPUs.
Reverse the calling sequence so that CPUID is first enabled, and then
check a second time to see if the feature has now been activated.
[ bp: Massage commit message and remove trailing whitespace. ]
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180921212041.13096-3-tedheadster@gmail.com
There are comments in processor-cyrix.h advising you to _not_ make calls
using the deprecated macros in this style:
setCx86_old(CX86_CCR4, getCx86_old(CX86_CCR4) | 0x80);
This is because it expands the macro into a non-functioning calling
sequence. The calling order must be:
outb(CX86_CCR2, 0x22);
inb(0x23);
From the comments:
* When using the old macros a line like
* setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x88);
* gets expanded to:
* do {
* outb((CX86_CCR2), 0x22);
* outb((({
* outb((CX86_CCR2), 0x22);
* inb(0x23);
* }) | 0x88), 0x23);
* } while (0);
The new macros fix this problem, so use them instead.
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jia Zhang <qianyue.zj@alibaba-inc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180921212041.13096-2-tedheadster@gmail.com
I swear I would have sent it the same to Linus! The main cause for
this is that I was on vacation until two weeks ago and it took a while
to sort all the pending patches between 4.19 and 4.20, test them and
so on.
It's mostly small bugfixes and cleanups, mostly around x86 nested
virtualization. One important change, not related to nested
virtualization, is that the ability for the guest kernel to trap CPUID
instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is now
masked by default. This is because the feature is detected through an
MSR; a very bad idea that Intel seems to like more and more. Some
applications choke if the other fields of that MSR are not initialized
as on real hardware, hence we have to disable the whole MSR by default,
as was the case before Linux 4.12.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbpPo1AAoJEL/70l94x66DdxgH/is0qe6ZBtzb6Qc0W+8mHHD7
nxIkWAs2V5NsouJ750YwRQ+0Ym407+wlNt30acdBUEoXhrnA5/TvyGq999XvCL96
upWEIxpIgbvTMX/e2nLhe4wQdhsboUK4r0/B9IFgVFYrdCt5uRXjB2G4ewxcqxL/
GxxqrAKhaRsbQG9Xv0Fw5Vohh/Ls6fQDJcyuY1EBnbMpVenq2QDLI6cOAPXncyFb
uLN6ov4GNCWIPckwxejri5XhZesUOsafrmn48sApShh4T6TrisrdtSYdzl+DGza+
j5vhIEwdFO5kulZ3viuhqKJOnS2+F6wvfZ75IKT0tEKeU2bi+ifGDyGRefSF6Q0=
=YXLw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Paolo writes:
"It's mostly small bugfixes and cleanups, mostly around x86 nested
virtualization. One important change, not related to nested
virtualization, is that the ability for the guest kernel to trap
CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is
now masked by default. This is because the feature is detected
through an MSR; a very bad idea that Intel seems to like more and
more. Some applications choke if the other fields of that MSR are
not initialized as on real hardware, hence we have to disable the
whole MSR by default, as was the case before Linux 4.12."
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits)
KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs
kvm: selftests: Add platform_info_test
KVM: x86: Control guest reads of MSR_PLATFORM_INFO
KVM: x86: Turbo bits in MSR_PLATFORM_INFO
nVMX x86: Check VPID value on vmentry of L2 guests
nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2
KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv
KVM: VMX: check nested state and CR4.VMXE against SMM
kvm: x86: make kvm_{load|put}_guest_fpu() static
x86/hyper-v: rename ipi_arg_{ex,non_ex} structures
KVM: VMX: use preemption timer to force immediate VMExit
KVM: VMX: modify preemption timer bit only when arming timer
KVM: VMX: immediately mark preemption timer expired only for zero value
KVM: SVM: Switch to bitmap_zalloc()
KVM/MMU: Fix comment in walk_shadow_page_lockless_end()
kvm: selftests: use -pthread instead of -lpthread
KVM: x86: don't reset root in kvm_mmu_setup()
kvm: mmu: Don't read PDPTEs when paging is not enabled
x86/kvm/lapic: always disable MMIO interface in x2APIC mode
KVM: s390: Make huge pages unavailable in ucontrol VMs
...
Now that si_code == SEGV_PKUERR is the flag indicating that a pkey
is present there is no longer a need to pass a pointer to a local
pkey value, instead pkey can be passed more efficiently by value.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that the pkey handling has been removed force_sig_info_fault and
force_sig_fault perform identical work. Just the type of the address
paramter is different. So replace calls to force_sig_info_fault with
calls to force_sig_fault, and remove force_sig_info_fault.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is only one code path that can generate a pkuerr signal. That
code path calls __bad_area_nosemaphore and can be dectected by testing
if si_code == SEGV_PKUERR. It can be seen from inspection that all of
the other tests in fill_sig_info_pkey are unnecessary.
Therefore call force_sig_pkuerr directly from __bad_area_semaphore and
remove fill_sig_info_pkey.
At the same time move the comment above force_sig_info_pkey into
bad_area_access_error, so that the documentation about pkey generation
races is not lost.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is only one caller of __bad_area that passes in PKUERR and thus
will generate a siginfo with si_pkey set. Therefore simplify the
logic and hoist reading of vma_pkey up into that caller, and just
pass *pkey into __bad_area.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are no more users of the computed pkey value in __do_page_fault
so stop computing the value.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
After the previous cleanups to do_sigbus and and bad_area_nosemaphore
mm_fault_error no now longer uses it's pkey parameter. Therefore
remove the unused parameter.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function do_sigbus never sets si_code to PKUERR so it can never
return a pkey to userspace. Therefore remove the unusable pkey
parameter from do_sigbus.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function bad_area_nosemaphore always sets si_code to SEGV_MAPERR
and as such can never return a pkey parameter. Therefore remove the
unusable pkey parameter from bad_area_nosemaphore.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update the DO_ERROR macro to take si_code and si_addr values for a siginfo,
removing the need for the fill_trap_info function.
Update do_trap to also take the sicode and si_addr values for a sigininfo
and modify the code to call force_sig when a sicode is not passed in
and to call force_sig_fault when all of the information is present.
Making this a more obvious, simpler and less error prone construction.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function "force_sig(sig, tsk)" is equivalent to "
force_sig_info(sig, SEND_SIG_PRIV, tsk)". Using the siginfo variants can
be error prone so use the simpler old fashioned force_sig variant,
and with luck the force_sig_info variant can go away.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Instead of generating the siginfo in x86 specific code use the new
helper function force_sig_bnderr to separate the concerns of
collecting the information and generating a proper siginfo.
Making the code easier to understand and maintain.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function do_trap_no_signal embodies almost all of the work of the
function do_trap. The exceptions are setting of thread.error_code and
thread.trap_nr in the case when the signal will be sent, and reporting
which signal will be sent with show_signal.
Filling in struct siginfo and then calling do_trap is problematic as
filling in struct siginfo is an fiddly process that can through
inattention has resulted in fields not initialized and the wrong
fields being filled in.
To avoid this error prone situation I am replacing force_sig_info with
a set of functions that take as arguments the information needed to
send a specific kind of signal.
The function do_trap is called in the context of several different
kinds of signals today. Having a solid do_trap_no_signal that
can be reused allows call sites that send different kinds of
signals to reuse all of the code in do_trap_no_signal.
Modify do_trap_no_signal to have a single exit there signals
where be sent (aka returning -1) to allow more of the signal
sending path to be moved to from do_trap to do_trap_no_signal.
Move setting thread.trap_nr and thread.error_code into do_trap_no_signal
so the code does not need to be duplicated.
Make the type of the string that is passed into do_trap_no_signal to
const. The only user of that str is die and it already takes a const
string, so this just makes it explicit that the string won't change.
All of this prepares the way for using do_trap_no_signal outside
of do_trap.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Clear the MCE struct which is used for collecting the injection details
after injection.
Also, populate it with more details from the machine.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180905081954.10391-1-bp@alien8.de
We met a kernel panic when enabling earlycon, which is due to the fixmap
address of earlycon is not statically setup.
Currently the static fixmap setup in head_64.S only covers 2M virtual
address space, while it actually could be in 4M space with different
kernel configurations, e.g. when VSYSCALL emulation is disabled.
So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
and add a build time check to ensure that the fixmap is covered by the
initial static page tables.
Fixes: 1ad83c858c ("x86_64,vsyscall: Make vsyscall emulation configurable")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com> (Xen parts)
Cc: H Peter Anvin <hpa@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com
The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set
their return value in "r" local var and break out of switch block
when they encounter some error.
This is because vcpu_load() is called before the switch block which
have a proper cleanup of vcpu_put() afterwards.
However, KVM_{GET,SET}_NESTED_STATE IOCTLs handlers just return
immediately on error without performing above mentioned cleanup.
Thus, change these handlers to behave as expected.
Fixes: 8fcc4b5923 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All the cache maintainance is already stubbed out when not enabled,
but merging the two allows us to nicely handle the case where
cache maintainance is required for some devices, but not others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
to reads of MSR_PLATFORM_INFO.
Disabling access to reads of this MSR gives userspace the control to "expose"
this platform-dependent information to guests in a clear way. As it exists
today, guests that read this MSR would get unpopulated information if userspace
hadn't already set it (and prior to this patch series, only the CPUID faulting
information could have been populated). This existing interface could be
confusing if guests don't handle the potential for incorrect/incomplete
information gracefully (e.g. zero reported for base frequency).
Signed-off-by: Drew Schmitt <dasch@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only
the CPUID faulting bit was settable. But now any bit in
MSR_PLATFORM_INFO would be settable. This can be used, for example, to
convey frequency information about the platform on which the guest is
running.
Signed-off-by: Drew Schmitt <dasch@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check needs to be enforced on vmentry of L2 guests:
If the 'enable VPID' VM-execution control is 1, the value of the
of the VPID VM-execution control field must not be 0000H.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on VMX Controls" in Intel SDM vol 3C,
the following check needs to be enforced on vmentry of L2 guests:
- Bits 5:0 of the posted-interrupt descriptor address are all 0.
- The posted-interrupt descriptor address does not set any bits
beyond the processor's physical-address width.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case L1 do not intercept L2 HLT or enter L2 in HLT activity-state,
it is possible for a vCPU to be blocked while it is in guest-mode.
According to Intel SDM 26.6.5 Interrupt-Window Exiting and
Virtual-Interrupt Delivery: "These events wake the logical processor
if it just entered the HLT state because of a VM entry".
Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
should be waken from the HLT state and injected with the interrupt.
In addition, if while the vCPU is blocked (while it is in guest-mode),
it receives a nested posted-interrupt, then the vCPU should also be
waken and injected with the posted interrupt.
To handle these cases, this patch enhances kvm_vcpu_has_events() to also
check if there is a pending interrupt in L2 virtual APICv provided by
L1. That is, it evaluates if there is a pending virtual interrupt for L2
by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
Evaluation of Pending Interrupts.
Note that this also handles the case of nested posted-interrupt by the
fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
kvm_vcpu_running() -> vmx_check_nested_events() ->
vmx_complete_nested_posted_interrupt().
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX cannot be enabled under SMM, check it when CR4 is set and when nested
virtualization state is restored.
This should fix some WARNs reported by syzkaller, mostly around
alloc_shadow_vmcs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The functions
kvm_load_guest_fpu()
kvm_put_guest_fpu()
are only used locally, make them static. This requires also that both
functions are moved because they are used before their implementation.
Those functions were exported (via EXPORT_SYMBOL) before commit
e5bb40251a ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These structures are going to be used from KVM code so let's make
their names reflect their Hyper-V origin.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A VMX preemption timer value of '0' is guaranteed to cause a VMExit
prior to the CPU executing any instructions in the guest. Use the
preemption timer (if it's supported) to trigger immediate VMExit
in place of the current method of sending a self-IPI. This ensures
that pending VMExit injection to L1 occurs prior to executing any
instructions in the guest (regardless of nesting level).
When deferring VMExit injection, KVM generates an immediate VMExit
from the (possibly nested) guest by sending itself an IPI. Because
hardware interrupts are blocked prior to VMEnter and are unblocked
(in hardware) after VMEnter, this results in taking a VMExit(INTR)
before any guest instruction is executed. But, as this approach
relies on the IPI being received before VMEnter executes, it only
works as intended when KVM is running as L0. Because there are no
architectural guarantees regarding when IPIs are delivered, when
running nested the INTR may "arrive" long after L2 is running e.g.
L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
For the most part, this unintended delay is not an issue since the
events being injected to L1 also do not have architectural guarantees
regarding their timing. The notable exception is the VMX preemption
timer[1], which is architecturally guaranteed to cause a VMExit prior
to executing any instructions in the guest if the timer value is '0'
at VMEnter. Specifically, the delay in injecting the VMExit causes
the preemption timer KVM unit test to fail when run in a nested guest.
Note: this approach is viable even on CPUs with a broken preemption
timer, as broken in this context only means the timer counts at the
wrong rate. There are no known errata affecting timer value of '0'.
[1] I/O SMIs also have guarantees on when they arrive, but I have
no idea if/how those are emulated in KVM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Use a hook for SVM instead of leaving the default in x86.c - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide a singular location where the VMX preemption timer bit is
set/cleared so that future usages of the preemption timer can ensure
the VMCS bit is up-to-date without having to modify unrelated code
paths. For example, the preemption timer can be used to force an
immediate VMExit. Cache the status of the timer to avoid redundant
VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
VMEnters/VMExits.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A VMX preemption timer value of '0' at the time of VMEnter is
architecturally guaranteed to cause a VMExit prior to the CPU
executing any instructions in the guest. This architectural
definition is in place to ensure that a previously expired timer
is correctly recognized by the CPU as it is possible for the timer
to reach zero and not trigger a VMexit due to a higher priority
VMExit being signalled instead, e.g. a pending #DB that morphs into
a VMExit.
Whether by design or coincidence, commit f4124500c2 ("KVM: nVMX:
Fully emulate preemption timer") special cased timer values of '0'
and '1' to ensure prompt delivery of the VMExit. Unlike '0', a
timer value of '1' has no has no architectural guarantees regarding
when it is delivered.
Modify the timer emulation to trigger immediate VMExit if and only
if the timer value is '0', and document precisely why '0' is special.
Do this even if calibration of the virtual TSC failed, i.e. VMExit
will occur immediately regardless of the frequency of the timer.
Making only '0' a special case gives KVM leeway to be more aggressive
in ensuring the VMExit is injected prior to executing instructions in
the nested guest, and also eliminates any ambiguity as to why '1' is
a special case, e.g. why wasn't the threshold for a "short timeout"
set to 10, 100, 1000, etc...
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page()
This patch is to fix the commit.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Here is the code path which shows kvm_mmu_setup() is invoked after
kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path,
this means the root_hpa and prev_roots are guaranteed to be invalid. And
it is not necessary to reset it again.
kvm_vm_ioctl_create_vcpu()
kvm_arch_vcpu_create()
vmx_create_vcpu()
kvm_vcpu_init()
kvm_arch_vcpu_init()
kvm_mmu_create()
kvm_arch_vcpu_setup()
kvm_mmu_setup()
kvm_init_mmu()
This patch set reset_roots to false in kmv_mmu_setup().
Fixes: 50c28f21d0
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
CR4.PAE = 1.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When VMX is used with flexpriority disabled (because of no support or
if disabled with module parameter) MMIO interface to lAPIC is still
available in x2APIC mode while it shouldn't be (kvm-unit-tests):
PASS: apic_disable: Local apic enabled in x2APIC mode
PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
FAIL: apic_disable: *0xfee00030: 50014
The issue appears because we basically do nothing while switching to
x2APIC mode when APIC access page is not used. apic_mmio_{read,write}
only check if lAPIC is disabled before proceeding to actual write.
When APIC access is virtualized we correctly manipulate with VMX controls
in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
in x2APIC mode so there's no issue.
Disabling MMIO interface seems to be easy. The question is: what do we
do with these reads and writes? If we add apic_x2apic_mode() check to
apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
go to userspace. When lAPIC is in kernel, Qemu uses this interface to
inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
somehow works with disabled lAPIC but when we're in xAPIC mode we will
get a real injected MSI from every write to lAPIC. Not good.
The simplest solution seems to be to just ignore writes to the region
and return ~0 for all reads when we're in x2APIC mode. This is what this
patch does. However, this approach is inconsistent with what currently
happens when flexpriority is enabled: we allocate APIC access page and
create KVM memory region so in x2APIC modes all reads and writes go to
this pre-allocated page which is, btw, the same for all vCPUs.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Otherwise we may leak kernel stack for events that sample user
registers.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
The code for conditionally printing unhanded signals is duplicated twice
in arch/x86/kernel/traps.c. Factor it out into it's own subroutine
called show_signal to make the code clearer and easier to maintain.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This separates the logic of generating the signal from the logic of
gathering the information about the bounds violation.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The value passed in to addr_referenced is of type void __user *, so update
the addr_referenced parameter in trace_mpx_bounds_register_exception to match.
Also update the addr_referenced paramater in TP_STRUCT__entry as it again
holdes the same value.
I don't know why this was missed earlier but sparse was complaining when
testing test branch so fix this now.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This simplifies the code making it clearer what is going on.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Only the call from do_sigbus will send SIGBUS due to a memory machine
check error. Consolidate all of the machine check signal generation
code in do_sigbus and remove the now unnecessary fault parameter from
force_sig_info_fault.
Explicitly use the now constant si_code BUS_ADRERR in the call
to force_sig_info_fault from do_sigbus.
This makes the code in arch/x86/mm/fault.c easier to follower and
simpler to maintain.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function fill_sigtrap_info now only has one caller so remove
it and put it's contents in it's caller.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace user_single_step_siginfo with user_single_step_report
that allocates siginfo structure on the stack and sends it.
This allows tracehook_report_syscall_exit to become a simple
if statement that calls user_single_step_report or ptrace_report_syscall
depending on the value of step.
Update the default helper function now called user_single_step_report
to explicitly set si_code to SI_USER and to set si_uid and si_pid to 0.
The default helper has always been doing this (using memset) but it
was far from obvious.
The powerpc helper can now just call force_sig_fault.
The x86 helper can now just call send_sigtrap.
Unfortunately the default implementation of user_single_step_report
can not use force_sig_fault as it does not use a SIGTRAP si_code.
So it has to carefully setup the siginfo and use use force_sig_info.
The net result is code that is easier to understand and simpler
to maintain.
Ref: 85ec7fd9f8 ("ptrace: introduce user_single_step_siginfo() helper")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The first argument to WARN_ONCE() is a condition.
Fixes: 5800dc5c19 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt guests")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kernel-janitors@vger.kernel.org
Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda
Crypto stuff from Herbert:
"This push fixes a potential boot hang in ccp and an incorrect
CPU capability check in aegis/morus on x86."
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/aegis,morus - Do not require OSXSAVE for SSE2
crypto: ccp - add timeout support in the SEV command
In order to determine a sane default cache allocation for a new CAT/CDP
resource group, all resource groups are checked to determine which cache
portions are available to share. At this time all possible CLOSIDs
that can be supported by the resource is checked. This is problematic
if the resource supports more CLOSIDs than another CAT/CDP resource. In
this case, the number of CLOSIDs that could be allocated are fewer than
the number of CLOSIDs that can be supported by the resource.
Limit the check of closids to that what is supported by the system based
on the minimum across all resources.
Fixes: 95f0b77ef ("x86/intel_rdt: Initialize new resource group with sane defaults")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-10-git-send-email-fenghua.yu@intel.com
It is possible for a resource group to consist out of MBA as well as
CAT/CDP resources. The "exclusive" resource mode only applies to the
CAT/CDP resources since MBA allocations cannot be specified to overlap
or not. When a user requests a resource group to become "exclusive" then it
can only be successful if there are CAT/CDP resources in the group
and none of their CBMs associated with the group's CLOSID overlaps with
any other resource group.
Fix the "exclusive" mode setting by failing if there isn't any CAT/CDP
resource in the group and ensuring that the CBM checking is only done on
CAT/CDP resources.
Fixes: 49f7b4efa ("x86/intel_rdt: Enable setting of exclusive mode")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-9-git-send-email-fenghua.yu@intel.com
A loop is used to check if a CAT resource's CBM of one CLOSID
overlaps with the CBM of another CLOSID of the same resource. The loop
is run over all CLOSIDs supported by the resource.
The problem with running the loop over all CLOSIDs supported by the
resource is that its number of supported CLOSIDs may be more than the
number of supported CLOSIDs on the system, which is the minimum number of
CLOSIDs supported across all resources.
Fix the loop to only consider the number of system supported CLOSIDs,
not all that are supported by the resource.
Fixes: 49f7b4efa ("x86/intel_rdt: Enable setting of exclusive mode")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-8-git-send-email-fenghua.yu@intel.com
A system supporting pseudo-locking may have MBA as well as CAT
resources of which only the CAT resources could support cache
pseudo-locking. When the schemata to be pseudo-locked is provided it
should be checked that that schemata does not attempt to pseudo-lock a
MBA resource.
Fixes: e0bdfe8e3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-7-git-send-email-fenghua.yu@intel.com
When a new resource group is created, it is initialized with sane
defaults that currently assume the resource being initialized is a CAT
resource. This code path is also followed by a MBA resource that is not
allocated the same as a CAT resource and as a result we encounter the
following unchecked MSR access error:
unchecked MSR access error: WRMSR to 0xd51 (tried to write 0x0000
000000000064) at rIP: 0xffffffffae059994 (native_write_msr+0x4/0x20)
Call Trace:
mba_wrmsr+0x41/0x80
update_domains+0x125/0x130
rdtgroup_mkdir+0x270/0x500
Fix the above by ensuring the initial allocation is only attempted on a
CAT resource.
Fixes: 95f0b77ef ("x86/intel_rdt: Initialize new resource group with sane defaults")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-6-git-send-email-fenghua.yu@intel.com
When multiple resources are managed by RDT, the number of CLOSIDs used
is the minimum of the CLOSIDs supported by each resource. In the function
rdt_bit_usage_show(), the annotated bitmask is created to depict how the
CAT supporting caches are being used. During this annotated bitmask
creation, each resource group is queried for its mode that is used as a
label in the annotated bitmask.
The maximum number of resource groups is currently assumed to be the
number of CLOSIDs supported by the resource for which the information is
being displayed. This is incorrect since the number of active CLOSIDs is
the minimum across all resources.
If information for a cache instance with more CLOSIDs than another is
being generated we thus encounter a warning like:
invalid mode for closid 8
WARNING: CPU: 88 PID: 1791 at [SNIP]/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
:827 rdt_bit_usage_show+0x221/0x2b0
Fix this by ensuring that only the number of supported CLOSIDs are
considered.
Fixes: e651901187 ("x86/intel_rdt: Introduce "bit_usage" to display cache allocations details")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-5-git-send-email-fenghua.yu@intel.com
The number of CLOSIDs supported by a system is the minimum number of
CLOSIDs supported by any of its resources. Care should be taken when
iterating over the CLOSIDs of a resource since it may be that the number
of CLOSIDs supported on the system is less than the number of CLOSIDs
supported by the resource.
Introduce a helper function that can be used to query the number of
CLOSIDs that is supported by all resources, irrespective of how many
CLOSIDs are supported by a particular resource.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-4-git-send-email-fenghua.yu@intel.com
Chen Yu reported a divide-by-zero error when accessing the 'size'
resctrl file when a MBA resource is enabled.
divide error: 0000 [#1] SMP PTI
CPU: 93 PID: 1929 Comm: cat Not tainted 4.19.0-rc2-debug-rdt+ #25
RIP: 0010:rdtgroup_cbm_to_size+0x7e/0xa0
Call Trace:
rdtgroup_size_show+0x11a/0x1d0
seq_read+0xd8/0x3b0
Quoting Chen Yu's report: This is because for MB resource, the
r->cache.cbm_len is zero, thus calculating size in rdtgroup_cbm_to_size()
will trigger the exception.
Fix this issue in the 'size' file by getting correct memory bandwidth value
which is in MBps when MBA software controller is enabled or in percentage
when MBA software controller is disabled.
Fixes: d9b48c86eb ("x86/intel_rdt: Display resource groups' allocations in bytes")
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20180904174614.26682-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/1537048707-76280-3-git-send-email-fenghua.yu@intel.com
Each resource is associated with a parsing callback to parse the data
provided from user space when writing schemata file.
The 'data' parameter in the callbacks is defined as a void pointer which
is error prone due to lack of type check.
parse_bw() processes the 'data' parameter as a string while its caller
actually passes the parameter as a pointer to struct rdt_cbm_parse_data.
Thus, parse_bw() takes wrong data and causes failure of parsing MBA
throttle value.
To fix the issue, the 'data' parameter in all parsing callbacks is defined
and handled as a pointer to struct rdt_parse_data (renamed from struct
rdt_cbm_parse_data).
Fixes: 7604df6e16 ("x86/intel_rdt: Support flexible data to parsing callbacks")
Fixes: 9ab9aa15c3 ("x86/intel_rdt: Ensure requested schemata respects mode")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Chen Yu" <yu.c.chen@intel.com>
Link: https://lkml.kernel.org/r/1537048707-76280-2-git-send-email-fenghua.yu@intel.com
Linux spreads out the non managed interrupt across the possible target CPUs
to avoid vector space exhaustion.
Managed interrupts are treated differently, as for them the vectors are
reserved (with guarantee) when the interrupt descriptors are initialized.
When the interrupt is requested a real vector is assigned. The assignment
logic uses the first CPU in the affinity mask for assignment. If the
interrupt has more than one CPU in the affinity mask, which happens when a
multi queue device has less queues than CPUs, then doing the same search as
for non managed interrupts makes sense as it puts the interrupt on the
least interrupt plagued CPU. For single CPU affine vectors that's obviously
a NOOP.
Restructre the matrix allocation code so it does the 'best CPU' search, add
the sanity check for an empty affinity mask and adapt the call site in the
x86 vector management code.
[ tglx: Added the empty mask check to the core and improved change log ]
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
Memory for host controller data structures is allocated local to the node
to which the controller is associated with. This has been the behaviour
since 965cd0e4a5 ("x86, PCI, ACPI: Use kmalloc_node() to optimize for
performance") where the node local allocation was added without additional
context.
Drop the node local allocation as there is no benefit from doing so - the
usage of these structures is independent from where the controller is
located.
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
As we augmented the regulator core to accept a GPIO descriptor instead
of a GPIO number, we can augment the fixed GPIO regulator to look up
and pass that descriptor directly from device tree or board GPIO
descriptor look up tables.
Some boards just auto-enumerate their fixed regulator platform devices
and I have assumed they get names like "fixed-regulator.0" but it's
pretty hard to guess this. I need some testing from board maintainers to
be sure. Other boards are straight forward, using just plain
"fixed-regulator" (ID -1) or "fixed-regulator.1" hammering down the
device ID.
It seems the da9055 and da9211 has never got around to actually passing
any enable gpio into its platform data (not the in-tree code anyway) so we
can just decide to simply pass a descriptor instead.
The fixed GPIO-controlled regulator in mach-pxa/ezx.c was confusingly named
"*_dummy_supply_device" while it is a very real device backed by a GPIO
line. There is nothing dummy about it at all, so I renamed it with the
infix *_regulator_* as part of this patch set.
Intel MID portions tested by Andy.
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> # Check the x86 BCM stuff
Acked-by: Tony Lindgren <tony@atomide.com> # OMAP1,2,3 maintainer
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Janusz Krzysztofik <jmkrzyszt@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
The recent removal of the memblock dependency from kvmclock caused a SEV
guest regression because the wall_clock and hv_clock_boot variables are
no longer mapped decrypted when SEV is active.
Use the __bss_decrypted attribute to put the static wall_clock and
hv_clock_boot in the .bss..decrypted section so that they are mapped
decrypted during boot.
In the preparatory stage of CPU hotplug, the per-cpu pvclock data pointer
assigns either an element of the static array or dynamically allocated
memory for the pvclock data pointer. The static array are now mapped
decrypted but the dynamically allocated memory is not mapped decrypted.
However, when SEV is active this memory range must be mapped decrypted.
Add a function which is called after the page allocator is up, and
allocate memory for the pvclock data pointers for the all possible cpus.
Map this memory range as decrypted when SEV is active.
Fixes: 368a540e02 ("x86/kvmclock: Remove memblock dependency")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/1536932759-12905-3-git-send-email-brijesh.singh@amd.com
kvmclock defines few static variables which are shared with the
hypervisor during the kvmclock initialization.
When SEV is active, memory is encrypted with a guest-specific key, and
if the guest OS wants to share the memory region with the hypervisor
then it must clear the C-bit before sharing it.
Currently, we use kernel_physical_mapping_init() to split large pages
before clearing the C-bit on shared pages. But it fails when called from
the kvmclock initialization (mainly because the memblock allocator is
not ready that early during boot).
Add a __bss_decrypted section attribute which can be used when defining
such shared variable. The so-defined variables will be placed in the
.bss..decrypted section. This section will be mapped with C=0 early
during boot.
The .bss..decrypted section has a big chunk of memory that may be unused
when memory encryption is not active, free it when memory encryption is
not active.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Radim Krčmář<rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
Pull x86 fixes from Ingol Molnar:
"Misc fixes:
- EFI crash fix
- Xen PV fixes
- do not allow PTI on 2-level 32-bit kernels for now
- documentation fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/APM: Fix build warning when PROC_FS is not enabled
Revert "x86/mm/legacy: Populate the user page-table with user pgd's"
x86/efi: Load fixmap GDT in efi_call_phys_epilog() before setting %cr3
x86/xen: Disable CPU0 hotplug for Xen PV
x86/EISA: Don't probe EISA bus for Xen PV guests
x86/doc: Fix Documentation/x86/earlyprintk.txt
Get rid of local @cpu variable which is unused in the
!CONFIG_IA32_EMULATION case.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1536806985-24197-1-git-send-email-zhongjiang@huawei.com
[ Clean up commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Fix build warning in apm_32.c when CONFIG_PROC_FS is not enabled:
../arch/x86/kernel/apm_32.c:1643:12: warning: 'proc_apm_show' defined but not used [-Wunused-function]
static int proc_apm_show(struct seq_file *m, void *v)
Fixes: 3f3942aca6 ("proc: introduce proc_create_single{,_data}")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jiri Kosina <jikos@kernel.org>
Link: https://lkml.kernel.org/r/be39ac12-44c2-4715-247f-4dcc3c525b8b@infradead.org
This reverts commit 1f40a46cf4.
It turned out that this patch is not sufficient to enable PTI on 32 bit
systems with legacy 2-level page-tables. In this paging mode the huge-page
PTEs are in the top-level page-table directory, where also the mirroring to
the user-space page-table happens. So every huge PTE exits twice, in the
kernel and in the user page-table.
That means that accessed/dirty bits need to be fetched from two PTEs in
this mode to be safe, but this is not trivial to implement because it needs
changes to generic code just for the sake of enabling PTI with 32-bit
legacy paging. As all systems that need PTI should support PAE anyway,
remove support for PTI when 32-bit legacy paging is used.
Fixes: 7757d607c6 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
This patch fixes gcmaes_crypt_by_sg so that it won't use memory
allocation if the data doesn't cross a page boundary.
Authenticated encryption may be used by dm-crypt. If the encryption or
decryption fails, it would result in I/O error and filesystem corruption.
The function gcmaes_crypt_by_sg is using GFP_ATOMIC allocation that can
fail anytime. This patch fixes the logic so that it won't attempt the
failing allocation if the data doesn't cross a page boundary.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It turns out OSXSAVE needs to be checked only for AVX, not for SSE.
Without this patch the affected modules refuse to load on CPUs with SSE2
but without AVX support.
Fixes: 877ccce7cb ("crypto: x86/aegis,morus - Fix and simplify CPUID checks")
Cc: <stable@vger.kernel.org> # 4.18
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Commit eeb89e2bb1 ("x86/efi: Load fixmap GDT in efi_call_phys_epilog()")
moved loading the fixmap in efi_call_phys_epilog() after load_cr3() since
it was assumed to be more logical.
Turns out this is incorrect: In efi_call_phys_prolog(), the gdt with its
physical address is loaded first, and when the %cr3 is reloaded in _epilog
from initial_page_table to swapper_pg_dir again the gdt is no longer
mapped. This results in a triple fault if an interrupt occurs after
load_cr3() and before load_fixmap_gdt(0). Calling load_fixmap_gdt(0) first
restores the execution order prior to commit eeb89e2bb1 and fixes the
problem.
Fixes: eeb89e2bb1 ("x86/efi: Load fixmap GDT in efi_call_phys_epilog()")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: linux-efi@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/1536689892-21538-1-git-send-email-linux@roeck-us.net
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
'pt_cap_group' is written to in pt_pmu_hw_init() and not modified after.
This makes it a suitable candidate for annotating as __ro_after_init.
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20180912164510.23444-1-zsm@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For userspace to tell the difference between an random signal
and an exception, the exception must include siginfo information.
Using SEND_SIG_FORCED for SIGSEGV is thus wrong, and it will result in
userspace seeing si_code == SI_USER (like a random signal) instead of
si_code == SI_KERNEL or a more specific si_code as all exceptions
deliver.
Therefore replace force_sig_info(SIGSEGV, SEND_SIG_FORCE, current)
with force_sig(SIG_SEGV, current) which gets this right and is shorter
and easier to type.
Fixes: 791eca1010 ("uretprobes/x86: Hijack return address")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When CONFIG_PARAVIRT_SPINLOCKS=n, it generates a warning:
arch/x86/kernel/paravirt_patch_64.c: In function ‘native_patch’:
arch/x86/kernel/paravirt_patch_64.c:89:1: warning: label ‘patch_site’ defined but not used [-Wunused-label]
patch_site:
... but those labels can simply be removed by directly calling the
respective functions there.
Get rid of local variables too, while at it. Also, simplify function
flow for better readability.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180911091510.GA12094@zn.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I use memcpy_flushcache() in my persistent memory driver for metadata
updates, there are many 8-byte and 16-byte updates and it turns out that
the overhead of memcpy_flushcache causes 2% performance degradation
compared to "movnti" instruction explicitly coded using inline assembler.
The tests were done on a Skylake processor with persistent memory emulated
using the "memmap" kernel parameter. dd was used to copy data to the
dm-writecache target.
This patch recognizes memcpy_flushcache calls with constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: device-mapper development <dm-devel@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1808081720460.24747@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's not used by its sole user, so remove this unused functionality.
Also remove a stray unused variable that GCC didn't warn about for some reason.
Suggested-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kirill.shutemov@linux.intel.com
Link: http://lkml.kernel.org/r/20180807015705.21697-1-fanc.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86_pmu_{format,events,attr,caps}_group is written to in
init_hw_perf_events and not modified after. This makes them suitable
candidates for annotating as __ro_after_init.
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: groeck@chromium.org
Link: http://lkml.kernel.org/r/20180810154314.96710-1-zsm@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
memory_corruption_check[{_period|_size}]()'s handlers do not check input
argument before passing it to kstrtoul() or simple_strtoull(). The argument
would be a NULL pointer if each of the kernel parameters, without its
value, is set in command line and thus cause the following panic.
PANIC: early exception 0xe3 IP 10:ffffffff73587c22 error 0 cr2 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18-rc8+ #2
[ 0.000000] RIP: 0010:kstrtoull+0x2/0x10
...
[ 0.000000] Call Trace
[ 0.000000] ? set_corruption_check+0x21/0x49
[ 0.000000] ? do_early_param+0x4d/0x82
[ 0.000000] ? parse_args+0x212/0x330
[ 0.000000] ? rdinit_setup+0x26/0x26
[ 0.000000] ? parse_early_options+0x20/0x23
[ 0.000000] ? rdinit_setup+0x26/0x26
[ 0.000000] ? parse_early_param+0x2d/0x39
[ 0.000000] ? setup_arch+0x2f7/0xbf4
[ 0.000000] ? start_kernel+0x5e/0x4c2
[ 0.000000] ? load_ucode_bsp+0x113/0x12f
[ 0.000000] ? secondary_startup_64+0xa5/0xb0
This patch adds checks to prevent the panic.
Signed-off-by: He Zhe <zhe.he@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: kstewart@linuxfoundation.org
Cc: pombredanne@nexb.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1534260823-87917-1-git-send-email-zhe.he@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Problem: perf did not show branch predicted/mispredicted bit in brstack.
Output of perf -F brstack for profile collected
Before:
0x4fdbcd/0x4fdc03/-/-/-/0
0x45f4c1/0x4fdba0/-/-/-/0
0x45f544/0x45f4bb/-/-/-/0
0x45f555/0x45f53c/-/-/-/0
0x7f66901cc24b/0x45f555/-/-/-/0
0x7f66901cc22e/0x7f66901cc23d/-/-/-/0
0x7f66901cc1ff/0x7f66901cc20f/-/-/-/0
0x7f66901cc1e8/0x7f66901cc1fc/-/-/-/0
After:
0x4fdbcd/0x4fdc03/P/-/-/0
0x45f4c1/0x4fdba0/P/-/-/0
0x45f544/0x45f4bb/P/-/-/0
0x45f555/0x45f53c/P/-/-/0
0x7f66901cc24b/0x45f555/P/-/-/0
0x7f66901cc22e/0x7f66901cc23d/P/-/-/0
0x7f66901cc1ff/0x7f66901cc20f/P/-/-/0
0x7f66901cc1e8/0x7f66901cc1fc/P/-/-/0
Cause:
As mentioned in Software Development Manual vol 3, 17.4.8.1,
IA32_PERF_CAPABILITIES[5:0] indicates the format of the address that is
stored in the LBR stack. Knights Landing reports 1 (LBR_FORMAT_LIP) as
its format. Despite that, registers containing FROM address of the branch,
do have MISPREDICT bit but because of the format indicated in
IA32_PERF_CAPABILITIES[5:0], LBR did not read MISPREDICT bit.
Solution:
Teach LBR about above Knights Landing quirk and make it read MISPREDICT bit.
Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180802013830.10600-1-jacekt@dugeo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Prevent multiplication result truncation on 32bit. Introduced with
the early timestamp reworrk.
- Ensure microcode revision storage to be consistent under all
circumstances
- Prevent write tearing of PTEs
- Prevent confusion of user and kernel reegisters when dumping fatal
signals verbosely
- Make an error return value in a failure path of the vector
allocation negative. Returning EINVAL might the caller assume
success and causes further wreckage.
- A trivial kernel doc warning fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Use WRITE_ONCE() when setting PTEs
x86/apic/vector: Make error return value negative
x86/process: Don't mix user/kernel regs in 64bit __show_regs()
x86/tsc: Prevent result truncation on 32bit
x86: Fix kernel-doc atomic.h warnings
x86/microcode: Update the new microcode revision unconditionally
x86/microcode: Make sure boot_cpu_data.microcode is up-to-date
When page-table entries are set, the compiler might optimize their
assignment by using multiple instructions to set the PTE. This might
turn into a security hazard if the user somehow manages to use the
interim PTE. L1TF does not make our lives easier, making even an interim
non-present PTE a security hazard.
Using WRITE_ONCE() to set PTEs and friends should prevent this potential
security hazard.
I skimmed the differences in the binary with and without this patch. The
differences are (obviously) greater when CONFIG_PARAVIRT=n as more
code optimizations are possible. For better and worse, the impact on the
binary with this patch is pretty small. Skimming the code did not cause
anything to jump out as a security hazard, but it seems that at least
move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
activate_managed() returns EINVAL instead of -EINVAL in case of
error. While this is unlikely to happen, the positive return value would
cause further malfunction at the call site.
Fixes: 2db1f959d9 ("x86/vector: Handle managed interrupts proper")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
In the non-trampoline SYSCALL64 path, a percpu variable is used to
temporarily store the user RSP value.
Instead of a separate variable, use the otherwise unused sp2 slot in the
TSS. This will improve cache locality, as the sp1 slot is already used in
the same code to find the kernel stack. It will also simplify a future
change to make the non-trampoline path work in PTI mode.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/08e769a0023dbad4bac6f34f3631dbaf8ad59f4f.1536015544.git.luto@kernel.org
The idtentry macro is complicated and magical. Document what it
does to help future readers and to allow future patches to adjust
the code and docs at the same time.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/6e56c3ad94879e41afe345750bc28ccc0e820ea8.1536015544.git.luto@kernel.org
Dan Carpenter reported that the untrusted data returns from kvm_register_read()
results in the following static checker warning:
arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
error: buffer underflow 'map->phys_map' 's32min-s32max'
KVM guest can easily trigger this by executing the following assembly sequence
in Ring0:
mov $10, %rax
mov $0xFFFFFFFF, %rbx
mov $0xFFFFFFFF, %rdx
mov $0, %rsi
vmcall
As this will cause KVM to execute the following code-path:
vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
which will reach out-of-bounds access.
This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id,
ignoring destinations that are not present and delivering the rest. We also check
whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the
max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm
unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Consider the case L1 had a IRQ/NMI event until it executed
VMLAUNCH/VMRESUME which wasn't delivered because it was disallowed
(e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
L0 needs to evaluate if this pending event should cause an exit from
L2 to L1 or delivered directly to L2 (e.g. In case L1 don't intercept
EXTERNAL_INTERRUPT).
Usually this would be handled by L0 requesting a IRQ/NMI window
by setting VMCS accordingly. However, this setting was done on
VMCS01 and now VMCS02 is active instead. Thus, when L1 executes
VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by
requesting a KVM_REQ_EVENT.
Note that above scenario exists when L1 KVM is about to enter L2 but
requests an "immediate-exit". As in this case, L1 will
disable-interrupts and then send a self-IPI before entering L2.
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
- Fix a VFP corruption in 32-bit guest
- Add missing cache invalidation for CoW pages
- Two small cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJbkngmAAoJEEtpOizt6ddyeaoH/15bbGHlwWf23tGjSoDzhyD4
zAXfy+SJdm4cR8K7jEkVrNffkEMAby7Zl28hTHKB9jsY1K8DD+EuCE3Nd4kkVAsc
iHJwV4aiHil/zC5SyE0MqMzELeS8UhsxESYebG6yNF0ElQDQ0SG+QAFr47/OBN9S
u4I7x0rhyJP6Kg8z9U4KtEX0hM6C7VVunGWu44/xZSAecTaMuJnItCIM4UMdEkSs
xpAoI59lwM6BWrXLvEunekAkxEXoR7AVpQER2PDINoLK2I0i0oavhPim9Xdt2ZXs
rqQqfmwmPOVvYbexDp97JtfWo3/psGLqvgoK1tq9bzF3u6Y3ylnUK5IspyVYwuQ=
=TK8A
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-fixes-for-v4.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
Fixes for KVM/ARM for Linux v4.19 v2:
- Fix a VFP corruption in 32-bit guest
- Add missing cache invalidation for CoW pages
- Two small cleanups
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.
Fixes: fb1522e099 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan <jhogan@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
The PARAVIRT_XXL changes introduced a redefinition of SAVE_FLAGS under
certain configurations. Cure it
Fixes: 6da63eb241 ("x86/paravirt: Move the pv_irq_ops under the PARAVIRT_XXL umbrella").
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180905053720.13710-1-jgross@suse.com
When the kernel.print-fatal-signals sysctl has been enabled, a simple
userspace crash will cause the kernel to write a crash dump that contains,
among other things, the kernel gsbase into dmesg.
As suggested by Andy, limit output to pt_regs, FS_BASE and KERNEL_GS_BASE
in this case.
This also moves the bitness-specific logic from show_regs() into
process_{32,64}.c.
Fixes: 45807a1df9 ("vdso: print fatal signals")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180831194151.123586-1-jannh@google.com
Loops per jiffy is calculated by multiplying tsc_khz with 1e3 and then
dividing it by HZ.
Both tsc_khz and the temporary variable holding the multiplication result
are of type unsigned long, so on 32bit the result is truncated to the lower
32bit.
Use u64 as type for the temporary variable and cast tsc_khz to it before
multiplying.
[ tglx: Massaged changelog and removed pointless braces ]
Fixes: cf7a63ef4e ("x86/tsc: Calibrate tsc only once")
Signed-off-by: Chuanhua Lei <chuanhua.lei@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yixin.zhu@linux.intel.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@microsoft.com>
Cc: Rajvi Jingar <rajvi.jingar@intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Link: https://lkml.kernel.org/r/1536228203-18701-1-git-send-email-chuanhua.lei@linux.intel.com
The STACKLEAK feature (initially developed by PaX Team) has the following
benefits:
1. Reduces the information that can be revealed through kernel stack leak
bugs. The idea of erasing the thread stack at the end of syscalls is
similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
crypto, which all comply with FDP_RIP.2 (Full Residual Information
Protection) of the Common Criteria standard.
2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
CVE-2010-2963). That kind of bugs should be killed by improving C
compilers in future, which might take a long time.
This commit introduces the code filling the used part of the kernel
stack with a poison value before returning to userspace. Full
STACKLEAK feature also contains the gcc plugin which comes in a
separate commit.
The STACKLEAK feature is ported from grsecurity/PaX. More information at:
https://grsecurity.net/https://pax.grsecurity.net/
This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.
Performance impact:
Hardware: Intel Core i7-4770, 16 GB RAM
Test #1: building the Linux kernel on a single core
0.91% slowdown
Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
4.2% slowdown
So the STACKLEAK description in Kconfig includes: "The tradeoff is the
performance impact: on a single CPU system kernel compilation sees a 1%
slowdown, other systems and workloads may vary and you are advised to
test this feature on your expected workload before deploying it".
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
The TS5500 GPIO driver apparently supports platform data
without making any use of it whatsoever. Delete this code,
last chance to speak up if you think it is needed.
Cc: kernel@savoirfairelinux.com
Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Cc: Jerome Oufella <jerome.oufella@savoirfairelinux.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
As it turns out, the AVX2 multibuffer SHA routines are currently
broken [0], in a way that would have likely been noticed if this
code were in wide use. Since the code is too complicated to be
maintained by anyone except the original authors, and since the
performance benefits for real-world use cases are debatable to
begin with, it is better to drop it entirely for the moment.
[0] https://marc.info/?l=linux-crypto-vger&m=153476243825350&w=2
Suggested-by: Eric Biggers <ebiggers@google.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
All functions in arch/x86/xen/irq.c and arch/x86/xen/xen-asm*.S are
specific to PV guests. Include them in the kernel with CONFIG_XEN_PV only.
Make the PV specific code in arch/x86/entry/entry_*.S dependent on
CONFIG_XEN_PV instead of CONFIG_XEN.
The HVM specific code should depend on CONFIG_XEN_PVHVM.
While at it reformat the Makefile to make it more readable.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: xen-devel@lists.xenproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180828074026.820-2-jgross@suse.com
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:
- the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
access_ok() checks")
- the sg/bsg read/write APIs
- the infiniband read/write APIs
These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().
Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
This is preparation for looking at trap number and fault address in the
handlers for uaccess errors. No functional change.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-kernel@vger.kernel.org
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-6-jannh@google.com
Currently, most fixups for attempting to access userspace memory are
handled using _ASM_EXTABLE, which is also used for various other types of
fixups (e.g. safe MSR access, IRET failures, and a bunch of other things).
In order to make it possible to add special safety checks to uaccess fixups
(in particular, checking whether the fault address is actually in
userspace), introduce a new exception table handler ex_handler_uaccess()
and wire it up to all the user access fixups (excluding ones that
already use _ASM_EXTABLE_EX).
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-5-jannh@google.com
This removes the call into exception fixup that was added in commit
c28f896634 ("[PATCH] kprobes: fix broken fault handling for x86_64").
On X86, kprobe_fault_handler() is called from two places:
do_general_protection() (for #GP) and kprobes_fault() (for #PF). In both
paths, the fixup_exception() call in the kprobe fault handler is redundant.
In case of #GP, fixup_exception() is called immediately before
kprobe_fault_handler() is invoked, so no need to try that again. This
assumes that the kprobe's fault handler isn't going to do something crazy
like changing RIP so that it suddenly points to an instruction that does
userspace access.
For #PF on a kernel address from kernel space, after the kprobe fault
handler has run, no_context() is invoked, which calls fixup_exception().
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-kernel@vger.kernel.org
Cc: dvyukov@google.com
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-4-jannh@google.com
The opaque plumbing of #GP from do_general_protection() through
notify_die() into kprobe_exceptions_notify() makes it hard to understand
what's going on.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-3-jannh@google.com
This is an extension of commit b506a9d08b ("x86: code clarification patch
to Kprobes arch code"). As that commit explains, even though
kprobe_running() can't be called with preemption enabled, preemption does
not need to be disabled. If preemption is enabled, then this can't be
originate from a kprobe.
Also, use X86_TRAP_PF instead of 14.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-2-jannh@google.com
Fix kernel-doc warnings in arch/x86/include/asm/atomic.h that are caused by
having a #define macro between the kernel-doc notation and the function
name. Fixed by moving the #define macro to after the function
implementation.
Make the same change for atomic64_{32,64}.h for consistency even though
there were no kernel-doc warnings found in these header files, but there
would be if they were used in generation of documentation.
Fixes these kernel-doc warnings:
../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'i' description in 'arch_atomic_sub_and_test'
../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'v' description in 'arch_atomic_sub_and_test'
../arch/x86/include/asm/atomic.h:96: warning: Excess function parameter 'v' description in 'arch_atomic_inc'
../arch/x86/include/asm/atomic.h:109: warning: Excess function parameter 'v' description in 'arch_atomic_dec'
../arch/x86/include/asm/atomic.h:124: warning: Excess function parameter 'v' description in 'arch_atomic_dec_and_test'
../arch/x86/include/asm/atomic.h:138: warning: Excess function parameter 'v' description in 'arch_atomic_inc_and_test'
../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'i' description in 'arch_atomic_add_negative'
../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'v' description in 'arch_atomic_add_negative'
Fixes: 18cc1814d4 ("atomics/treewide: Make test ops optional")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/0a1e678d-c8c5-b32c-2640-ed4e94d399d2@infradead.org
Pull x86 fixes from Thomas Gleixner:
"Speculation:
- Make the microcode check more robust
- Make the L1TF memory limit depend on the internal cache physical
address space and not on the CPUID advertised physical address
space, which might be significantly smaller. This avoids disabling
L1TF on machines which utilize the full physical address space.
- Fix the GDT mapping for EFI calls on 32bit PTI
- Fix the MCE nospec implementation to prevent #GP
Fixes and robustness:
- Use the proper operand order for LSL in the VDSO
- Prevent NMI uaccess race against CR3 switching
- Add a lockdep check to verify that text_mutex is held in
text_poke() functions
- Repair the fallout of giving native_restore_fl() a prototype
- Prevent kernel memory dumps based on usermode RIP
- Wipe KASAN shadow stack before rewinding the stack to prevent false
positives
- Move the AMS GOTO enforcement to the actual build stage to allow
user API header extraction without a compiler
- Fix a section mismatch introduced by the on demand VDSO mapping
change
Miscellaneous:
- Trivial typo, GCC quirk removal and CC_SET/OUT() cleanups"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pti: Fix section mismatch warning/error
x86/vdso: Fix lsl operand order
x86/mce: Fix set_mce_nospec() to avoid #GP fault
x86/efi: Load fixmap GDT in efi_call_phys_epilog()
x86/nmi: Fix NMI uaccess race against CR3 switching
x86: Allow generating user-space headers without a compiler
x86/dumpstack: Don't dump kernel memory based on usermode RIP
x86/asm: Use CC_SET()/CC_OUT() in __gen_sigismember()
x86/alternatives: Lockdep-enforce text_mutex in text_poke*()
x86/entry/64: Wipe KASAN stack shadow before rewind_stack_do_exit()
x86/irqflags: Mark native_restore_fl extern inline
x86/build: Remove jump label quirk for GCC older than 4.5.2
x86/Kconfig: Fix trivial typo
x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+
x86/spectre: Add missing family 6 check to microcode check
Handle the case where microcode gets loaded on the BSP's hyperthread
sibling first and the boot_cpu_data's microcode revision doesn't get
updated because of early exit due to the siblings sharing a microcode
engine.
For that, simply write the updated revision on all CPUs unconditionally.
Signed-off-by: Filippo Sironi <sironi@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: prarit@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1533050970-14385-1-git-send-email-sironi@amazon.de
When preparing an MCE record for logging, boot_cpu_data.microcode is used
to read out the microcode revision on the box.
However, on systems where late microcode update has happened, the microcode
revision output in a MCE log record is wrong because
boot_cpu_data.microcode is not updated when the microcode gets updated.
But, the microcode revision saved in boot_cpu_data's microcode member
should be kept up-to-date, regardless, for consistency.
Make it so.
Fixes: fa94d0c6e0 ("x86/MCE: Save microcode revision in machine check records")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: sironi@amazon.de
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180731112739.32338-1-prarit@redhat.com
The microcode revision is already readable for non-root users via
/proc/cpuinfo. Thus, there's no reason to keep the same information
readable by root only in /sys/devices/system/cpu/cpuX/microcode/.
Make .../processor_flags world-readable too, while at it.
Reported-by: Tim Burgess <timb@dug.com>
Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180825035039.14409-1-jacekt@dugeo.com
Fix the section mismatch warning in arch/x86/mm/pti.c:
WARNING: vmlinux.o(.text+0x6972a): Section mismatch in reference from the function pti_clone_pgtable() to the function .init.text:pti_user_pagetable_walk_pte()
The function pti_clone_pgtable() references
the function __init pti_user_pagetable_walk_pte().
This is often because pti_clone_pgtable lacks a __init
annotation or the annotation of pti_user_pagetable_walk_pte is wrong.
FATAL: modpost: Section mismatches detected.
Fixes: 85900ea515 ("x86/pti: Map the vsyscall page if needed")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/43a6d6a3-d69d-5eda-da09-0b1c88215a2a@infradead.org
In the __getcpu function, lsl is using the wrong target and destination
registers. Luckily, the compiler tends to choose %eax for both variables,
so it has been working so far.
Fixes: a582c540ac ("x86/vdso: Use RDPID in preference to LSL when available")
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180901201452.27828-1-sneves@dei.uc.pt
The trick with flipping bit 63 to avoid loading the address of the 1:1
mapping of the poisoned page while the 1:1 map is updated used to work when
unmapping the page. But it falls down horribly when attempting to directly
set the page as uncacheable.
The problem is that when the cache mode is changed to uncachable, the pages
needs to be flushed from the cache first. But the decoy address is
non-canonical due to bit 63 flipped, and the CLFLUSH instruction throws a
#GP fault.
Add code to change_page_attr_set_clr() to fix the address before calling
flush.
Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Link: https://lkml.kernel.org/r/20180831165506.GA9605@agluck-desk
When PTI is enabled on x86-32 the kernel uses the GDT mapped in the fixmap
for the simple reason that this address is also mapped for user-space.
The efi_call_phys_prolog()/efi_call_phys_epilog() wrappers change the GDT
to call EFI runtime services and switch back to the kernel GDT when they
return. But the switch-back uses the writable GDT, not the fixmap GDT.
When that happened and and the CPU returns to user-space it switches to the
user %cr3 and tries to restore user segment registers. This fails because
the writable GDT is not mapped in the user page-table, and without a GDT
the fault handlers also can't be launched. The result is a triple fault and
reboot of the machine.
Fix that by restoring the GDT back to the fixmap GDT which is also mapped
in the user page-table.
Fixes: 7757d607c6 x86/pti: ('Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: hpa@zytor.com
Cc: linux-efi@vger.kernel.org
Link: https://lkml.kernel.org/r/1535702738-10971-1-git-send-email-joro@8bytes.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW4lM6AAKCRCAXGG7T9hj
vs8AAQDysFccg97UdopW3B7yklIaRqkfEIAsxe65f191MXsH2AEAp5SKxZqRPqBP
a9WHDj8ShB3BhZ/IxpdO9Y59U3Jo4wA=
=Gt4c
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- minor cleanup avoiding a warning when building with new gcc
- a patch to add a new sysfs node for Xen frontend/backend drivers to
make it easier to obtain the state of a pv device
- two fixes for 32-bit pv-guests to avoid intermediate L1TF vulnerable
PTEs
* tag 'for-linus-4.19b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove redundant variable save_pud
xen: export device state to sysfs
x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
x86/xen: don't write ptes directly in 32-bit PV guests
A NMI can hit in the middle of context switching or in the middle of
switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
which could cause copy_from_user_nmi() and friends to read the wrong
memory.
Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org
When bootstrapping an architecture, it's usual to generate the kernel's
user-space headers (make headers_install) before building a compiler. Move
the compiler check (for asm goto support) to the archprepare target so that
it is only done when building code for the target.
Fixes: e501ce957a ("x86: Force asm-goto")
Reported-by: Helmut Grohne <helmutg@debian.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180829194317.GA4765@decadent.org.uk
show_opcodes() is used both for dumping kernel instructions and for dumping
user instructions. If userspace causes #PF by jumping to a kernel address,
show_opcodes() can be reached with regs->ip controlled by the user,
pointing to kernel code. Make sure that userspace can't trick us into
dumping kernel memory into dmesg.
Fixes: 7cccf0725c ("x86/dumpstack: Add a show_ip() function")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: security@kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180828154901.112726-1-jannh@google.com
Allowing x86_emulate_instruction() to be called directly has led to
subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
in the emulation type. While most of the blame lies on re-execute
being opt-out, exporting x86_emulate_instruction() also exposes its
cr2 parameter, which may have contributed to commit d391f12070
("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
when running nested") using x86_emulate_instruction() instead of
emulate_instruction() because "hey, I have a cr2!", which in turn
introduced its EMULTYPE_NO_REEXECUTE bug.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
specific function, and there's no conflict that prevents adding the
kvm_ prefix.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Commit a6f177efaa ("KVM: Reenter guest after emulation failure if
due to access to non-mmio address") added reexecute_instruction() to
handle the scenario where two (or more) vCPUS race to write a shadowed
page, i.e. reexecute_instruction() is intended to return true if and
only if the instruction being emulated was accessing a shadowed page.
As L0 is only explicitly shadowing L1 tables, an emulation failure of
a nested VM instruction cannot be due to a race to write a shadowed
page and so should never be re-executed.
This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
but guaranteed to result in an infinite loop when TDP is enabled.
Because "cr2" is actually an L2 GPA when TDP is enabled, calling
kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
and cause reexecute_instruction() to immediately return true. The
!mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
case because mmio_info_in_cache() returns false for a nested MMU (the
MMIO caching currently handles L1 only, e.g. to cache nested guests'
GPAs we'd have to manually flush the cache when switching between
VMs and when L1 updated its page tables controlling the nested guest).
Way back when, commit 68be080345 ("KVM: x86: never re-execute
instruction with enabled tdp") changed reexecute_instruction() to
always return false when using TDP under the assumption that KVM would
only get into the emulator for MMIO. Commit 95b3cf69bd ("KVM: x86:
let reexecute_instruction work for tdp") effectively reverted that
behavior in order to handle the scenario where emulation failed due to
an access from L1 to the shadow page tables for L2, but it didn't
account for the case where emulation failed in L2 with TDP enabled.
All of the above logic also applies to retry_instruction(), added by
commit 1cb3f3ae5a ("KVM: x86: retry non-page-table writing
instructions"). An indefinite loop in retry_instruction() should be
impossible as it protects against retrying the same instruction over
and over, but it's still correct to not retry an L2 instruction in
the first place.
Fix the immediate issue by adding a check for a nested guest when
determining whether or not to allow retry in kvm_mmu_page_fault().
In addition to fixing the immediate bug, add WARN_ON_ONCE in the
retry functions since they are not designed to handle nested cases,
i.e. they need to be modified even if there is some scenario in the
future where we want to allow retrying a nested guest.
[1] This issue was encountered after commit 3a2936dedd ("kvm: mmu:
Don't expose private memslots to L2") changed the page fault path
to return KVM_PFN_NOSLOT when translating an L2 access to a
prive memslot. Returning KVM_PFN_NOSLOT is semantically correct
when we want to hide a memslot from L2, i.e. there effectively is
no defined memory region for L2, but it has the unfortunate side
effect of making KVM think the GFN is a MMIO page, thus triggering
emulation. The failure occurred with in-development code that
deliberately exposed a private memslot to L2, which L2 accessed
with an instruction that is not emulated by KVM.
Fixes: 95b3cf69bd ("KVM: x86: let reexecute_instruction work for tdp")
Fixes: 1cb3f3ae5a ("KVM: x86: retry non-page-table writing instructions")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Effectively force kvm_mmu_page_fault() to opt-in to allowing retry to
make it more obvious when and why it allows emulation to be retried.
Previously this approach was less convenient due to retry and
re-execute behavior being controlled by separate flags that were also
inverted in their implementations (opt-in versus opt-out).
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
retry_instruction() and reexecute_instruction() are a package deal,
i.e. there is no scenario where one is allowed and the other is not.
Merge their controlling emulation type flags to enforce this in code.
Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
clear that we are allowing re{try,execute} to occur, as opposed to
explicitly requesting retry of a previously failed instruction.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Re-execution of an instruction after emulation decode failure is
intended to be used only when emulating shadow page accesses. Invert
the flag to make allowing re-execution opt-in since that behavior is
by far in the minority.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Re-execution after an emulation decode failure is only intended to
handle a case where two or vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of RSM emulation.
Add a new helper, kvm_emulate_instruction_from_buffer(), to support
emulating from a pre-defined buffer. This eliminates the last direct
call to x86_emulate_instruction() outside of kvm_mmu_page_fault(),
which means x86_emulate_instruction() can be unexported in a future
patch.
Fixes: 7607b71744 ("KVM: SVM: install RSM intercept")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Re-execution after an emulation decode failure is only intended to
handle a case where two or vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of MMIO emulation.
As handle_ept_misconfig() is only used for MMIO emulation, it should
pass EMULTYPE_NO_REEXECUTE when using the emulator to skip an instr
in the fast-MMIO case where VM_EXIT_INSTRUCTION_LEN is invalid.
And because the cr2 value passed to x86_emulate_instruction() is only
destined for use when retrying or reexecuting, we can simply call
emulate_instruction().
Fixes: d391f12070 ("x86/kvm/vmx: do not use vm-exit instruction length
for fast MMIO when running nested")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Variable dst_vaddr_end is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
variable 'dst_vaddr_end' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
nested_run_pending is set 20 lines above and check_vmentry_prereqs()/
check_vmentry_postreqs() don't seem to be resetting it (the later, however,
checks it).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
text_poke() and text_poke_bp() must be called with text_mutex held.
Put proper lockdep anotation in place instead of just mentioning the
requirement in a comment.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1808280853520.25787@cbobk.fhfr.pm
Reset the KASAN shadow state of the task stack before rewinding RSP.
Without this, a kernel oops will leave parts of the stack poisoned, and
code running under do_exit() can trip over such poisoned regions and cause
nonsensical false-positive KASAN reports about stack-out-of-bounds bugs.
This does not wipe the exception stacks; if an oops happens on an exception
stack, it might result in random KASAN false-positives from other tasks
afterwards. This is probably relatively uninteresting, since if the kernel
oopses on an exception stack, there are most likely bigger things to worry
about. It'd be more interesting if vmapped stacks and KASAN were
compatible, since then handle_stack_overflow() would oops from exception
stack context.
Fixes: 2deb4be280 ("x86/dumpstack: When OOPSing, rewind the stack before do_exit()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: kasan-dev@googlegroups.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180828184033.93712-1-jannh@google.com
This should have been marked extern inline in order to pick up the out
of line definition in arch/x86/kernel/irqflags.S.
Fixes: 208cbb3255 ("x86/irqflags: Provide a declaration for native_save_fl")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180827214011.55428-1-ndesaulniers@google.com
Commit cafa0010cd ("Raise the minimum required gcc version to 4.6")
bumped the minimum GCC version to 4.6 for all architectures.
Remove the workaround code.
It was the only user of cc-if-fullversion. Remove the macro as well.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/1535348714-25457-1-git-send-email-yamada.masahiro@socionext.com
Pull crypto fixes from Herbert Xu:
- Check for the right CPU feature bit in sm4-ce on arm64.
- Fix scatterwalk WARN_ON in aes-gcm-ce on arm64.
- Fix unaligned fault in aesni on x86.
- Fix potential NULL pointer dereference on exit in chtls.
- Fix DMA mapping direction for RSA in caam.
- Fix error path return value for xts setkey in caam.
- Fix address endianness when DMA unmapping in caam.
- Fix sleep-in-atomic in vmx.
- Fix command corruption when queue is full in cavium/nitrox.
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: cavium/nitrox - fix for command corruption in queue full case with backlog submissions.
crypto: vmx - Fix sleep-in-atomic bugs
crypto: arm64/aes-gcm-ce - fix scatterwalk API violation
crypto: aesni - Use unaligned loads from gcm_context_data
crypto: chtls - fix null dereference chtls_free_uld()
crypto: arm64/sm4-ce - check for the right CPU feature bit
crypto: caam - fix DMA mapping direction for RSA forms 2 & 3
crypto: caam/qi - fix error path in xts setkey
crypto: caam/jr - fix descriptor DMA unmapping
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).
As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.
Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.
Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.
There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
While converting compat system call handlers to work on 32-bit
architectures, I found a number of types used in those handlers
that are identical between all architectures.
Let's move all the identical ones into asm-generic/compat.h to avoid
having to add even more identical definitions of those types.
For unknown reasons, mips defines __compat_gid32_t, __compat_uid32_t
and compat_caddr_t as signed, while all others have them unsigned.
This seems to be a mistake, but I'm leaving it alone here. The other
types all differ by size or alignment on at least on architecture.
compat_aio_context_t is currently defined in linux/compat.h but
also needed for compat_sys_io_getevents(), so let's move it into
the same place.
While we still have not decided whether the 32-bit time handling
will always use the compat syscalls, or in which form, I think this
is a useful cleanup that we can merge regardless.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
other things.
We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Variable save_pud is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
variable 'save_pud' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Using only 32-bit writes for the pte will result in an intermediate
L1TF vulnerable PTE. When running as a Xen PV guest this will at once
switch the guest to shadow mode resulting in a loss of performance.
Use arch_atomic64_xchg() instead which will perform the requested
operation atomically with all 64 bits.
Some performance considerations according to:
https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf
The main number should be the latency, as there is no tight loop around
native_ptep_get_and_clear().
"lock cmpxchg8b" has a latency of 20 cycles, while "lock xchg" (with a
memory operand) isn't mentioned in that document. "lock xadd" (with xadd
having 3 cycles less latency than xchg) has a latency of 11, so we can
assume a latency of 14 for "lock xchg".
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
In some cases 32-bit PAE PV guests still write PTEs directly instead of
using hypercalls. This is especially bad when clearing a PTE as this is
done via 32-bit writes which will produce intermediate L1TF attackable
PTEs.
Change the code to use hypercalls instead.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
On Nehalem and newer core CPUs the CPU cache internally uses 44 bits
physical address space. The L1TF workaround is limited by this internal
cache address width, and needs to have one bit free there for the
mitigation to work.
Older client systems report only 36bit physical address space so the range
check decides that L1TF is not mitigated for a 36bit phys/32GB system with
some memory holes.
But since these actually have the larger internal cache width this warning
is bogus because it would only really be needed if the system had more than
43bits of memory.
Add a new internal x86_cache_bits field. Normally it is the same as the
physical bits field reported by CPUID, but for Nehalem and newerforce it to
be at least 44bits.
Change the L1TF memory size warning to use the new cache_bits field to
avoid bogus warnings and remove the bogus comment about memory size.
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Michael Hocko <mhocko@suse.com>
Cc: vbabka@suse.cz
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180824170351.34874-1-andi@firstfloor.org
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
Pull x86 fixes from Thomas Gleixner:
- Correct the L1TF fallout on 32bit and the off by one in the 'too much
RAM for protection' calculation.
- Add a helpful kernel message for the 'too much RAM' case
- Unbreak the VDSO in case that the compiler desides to use indirect
jumps/calls and emits retpolines which cannot be resolved because the
kernel uses its own thunks, which does not work for the VDSO. Make it
use the builtin thunks.
- Re-export start_thread() which was unexported when the 32/64bit
implementation was unified. start_thread() is required by modular
binfmt handlers.
- Trivial cleanups
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/l1tf: Suggest what to do on systems with too much RAM
x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM
x86/kvm/vmx: Remove duplicate l1d flush definitions
x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
x86/process: Re-export start_thread()
x86/mce: Add notifier_block forward declaration
x86/vdso: Fix vDSO build if a retpoline is emitted
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
- add build_{menu,n,g,x}config targets for compile-testing Kconfig
- fix and improve recursive dependency detection in Kconfig
- fix parallel building of menuconfig/nconfig
- fix syntax error in clang-version.sh
- suppress distracting log from syncconfig
- remove obsolete "rpm" target
- remove VMLINUX_SYMBOL(_STR) macro entirely
- fix microblaze build with CONFIG_DYNAMIC_FTRACE
- move compiler test for dead code/data elimination to Kconfig
- rename well-known LDFLAGS variable to KBUILD_LDFLAGS
- misc fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbgYhCAAoJED2LAQed4NsGErAP/jt7gt76+N0PZmADBZqyVR/H
4k286g3OiT7DIcdvwqE5BRvu+zNOamDujnnXw63/jwu2RjrkLX/JnhzTbC0IZleZ
KeO4bU4ZH0WFa0Ny9pp0LAnzbXGMnQjDXygcUd5BFoEd5JSLKW2PISEEjRh6b5B7
swJRdgySFaMrUBRNf13FwH5EvX/D0xZQe/wFhFCOv6L4gJZFMmpGUIepgTjTUmxZ
wcNN6xxXg+ulLHVcPdPQ9EYssNHN5xNys02+IdIrhhXuNHji/TFm4dGYuU+dDGeE
Eu4O6Qs7pg0PFGrZ5gLxXDJEp75W+uaTNOqV+jcjq8MRxJuWxyy2biUeelKRT/KH
0iv4ZQJVOMOhl8fZgLtQaXHyQ++5uwd6kvPPf+XFdkogGAIXK0wKWLoALFEOXwb6
z1BBnFx09LrKPGt0ZlKX624OEczedv/UAFiSh3Ic2S3PFEpq4oHrEGhTnyKRobPv
OEcF3RqKjmAdK7PLy4kVpTLhkutkWWhw6Giy9qXUkXYJWonJR7NTQ1mIan2LoGZC
sGi+qKae/8xgO2Nerx59tZpkiHYTMfYeAo8frzWurOxm3YzEfaxNNGPl+IMW7VKz
cNPzQZ5tMUy4i4PAhk/gIWibnUTPfjDbWsZSMtIbO0GFcao56EvllwD8/awuy7lO
QkaAeZHFcF+qgU3muaYK
=Vsb2
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- add build_{menu,n,g,x}config targets for compile-testing Kconfig
- fix and improve recursive dependency detection in Kconfig
- fix parallel building of menuconfig/nconfig
- fix syntax error in clang-version.sh
- suppress distracting log from syncconfig
- remove obsolete "rpm" target
- remove VMLINUX_SYMBOL(_STR) macro entirely
- fix microblaze build with CONFIG_DYNAMIC_FTRACE
- move compiler test for dead code/data elimination to Kconfig
- rename well-known LDFLAGS variable to KBUILD_LDFLAGS
- misc fixes and cleanups
* tag 'kbuild-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: rename LDFLAGS to KBUILD_LDFLAGS
kbuild: pass LDFLAGS to recordmcount.pl
kbuild: test dead code/data elimination support in Kconfig
initramfs: move gen_initramfs_list.sh from scripts/ to usr/
vmlinux.lds.h: remove stale <linux/export.h> include
export.h: remove VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR()
Coccinelle: remove pci_alloc_consistent semantic to detect in zalloc-simple.cocci
kbuild: make sorting initramfs contents independent of locale
kbuild: remove "rpm" target, which is alias of "rpm-pkg"
kbuild: Fix LOADLIBES rename in Documentation/kbuild/makefiles.txt
kconfig: suppress "configuration written to .config" for syncconfig
kconfig: fix "Can't open ..." in parallel build
kbuild: Add a space after `!` to prevent parsing as file pattern
scripts: modpost: check memory allocation results
kconfig: improve the recursive dependency report
kconfig: report recursive dependency involving 'imply'
kconfig: error out when seeing recursive dependency
kconfig: add build-only configurator targets
scripts/dtc: consolidate include path options in Makefile
A regression was reported bisecting to 1476db2d12
"Move HashKey computation from stack to gcm_context". That diff
moved HashKey computation from the stack, which was explicitly aligned
in the asm, to a struct provided from the C code, depending on
AESNI_ALIGN_ATTR for alignment. It appears some compilers may not
align this struct correctly, resulting in a crash on the movdqa
instruction when attempting to encrypt or decrypt data.
Fix by using unaligned loads for the HashKeys. On modern
hardware there is no perf difference between the unaligned and
aligned loads. All other accesses to gcm_context_data already use
unaligned loads.
Reported-by: Mauro Rossi <issor.oruam@gmail.com>
Fixes: 1476db2d12 ("Move HashKey computation from stack to gcm_context")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Including:
- PASID table handling updates for the Intel VT-d driver. It
implements a global PASID space now so that applications
usings multiple devices will just have one PASID.
- A new config option to make iommu passthroug mode the default.
- New sysfs attribute for iommu groups to export the type of the
default domain.
- A debugfs interface (for debug only) usable by IOMMU drivers
to export internals to user-space.
- R-Car Gen3 SoCs support for the ipmmu-vmsa driver
- The ARM-SMMU now aborts transactions from unknown devices and
devices not attached to any domain.
- Various cleanups and smaller fixes all over the place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJbf/9wAAoJECvwRC2XARrjcuYP/3dIsOFN7Xb4sTOB5wxk4wmD
2Rm5o/18cFekEy4M8fwIBCYkzH/McohgKbOFcH6XiCxIwJ5RdXzITLAwmp4PbvIO
KtwppXSp+MQtboip/bp6NDNBhABErgUtgdXawwENCCrFivXDsB8W4wnXESAOkLv9
4fLXrUgDFCAquLZpLqQobXHhajtGAkSekaasphlhejXFulFyF1YcEUcliU7eXZ0R
rZjL4Zqcyyi5kv6d3WhL+tvmmhr7wfMsMPaW18eRf9tXvMpWRM2GOAj65coI2AWs
1T1kW/jvvrxnewOsmo1nYlw7R07uiRkUfHmJ9tY65xW4120HJFhdFLPUQZXfrX/b
wcGbheYIh6cwAaZBtPJ35bPeW6pREkDOShohbzt45T62Q837cBkr3zyHhNsoOXHS
13YVtTd2vtPa4iLdu2qmEOC1OuhQnMvqHqX0iN8U74QbDxEYYvMfAdx0JL3hmPp/
uynY3QmXIKCeZg+vH2qcWHm07nfaAr5y8WSPA0crnqeznD5zJ4kvJf5dFGmDyTKr
pyTkhidkifm6ZejrJsDZveoZdLpHrOatrqKaoLFh2crMUG3d807NYqQ3JmA3NDjg
zPbYyU4joFGNVjd3XkSnRTGxR6YvLIwNbkQ3b/K/B5AqWJ6VrTbbTCOa4GSms6rF
Qm8wRrmYaycKxkcMqtls
=TeYQ
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
- PASID table handling updates for the Intel VT-d driver. It implements
a global PASID space now so that applications usings multiple devices
will just have one PASID.
- A new config option to make iommu passthroug mode the default.
- New sysfs attribute for iommu groups to export the type of the
default domain.
- A debugfs interface (for debug only) usable by IOMMU drivers to
export internals to user-space.
- R-Car Gen3 SoCs support for the ipmmu-vmsa driver
- The ARM-SMMU now aborts transactions from unknown devices and devices
not attached to any domain.
- Various cleanups and smaller fixes all over the place.
* tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (42 commits)
iommu/omap: Fix cache flushes on L2 table entries
iommu: Remove the ->map_sg indirection
iommu/arm-smmu-v3: Abort all transactions if SMMU is enabled in kdump kernel
iommu/arm-smmu-v3: Prevent any devices access to memory without registration
iommu/ipmmu-vmsa: Don't register as BUS IOMMU if machine doesn't have IPMMU-VMSA
iommu/ipmmu-vmsa: Clarify supported platforms
iommu/ipmmu-vmsa: Fix allocation in atomic context
iommu: Add config option to set passthrough as default
iommu: Add sysfs attribyte for domain type
iommu/arm-smmu-v3: sync the OVACKFLG to PRIQ consumer register
iommu/arm-smmu: Error out only if not enough context interrupts
iommu/io-pgtable-arm-v7s: Abort allocation when table address overflows the PTE
iommu/io-pgtable-arm: Fix pgtable allocation in selftest
iommu/vt-d: Remove the obsolete per iommu pasid tables
iommu/vt-d: Apply per pci device pasid table in SVA
iommu/vt-d: Allocate and free pasid table
iommu/vt-d: Per PCI device pasid table interfaces
iommu/vt-d: Add for_each_device_domain() helper
iommu/vt-d: Move device_domain_info to header
iommu/vt-d: Apply global PASID in SVA
...
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective.
Make the warning more helpful by suggesting the proper mem=X kernel boot
parameter to make it effective and a link to the L1TF document to help
decide if the mitigation is worth the unusable RAM.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/966571f0-9d7f-43dc-92c6-a10eec7a1254@suse.cz
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
holes in the e820 map, the main region is almost 500MB over the 32GB limit:
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable
Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
500MB revealed, that there's an off-by-one error in the check in
l1tf_select_mitigation().
l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
check in the mitigation path does not take this into account.
Instead of amending the range check, make l1tf_pfn_limit() return the first
PFN which is over the limit which is less error prone. Adjust the other
users accordingly.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
Commit a0f97e06a4 ("kbuild: enable 'make CFLAGS=...' to add
additional options to CC") renamed CFLAGS to KBUILD_CFLAGS.
Commit 222d394d30 ("kbuild: enable 'make AFLAGS=...' to add
additional options to AS") renamed AFLAGS to KBUILD_AFLAGS.
Commit 06c5040cdb ("kbuild: enable 'make CPPFLAGS=...' to add
additional options to CPP") renamed CPPFLAGS to KBUILD_CPPFLAGS.
For some reason, LDFLAGS was not renamed.
Using a well-known variable like LDFLAGS may result in accidental
override of the variable.
Kbuild generally uses KBUILD_ prefixed variables for the internally
appended options, so here is one more conversion to sanitize the
naming convention.
I did not touch Makefiles under tools/ since the tools build system
is a different world.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
Merge fixes for missing TLB shootdowns.
This fixes a couple of cases that involved us possibly freeing page
table structures before the required TLB shootdown had been done.
There are a few cleanup patches to make the code easier to follow, and
to avoid some of the more problematic cases entirely when not necessary.
To make this easier for backports, it undoes the recent lazy TLB
patches, because the cleanups and fixes are more important, and Rik is
ok with re-doing them later when things have calmed down.
The missing TLB flush was only delayed, and the wrong ordering only
happened under memory pressure (and in theory under a couple of other
fairly theoretical situations), so this may have been all very unlikely
to have hit people in practice.
But getting the TLB shootdown wrong is _so_ hard to debug and see that I
consider this a crticial fix.
Many thanks to Jann Horn for having debugged this.
* tlb-fixes:
x86/mm: Only use tlb_remove_table() for paravirt
mm: mmu_notifier fix for tlb_end_vma
mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
mm/tlb: Remove tlb_remove_table() non-concurrent condition
mm: move tlb_table_flush to tlb_flush_mmu_free
x86/mm/tlb: Revert the recent lazy TLB patches
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW36rRgAKCRCAXGG7T9hj
vkrcAQC8F+ljGO5PtYUkKcMy17vqvcq/BdetJuUVfk+G1WmLxQEAiaNiqqJGsOyJ
Msa0HHDT31uBYGg/iq7yAWk23tcTZwE=
=Px4D
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes and cleanups from Juergen Gross:
"Some cleanups, some minor fixes and a fix for a bug introduced in this
merge window hitting 32-bit PV guests"
* tag 'for-linus-4.19b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: enable early use of set_fixmap in 32-bit Xen PV guest
xen: remove unused hypercall functions
x86/xen: remove unused function xen_auto_xlated_memory_setup()
xen/ACPI: don't upload Px/Cx data for disabled processors
x86/Xen: further refine add_preferred_console() invocations
xen/mcelog: eliminate redundant setting of interface version
x86/Xen: mark xen_setup_gdt() __init
If we don't use paravirt; don't play unnecessary and complicated games
to free page-tables.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().
This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.
But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.
NOTE: s390 was also needing something like this and might now
be able to use the generic code again.
[ Modified to be on top of Nick's cleanups, which simplified this patch
now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
Fixes: 9e52fc2b50 ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commits:
95b0e6357d x86/mm/tlb: Always use lazy TLB mode
64482aafe5 x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
ac03158969 x86/mm/tlb: Make lazy TLB mode lazier
61d0beb579 x86/mm/tlb: Restructure switch_mm_irqs_off()
2ff6ddf19c x86/mm/tlb: Leave lazy TLB mode at page table free time
In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting. We'll try again later.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
optimizations for ARMv8.4 systems, Userspace interface for RAS, Fault
path optimization, Emulated physical timer fixes, Random cleanups
x86: fixes for L1TF, a new test case, non-support for SGX (inject the
right exception in the guest), a lockdep false positive
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbfXfZAAoJEL/70l94x66DL2QH/RnQZW4OaqVdE3pNvRvaNJGQ
41yk9aErbqPcK25aIKnhs9e3S+e32BhArA1YBwdHXwwuanANYv5W+o3HNTL0UFj7
UG6APKm5DR6kJeUZ3vCfyeZ/ZKxDW0uqf5DXQyHUiAhwLGw2wWYJ9Ttv0m0Q4Fxl
x9HEnK/s+komG93QT+2hIXtZdPiB026yBBqDDPyYiWrweyBagYUHz65p6qaPiOEY
HqOyLYKsgrqCv9U0NLTD9U54IWGFIaxMGgjyRdZTMCIQeGj6dAH7vyfURGOeDHvw
C0OZeEKRbMsHLwzXRBDEZp279pYgS7zafe/hMkr/znaac+j6xNwxpWwqg5Sm0UE=
=5yTH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second set of KVM updates from Paolo Bonzini:
"ARM:
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups
x86:
- fixes for L1TF
- a new test case
- non-support for SGX (inject the right exception in the guest)
- fix lockdep false positive"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
KVM: VMX: fixes for vmentry_l1d_flush module parameter
kvm: selftest: add dirty logging test
kvm: selftest: pass in extra memory when create vm
kvm: selftest: include the tools headers
kvm: selftest: unify the guest port macros
tools: introduce test_and_clear_bit
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
KVM: vmx: Add defines for SGX ENCLS exiting
x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
x86: kvm: avoid unused variable warning
KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
KVM: arm/arm64: Skip updating PTE entry if no change
KVM: arm/arm64: Skip updating PMD entry if no change
KVM: arm: Use true and false for boolean values
KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
...
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries,
each consisting of two 64-bit fields containing absolute references, to
the symbol itself and to a char array containing its name, respectively.
When we build the same configuration with KASLR enabled, we end up with an
additional ~192 KB of relocations in the .init section, i.e., one 24 byte
entry for each absolute reference, which all need to be processed at boot
time.
Given how the struct kernel_symbol that describes each entry is completely
local to module.c (except for the references emitted by EXPORT_SYMBOL()
itself), we can easily modify it to contain two 32-bit relative references
instead. This reduces the size of the __ksymtab section by 50% for all
64-bit architectures, and gets rid of the runtime relocations entirely for
architectures implementing KASLR, either via standard PIE linking (arm64)
or using custom host tools (x86).
Note that the binary search involving __ksymtab contents relies on each
section being sorted by symbol name. This is implemented based on the
input section names, not the names in the ksymtab entries, so this patch
does not interfere with that.
Given that the use of place-relative relocations requires support both in
the toolchain and in the module loader, we cannot enable this feature for
all architectures. So make it dependent on whether
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined.
Link: http://lkml.kernel.org/r/20180704083651.24360-4-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To allow existing C code to be incorporated into the decompressor or the
UEFI stub, introduce a CPP macro that turns all EXPORT_SYMBOL_xxx
declarations into nops, and #define it in places where such exports are
undesirable. Note that this gets rid of a rather dodgy redefine of
linux/export.h's header guard.
Link: http://lkml.kernel.org/r/20180704083651.24360-3-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "add support for relative references in special sections", v10.
This adds support for emitting special sections such as initcall arrays,
PCI fixups and tracepoints as relative references rather than absolute
references. This reduces the size by 50% on 64-bit architectures, but
more importantly, it removes the need for carrying relocation metadata for
these sections in relocatable kernels (e.g., for KASLR) that needs to be
fixed up at boot time. On arm64, this reduces the vmlinux footprint of
such a reference by 8x (8 byte absolute reference + 24 byte RELA entry vs
4 byte relative reference)
Patch #3 was sent out before as a single patch. This series supersedes
the previous submission. This version makes relative ksymtab entries
dependent on the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS rather
than trying to infer from kbuild test robot replies for which
architectures it should be blacklisted.
Patch #1 introduces the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS,
and sets it for the main architectures that are expected to benefit the
most from this feature, i.e., 64-bit architectures or ones that use
runtime relocations.
Patch #2 add support for #define'ing __DISABLE_EXPORTS to get rid of
ksymtab/kcrctab sections in decompressor and EFI stub objects when
rebuilding existing C files to run in a different context.
Patches #4 - #6 implement relative references for initcalls, PCI fixups
and tracepoints, respectively, all of which produce sections with order
~1000 entries on an arm64 defconfig kernel with tracing enabled. This
means we save about 28 KB of vmlinux space for each of these patches.
[From the v7 series blurb, which included the jump_label patches as well]:
For the arm64 kernel, all patches combined reduce the memory footprint
of vmlinux by about 1.3 MB (using a config copied from Ubuntu that has
KASLR enabled), of which ~1 MB is the size reduction of the RELA section
in .init, and the remaining 300 KB is reduction of .text/.data.
This patch (of 6):
Before updating certain subsystems to use place relative 32-bit
relocations in special sections, to save space and reduce the number of
absolute relocations that need to be processed at runtime by relocatable
kernels, introduce the Kconfig symbol and define it for some architectures
that should be able to support and benefit from it.
Link: http://lkml.kernel.org/r/20180704083651.24360-2-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: James Morris <jmorris@namei.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
Cc: James Morris <james.morris@microsoft.com>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two bug fixes:
1) missing entries in the l1d_param array; this can cause a host crash
if an access attempts to reach the missing entry. Future-proof the get
function against any overflows as well. However, the two entries
VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must
not be accepted by the parse function, so disable them there.
2) invalid values must be rejected even if the CPU does not have the
bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF)
... and a small refactoring, since the .cmd field is redundant with
the index in the array.
Reported-by: Bandan Das <bsd@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a7b9020b06
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mikhail reported the following lockdep splat:
WARNING: possible irq lock inversion dependency detected
CPU 0/KVM/10284 just changed the state of lock:
000000000d538a88 (&st->lock){+...}, at:
speculative_store_bypass_update+0x10b/0x170
but this lock was taken by another, HARDIRQ-safe lock
in the past:
(&(&sighand->siglock)->rlock){-.-.}
and interrupts could create inverse lock ordering between them.
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&st->lock);
local_irq_disable();
lock(&(&sighand->siglock)->rlock);
lock(&st->lock);
<Interrupt>
lock(&(&sighand->siglock)->rlock);
*** DEADLOCK ***
The code path which connects those locks is:
speculative_store_bypass_update()
ssb_prctl_set()
do_seccomp()
do_syscall_64()
In svm_vcpu_run() speculative_store_bypass_update() is called with
interupts enabled via x86_virt_spec_ctrl_set_guest/host().
This is actually a false positive, because GIF=0 so interrupts are
disabled even if IF=1; however, we can easily move the invocations of
x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
and self-contained as possible.
Fixes: 1f50ddb4f4 ("x86/speculation: Handle HT correctly on AMD")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Virtualization of Intel SGX depends on Enclave Page Cache (EPC)
management that is not yet available in the kernel, i.e. KVM support
for exposing SGX to a guest cannot be added until basic support
for SGX is upstreamed, which is a WIP[1].
Until SGX is properly supported in KVM, ensure a guest sees expected
behavior for ENCLS, i.e. all ENCLS #UD. Because SGX does not have a
true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS
instruction can be executed[1] by the guest if SGX is supported by the
system. Intercept all ENCLS leafs (via the ENCLS- exiting control and
field) and unconditionally inject #UD.
[1] https://www.spinics.net/lists/kvm/msg171333.html or
https://lkml.org/lkml/2018/7/3/879
[2] A guest can execute ENCLS in the sense that ENCLS will not take
an immediate #UD, but no ENCLS will ever succeed in a guest
without explicit support from KVM (map EPC memory into the guest),
unless KVM has a *very* egregious bug, e.g. accidentally mapped
EPC memory into the guest SPTEs. In other words this patch is
needed only to prevent the guest from seeing inconsistent behavior,
e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf
operand(s) does not point at EPC memory) instead of #UD on ENCLS.
Intercepting ENCLS is not required to prevent the guest from truly
utilizing SGX.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20180814163334.25724-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hardware support for basic SGX virtualization adds a new execution
control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit
reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf
functions, e.g. to inject faults when the VMM isn't exposing SGX to
a VM. When ENCLS_EXITING is enabled, the VMM can set/clear bits in
the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g.
setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT]
to VMExit(ENCLS).
Note: EXIT_REASON_ENCLS was previously added by commit 1f51999270
("KVM: VMX: add missing exit reasons").
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20180814163334.25724-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Substitute spaces with tab. No functional changes.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Message-Id: <1534398159-48509-1-git-send-email-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org # L1TF
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Removing one of the two accesses of the maxphyaddr variable led to
a harmless warning:
arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
Removing the #ifdef seems to be the nicest workaround, as it
makes the code look cleaner than adding another #ifdef.
Fixes: 28a1f3ac1d ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org # L1TF
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter
the ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbfSnVAAoJEILEb/54YlRxBn4QAKQ8PqkSYkBby+1hb90ET4dk
VaLkbCYXuzLK5rIDvnbYOALhVKo4B29Ex5GdCLN7cWkZMkrVKe7oX8QQTnp3/7lF
URjTKgTNec5uJG652PrE3ESAa3X/kYggj6aeQOxDR4iYKzcpJEQ92ekFW+SoJTNp
Jc2kZh3qkC2On64GB3ibsZaKnmHfPvLg0t4agwzuYq/Gff8NRJFk7kMwAPzqGzZo
b2UVRcYFWIRkJjgmU9iInoeHIY8mBdT3IiKwTemZP1dOhb5T1AHOXwGTk6/cS+RH
A9qx4eg7I3R00KmnYvO8WytYJeOu2qb83GIUx4fIJGOqfvevm5xkxB9F+nfE+ouj
ROBqO4+X4XfQGPw8slayg0rJjI9JSkXLnLdl0Qw2WRlbc4/fVWntra1C57EeKFBR
EG9UAF9+7nUUx0bOCLsfFF3+r9R3SDUjk7b4thyhYncyQRsYC+FL7ztlxnMzVtzW
M5SF2sPrpcQzqmcszdUdbESI10n5X8m/crJW4rsbTxBpAM+coO+uLcvHWOY4MpkW
BgBsR6bMDAlG/VlTFgeeP/tkCRd5zNlJi7yBFItXuOoVKXpnHCJuxq2WkZ1Rb74M
Gk1d3TduekHJm8VsLEdCJR/tEk1cMc0zVUD/a1yzI4Z21QxvXUCqMDdws4/Ey184
qmKgNR9R94vSC5xIPRhM
=9GrU
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These fix the main idle loop and the menu cpuidle governor, clean up
the latter, fix a mistake in the PCI bus type's support for system
suspend and resume, fix the ondemand and conservative cpufreq
governors, address a build issue in the system wakeup framework and
make the ACPI C-states desciptions less confusing.
Specifics:
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter the
ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava)"
* tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: menu: Handle stopped tick more aggressively
sched: idle: Avoid retaining the tick when it has been stopped
PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
cpuidle: menu: Update stale polling override comment
cpufreq: governor: Avoid accessing invalid governor_data
x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral
cpuidle: menu: Fix white space
PM / sleep: wakeup: Fix build error caused by missing SRCU support
Commit 7b25b9cb0d ("x86/xen/time: Initialize pv xen time in
init_hypervisor_platform()") moved the mapping of the shared info area
before pagetable_init(). This breaks booting as 32-bit PV guest as the
use of set_fixmap isn't possible at this time on 32-bit.
This can be worked around by populating the needed PMD on 32-bit
kernel earlier.
In order not to reimplement populate_extra_pte() using extend_brk()
for allocating new page tables extend alloc_low_pages() to do that in
case the early page table pool is not yet available.
Fixes: 7b25b9cb0d ("x86/xen/time: Initialize pv xen time in init_hypervisor_platform()")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Remove Xen hypercall functions which are used nowhere in the kernel.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
xen_auto_xlated_memory_setup() is a leftover from PVH V1. Remove it.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
As the sequence of invocations matters, add "tty" only after "hvc" when
a VGA console is available (which is often the case for Dom0, but hardly
ever for DomU).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Its only caller is __init, so to avoid section mismatch warnings when a
compiler decides to not inline the function marke this function so as
well. Take the opportunity and also make the function actually use its
argument: The sole caller passes in zero anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.
Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
UC, to hide it from speculative accesses.
Given that that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In preparation for using set_memory_uc() instead set_memory_np() for
isolating poison from speculation, teach the memtype code to sanitize
physical addresses vs __PHYSICAL_MASK.
The motivation for using set_memory_uc() for this case is to allow
ongoing access to persistent memory pages via the pmem-driver +
memcpy_mcsafe() until the poison is repaired.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
On 32bit PAE kernels on 64bit hardware with enough physical bits,
l1tf_pfn_limit() will overflow unsigned long. This in turn affects
max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
observed in a 32bit guest with 42 bits physical address size, where
max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
the following warning to dmesg:
[ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k
Fix this by using unsigned long long instead.
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Fixes: 377eeaa8e1 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Reported-by: Dominique Leuenberger <dimstar@suse.de>
Reported-by: Adrian Schroeter <adrian@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
The consolidation of the start_thread() functions removed the export
unintentionally. This breaks binfmt handlers built as a module.
Add it back.
Fixes: e634d8fc79 ("x86-64: merge the standard and compat start_thread() functions")
Signed-off-by: Rian Hunter <rian@alum.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180819230854.7275-1-rian@alum.mit.edu
Without linux/irq.h, there is no declaration of notifier_block, leading to
a build warning:
In file included from arch/x86/kernel/cpu/mcheck/threshold.c:10:
arch/x86/include/asm/mce.h:151:46: error: 'struct notifier_block' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
It's sufficient to declare the struct tag here, which avoids pulling in
more header files.
Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolai Stange <nstange@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180817100156.3009043-1-arnd@arndb.de
Currently, if the vDSO ends up containing an indirect branch or
call, GCC will emit the "external thunk" style of retpoline, and it
will fail to link.
Fix it by building the vDSO with inline retpoline thunks.
I haven't seen any reports of this triggering on an unpatched
kernel.
Fixes: commit 76b043848f ("x86/retpoline: Add initial retpoline support")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Matt Rickard <matt@softrans.com.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Vas Dias <jason.vas.dias@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/c76538cd3afbe19c6246c2d1715bc6a60bd63985.1534448381.git.luto@kernel.org
For x86 this brings in PCID emulation and CR3 caching for shadow page
tables, nested VMX live migration, nested VMCS shadowing, an optimized
IPI hypercall, and some optimizations.
ARM will come next week.
There is a semantic conflict because tip also added an .init_platform
callback to kvm.c. Please keep the initializer from this branch,
and add a call to kvmclock_init (added by tip) inside kvm_init_platform
(added here).
Also, there is a backmerge from 4.18-rc6. This is because of a
refactoring that conflicted with a relatively late bugfix and
resulted in a particularly hellish conflict. Because the conflict
was only due to unfortunate timing of the bugfix, I backmerged and
rebased the refactoring rather than force the resolution on you.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbdwNFAAoJEL/70l94x66DiPEH/1cAGZWGd85Y3yRu1dmTmqiz
kZy0V+WTQ5kyJF4ZsZKKOp+xK7Qxh5e9kLdTo70uPZCHwLu9IaGKN9+dL9Jar3DR
yLPX5bMsL8UUed9g9mlhdaNOquWi7d7BseCOnIyRTolb+cqnM5h3sle0gqXloVrS
UQb4QogDz8+86czqR8tNfazjQRKW/D2HEGD5NDNVY1qtpY+leCDAn9/u6hUT5c6z
EtufgyDh35UN+UQH0e2605gt3nN3nw3FiQJFwFF1bKeQ7k5ByWkuGQI68XtFVhs+
2WfqL3ftERkKzUOy/WoSJX/C9owvhMcpAuHDGOIlFwguNGroZivOMVnACG1AI3I=
=9Mgw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first set of KVM updates from Paolo Bonzini:
"PPC:
- minor code cleanups
x86:
- PCID emulation and CR3 caching for shadow page tables
- nested VMX live migration
- nested VMCS shadowing
- optimized IPI hypercall
- some optimizations
ARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
KVM: X86: Implement PV IPIs in linux guest
KVM: X86: Add kvm hypervisor init time platform setup callback
KVM: X86: Implement "send IPI" hypercall
KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
KVM: x86: Skip pae_root shadow allocation if tdp enabled
KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
KVM: vmx: move struct host_state usage to struct loaded_vmcs
KVM: vmx: compute need to reload FS/GS/LDT on demand
KVM: nVMX: remove a misleading comment regarding vmcs02 fields
KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
KVM: vmx: add dedicated utility to access guest's kernel_gs_base
KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
KVM: vmx: refactor segmentation code in vmx_save_host_state()
kvm: nVMX: Fix fault priority for VMX operations
kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
...
Here is the bit set of char/misc drivers for 4.19-rc1
There is a lot here, much more than normal, seems like everyone is
writing new driver subsystems these days... Anyway, major things here
are:
- new FSI driver subsystem, yet-another-powerpc low-level
hardware bus
- gnss, finally an in-kernel GPS subsystem to try to tame all of
the crazy out-of-tree drivers that have been floating around
for years, combined with some really hacky userspace
implementations. This is only for GNSS receivers, but you
have to start somewhere, and this is great to see.
Other than that, there are new slimbus drivers, new coresight drivers,
new fpga drivers, and loads of DT bindings for all of these and existing
drivers.
Full details of everything is in the shortlog.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g7ew8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykfBgCeOG0RkSI92XVZe0hs/QYFW9kk8JYAnRBf3Qpm
cvW7a+McOoKz/MGmEKsi
=TNfn
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the bit set of char/misc drivers for 4.19-rc1
There is a lot here, much more than normal, seems like everyone is
writing new driver subsystems these days... Anyway, major things here
are:
- new FSI driver subsystem, yet-another-powerpc low-level hardware
bus
- gnss, finally an in-kernel GPS subsystem to try to tame all of the
crazy out-of-tree drivers that have been floating around for years,
combined with some really hacky userspace implementations. This is
only for GNSS receivers, but you have to start somewhere, and this
is great to see.
Other than that, there are new slimbus drivers, new coresight drivers,
new fpga drivers, and loads of DT bindings for all of these and
existing drivers.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
android: binder: Rate-limit debug and userspace triggered err msgs
fsi: sbefifo: Bump max command length
fsi: scom: Fix NULL dereference
misc: mic: SCIF Fix scif_get_new_port() error handling
misc: cxl: changed asterisk position
genwqe: card_base: Use true and false for boolean values
misc: eeprom: assignment outside the if statement
uio: potential double frees if __uio_register_device() fails
eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
misc: ti-st: Fix memory leak in the error path of probe()
android: binder: Show extra_buffers_size in trace
firmware: vpd: Fix section enabled flag on vpd_section_destroy
platform: goldfish: Retire pdev_bus
goldfish: Use dedicated macros instead of manual bit shifting
goldfish: Add missing includes to goldfish.h
mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
dt-bindings: mux: add adi,adgs1408
Drivers: hv: vmbus: Cleanup synic memory free path
Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that we should *not* invert all not-present mappings,
because the all zeroes case is obviously special.
clear_page() does not undergo the XOR logic to invert the address bits,
i.e. PTE, PMD and PUD entries that have not been individually written
will have val=0 and so will trigger __pte_needs_invert(). As a result,
{pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
(adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
because the page at physical address 0 is reserved early in boot
specifically to mitigate L1TF, so explicitly exempt them from the
inversion when reading the PFN.
Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
on a VMA that has VM_PFNMAP and was mmap'd to as something other than
PROT_NONE but never used. mprotect() sends the PROT_NONE request down
prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.
[ This is a very modified version of Sean's original patch, but all
credit goes to Sean for doing this and also pointing out that
sometimes the __pte_needs_invert() function only gets the protection
bits, not the full eventual pte. But zero remains special even in
just protection bits, so that's ok. - Linus ]
Fixes: f22cc87f6c ("x86/speculation/l1tf: Invert all not present mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The split of .system_keyring into .builtin_trusted_keys and
.secondary_trusted_keys broke kexec, thereby preventing kernels signed by
keys which are now in the secondary keyring from being kexec'd.
Fix this by passing VERIFY_USE_SECONDARY_KEYRING to
verify_pefile_signature().
Fixes: d3bfe84129 ("certs: Add a secondary system keyring that can be added to dynamically")
Signed-off-by: Yannik Sembritzki <yannik@sembritzki.me>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: kexec@lists.infradead.org
Cc: keyrings@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
XkEme7cYJc8MGsA=
=2Dmn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
- Decode AER errors with names similar to "lspci" (Tyler Baicar)
- Expose AER statistics in sysfs (Rajat Jain)
- Clear AER status bits selectively based on the type of recovery (Oza
Pawandeep)
- Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
Gagniuc)
- Don't clear AER status bits if we're using the "Firmware-First"
strategy where firmware owns the registers (Alexandru Gagniuc)
- Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
Shevchenko)
- Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)
- Defer DPC event handling to work queue (Keith Busch)
- Use threaded IRQ for DPC bottom half (Keith Busch)
- Print AER status while handling DPC events (Keith Busch)
- Work around IDT switch ACS Source Validation erratum (James
Puthukattukaran)
- Emit diagnostics for all cases of PCIe Link downtraining (Links
operating slower than they're capable of) (Alexandru Gagniuc)
- Skip VFs when configuring Max Payload Size (Myron Stowe)
- Reduce Root Port Max Payload Size if necessary when hot-adding a
device below it (Myron Stowe)
- Simplify SHPC existence/permission checks (Bjorn Helgaas)
- Remove hotplug sample skeleton driver (Lukas Wunner)
- Convert pciehp to threaded IRQ handling (Lukas Wunner)
- Improve pciehp tolerance of missed events and initially unstable
links (Lukas Wunner)
- Clear spurious pciehp events on resume (Lukas Wunner)
- Add pciehp runtime PM support, including for Thunderbolt controllers
(Lukas Wunner)
- Support interrupts from pciehp bridges in D3hot (Lukas Wunner)
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph
Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
supplied (Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum
(Jakub Kicinski)
- Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)
- Document ACPI description of PCI host bridges (Bjorn Helgaas)
- Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
peer-to-peer DMA support (we don't have the peer-to-peer support yet;
this is just one piece) (Logan Gunthorpe)
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
- Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)
- Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)
- To avoid bus errors, enable PASID only if entire path supports
End-End TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
Controller (Bjorn Helgaas)
- Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)
- Remove Aardvark outbound window configuration (Evan Wang)
- Fix Aardvark bridge window sizing issue (Zachary Zhang)
- Convert Aardvark to use pci_host_probe() to reduce code duplication
(Thomas Petazzoni)
- Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)
- Add Cadence support for optional generic PHYs (Alan Douglas)
- Add Cadence power management ops (Alan Douglas)
- Remove redundant variable from Cadence driver (Colin Ian King)
- Add Kirin MSI support (Xiaowei Song)
- Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
Guo)
- Move link notification settings from DesignWare core to individual
drivers (Gustavo Pimentel)
- Add endpoint library MSI-X interfaces (Gustavo Pimentel)
- Correct signature of endpoint library IRQ interfaces (Gustavo
Pimentel)
- Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)
- Add endpoint library MSI-X test support (Gustavo Pimentel)
- Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
(Jia-Ju Bai)
- Add more devices to Broadcom PAXC quirk (Ray Jui)
- Work around corrupted Broadcom PAXC config space to enable SMMU and
GICv3 ITS (Ray Jui)
- Disable MSI parsing to work around broken Broadcom PAXC logic in some
devices (Ray Jui)
- Hide unconfigured functions to work around a Broadcom PAXC defect
(Ray Jui)
- Lower iproc log level to reduce console output during boot (Ray Jui)
- Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)
- Fix mobiveil missing include file (Lorenzo Pieralisi)
- Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)
- Fix mvebu I/O space remapping issues (Thomas Petazzoni)
- Use generic pci_host_bridge in mvebu instead of ARM-specific API
(Thomas Petazzoni)
- Whitelist VMD devices with fast interrupt handlers to avoid sharing
vectors with slow handlers (Keith Busch)
* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
PCI/AER: Don't clear AER bits if error handling is Firmware-First
PCI: Limit config space size for Netronome NFP5000
PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Match Root Port's MPS to endpoint's MPSS as necessary
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Check for PCIe Link downtraining
PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
PCI: Add device-specific ACS Redirect disable infrastructure
PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
PCI: Allow specifying devices using a base bus and path of devfns
PCI: Make specifying PCI devices in kernel parameters reusable
PCI: Hide ACS quirk declarations inside PCI core
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbc41pAAoJEAx081l5xIa+ZrAP/AzKj4i4pBLVJcvNZ2BwD+UD
ZNSNj2iqCJ5+Jo/WtIwQ8tLct9UqfVssUwBke6tZksiLdTigGPTUyVIAdK+9kyWD
D00m3x/pToJrSF2D0FwxQlPUtPkohp9N+E6+TU7gd1oCasZfBzmcEpoVAmZf+NWE
kN1xXpmGxZWpu0wc7JA2lv9MuUTijCwIqJqa5E0bB3z06G5mw+PJ89kYzMx19OyA
ZYQK8y3A40ZGl8UbajZ4xg9pqFCRYFFHGqfYlpUWWTh0XMAXu8+Yqzh3dJxmak7r
4u2pdQBsxPMZO8qKBHpVvI7Zhoe0Ntnolc0XVD+2IbqqnTprVbQs0bWf3YyfUlQi
1/9bWFK67W0LEuzac6M7a7EQqFNiHF13Btao7aqENTIe/GaCZJoopaiRMAmh6EHD
4PezeYqrW8cSaPj6OKouL1BhW9Bjixsg0bvjS/uB6m4KekFCt1++BDFGzkqvm6Mo
SVW7nkJoCFpCASaR7DhUEOPexaHeJ65HCDDUvYdqz9jd2w1TgvvanEZWual1NwEm
ImA8A4wGZ/3KijpyyKm0gE96RX7+zMMZ3brW6p1vhUUKVYJCrvSr5jrXH5+2k6Aw
Y455doGL87IRkwyje/YbQF0I8pbUZD9QS5wII13tLGwOH9/uC/Xl6dHNM40gtqyh
W4gEdY+NAMJmYLvRNawa
=g9rD
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 4.19.
Rob has some new hardware support for new qualcomm hw that I'll send
along separately. This has the display part of it, the remaining pull
is for the acceleration engine.
This also contains a wound-wait/wait-die mutex rework, Peter has acked
it for merging via my tree.
Otherwise mostly the usual level of activity. Summary:
core:
- Wound-wait/wait-die mutex rework
- Add writeback connector type
- Add "content type" property for HDMI
- Move GEM bo to drm_framebuffer
- Initial gpu scheduler documentation
- GPU scheduler fixes for dying processes
- Console deferred fbcon takeover support
- Displayport support for CEC tunneling over AUX
panel:
- otm8009a panel driver fixes
- Innolux TV123WAM and G070Y2-L01 panel driver
- Ilitek ILI9881c panel driver
- Rocktech RK070ER9427 LCD
- EDT ETM0700G0EDH6 and EDT ETM0700G0BDH6
- DLC DLC0700YZG-1
- BOE HV070WSA-100
- newhaven, nhd-4.3-480272ef-atxl LCD
- DataImage SCF0700C48GGU18
- Sharp LQ035Q7DB03
- p079zca: Refactor to support multiple panels
tinydrm:
- ILI9341 display panel
New driver:
- vkms - virtual kms driver to testing.
i915:
- Icelake:
Display enablement
DSI support
IRQ support
Powerwell support
- GPU reset fixes and improvements
- Full ppgtt support refactoring
- PSR fixes and improvements
- Execlist improvments
- GuC related fixes
amdgpu:
- Initial amdgpu documentation
- JPEG engine support on VCN
- CIK uses powerplay by default
- Move to using core PCIE functionality for gens/lanes
- DC/Powerplay interface rework
- Stutter mode support for RV
- Vega12 Powerplay updates
- GFXOFF fixes
- GPUVM fault debugging
- Vega12 GFXOFF
- DC improvements
- DC i2c/aux changes
- UVD 7.2 fixes
- Powerplay fixes for Polaris12, CZ/ST
- command submission bo_list fixes
amdkfd:
- Raven support
- Power management fixes
udl:
- Cleanups and fixes
nouveau:
- misc fixes and cleanups.
msm:
- DPU1 support display controller in sdm845
- GPU coredump support.
vmwgfx:
- Atomic modesetting validation fixes
- Support for multisample surfaces
armada:
- Atomic modesetting support completed.
exynos:
- IPPv2 fixes
- Move g2d to component framework
- Suspend/resume support cleanups
- Driver cleanups
imx:
- CSI configuration improvements
- Driver cleanups
- Use atomic suspend/resume helpers
- ipu-v3 V4L2 XRGB32/XBGR32 support
pl111:
- Add Nomadik LCDC variant
v3d:
- GPU scheduler jobs management
sun4i:
- R40 display engine support
- TCON TOP driver
mediatek:
- MT2712 SoC support
rockchip:
- vop fixes
omapdrm:
- Workaround for DRA7 errata i932
- Fix mm_list locking
mali-dp:
- Writeback implementation
PM improvements
- Internal error reporting debugfs
tilcdc:
- Single fix for deferred probing
hdlcd:
- Teardown fixes
tda998x:
- Converted to a bridge driver.
etnaviv:
- Misc fixes"
* tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm: (1506 commits)
drm/amdgpu/sriov: give 8s for recover vram under RUNTIME
drm/scheduler: fix param documentation
drm/i2c: tda998x: correct PLL divider calculation
drm/i2c: tda998x: get rid of private fill_modes function
drm/i2c: tda998x: move mode_valid() to bridge
drm/i2c: tda998x: register bridge outside of component helper
drm/i2c: tda998x: cleanup from previous changes
drm/i2c: tda998x: allocate tda998x_priv inside tda998x_create()
drm/i2c: tda998x: convert to bridge driver
drm/scheduler: fix timeout worker setup for out of order job completions
drm/amd/display: display connected to dp-1 does not light up
drm/amd/display: update clk for various HDMI color depths
drm/amd/display: program display clock on cache match
drm/amd/display: Add NULL check for enabling dp ss
drm/amd/display: add vbios table check for enabling dp ss
drm/amd/display: Don't share clk source between DP and HDMI
drm/amd/display: Fix DP HBR2 Eye Diagram Pattern on Carrizo
drm/amd/display: Use calculated disp_clk_khz value for dce110
drm/amd/display: Implement custom degamma lut on dcn
drm/amd/display: Destroy aux_engines only once
...
Pull networking updates from David Miller:
"Highlights:
- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.
- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.
- Re-enable ASPM in r8169, from Kai-Heng Feng.
- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.
- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.
- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.
- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.
- Lots of cleanups and fixes in L2TP code from Guillaume Nault.
- Add IPSEC offload support to netdevsim, from Shannon Nelson.
- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.
- Add UDP GSO support to mlx5e, from Boris Pismenny.
- Support offloading of Team LAG in NFP, from John Hurley.
- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.
- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.
- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.
- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.
- Busyloop optimizations in vhost_net, from Toshiaki Makita.
- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.
- Add parameter support to devlink, from Moshe Shemesh.
- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.
- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.
- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.
- Support regions and region snapshots in devlink, from Alex Vesker.
- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.
- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.
- Use PHYLIB in r8169 driver, from Heiner Kallweit.
- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.
- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.
- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.
- Support for templates in packet scheduler classifier, from Jiri
Pirko.
- IPV6 support in RDS, from Ka-Cheong Poon.
- Native tproxy support in nf_tables, from Máté Eckl.
- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.
- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...
i8259.h uses inb/outb and thus needs to include asm/io.h to avoid the
following build error, as seen with x86_64:defconfig and CONFIG_SMP=n.
In file included from drivers/rtc/rtc-cmos.c:45:0:
arch/x86/include/asm/i8259.h: In function 'inb_pic':
arch/x86/include/asm/i8259.h:32:24: error:
implicit declaration of function 'inb'
arch/x86/include/asm/i8259.h: In function 'outb_pic':
arch/x86/include/asm/i8259.h:45:2: error:
implicit declaration of function 'outb'
Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Suggested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the source statements of arch-independent Kconfig files instead of
duplicating the includes in every arch/$(SRCARCH)/Kconfig.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFsfAAoJED2LAQed4NsGxHsP/1tmA57OOOj8oGxO2OXhXVbr
Q0MZqCoV4bqMvK/hgCQdl9f+tp0m+j12x4xDLdVf4OqnTXMbqvPDu3uQVKvaj/k1
gHhsFA1tFgSbuJ8InltUsrPEQqbceeJsj50xHVAKijqI6LYeRPPSU7aE9obn+OzH
n2nd5sLKvMI/dqdJvW6i5KPydqTH3r3iA7D+ne/XQj0s0EMXvXUPmDT1+ijTnM4a
yfm6W5p7L/c3Ugf1Pz5PfnPl4BxBwZMfW5ie/UO8j5C6Rl0iPaOGuuHurocaaJb3
MefR/7NEAR3G8MhJyL2+70jbbwhjpqR2b5ooz1vpuulPHxjeU45BY60XIBWq1afR
ewsc12MMCYB695ieYWoHdaWgxD/jhffyRuajfpkXKIZEMgDxS03sMhdULXENVMx1
M0ZQ01g/NLWt9ti9DY3eTKB3ymOhnBa1sa77nGGUHkITq4DQKwPX1J9FP/HT6RNt
uOvzeH5kGzc7tqOlZAO0kHbwhQG1uqGcd78IYd4lgf/XfkSgDERTWjnJmnQbwr9m
3PFuST2u8eyO+8Lh1MK76TXOEkXsHMdFugPmb6SlgtMEPKGVLDPlsj52o/LFtgzl
eygfMiBFr2+ttkZ6IpNcpmQ4IztmDpz6XoMk3PqDAfUTUSYpCnq1gAEuff/eisCM
Odva1ZZaeQ7WpxhsP8rr
=gsQJ
-----END PGP SIGNATURE-----
Merge tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig consolidation from Masahiro Yamada:
"Consolidation of Kconfig files by Christoph Hellwig.
Move the source statements of arch-independent Kconfig files instead
of duplicating the includes in every arch/$(SRCARCH)/Kconfig"
* tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: add a Memory Management options" menu
kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
kconfig: use a menu in arch/Kconfig to reduce clutter
kconfig: include kernel/Kconfig.preempt from init/Kconfig
Kconfig: consolidate the "Kernel hacking" menu
kconfig: include common Kconfig files from top-level Kconfig
kconfig: remove duplicate SWAP symbol defintions
um: create a proper drivers Kconfig
um: cleanup Kconfig files
um: stop abusing KBUILD_KCONFIG
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
* pci/resource:
PCI: Make pci_get_rom_size() static
PCI: Add check code for last image indicator not set
PCI: Avoid accessing memory outside the ROM BAR
PCI: Make early dump functionality generic
PCI: Cleanup PCI_REBAR_CTRL_BAR_SHIFT handling
PCI: Restore resized BAR state on resume
PCI: Clean up resource allocation in devm_of_pci_get_host_bridge_resources()
# Conflicts:
# Documentation/admin-guide/kernel-parameters.txt
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFZ0AAoJED2LAQed4NsGcHYP/23txxk3GRP7O4UkfPw9Rtky
MHiXTgcoy2vbG+l12BgzWX+qFii8XTUe3dQtK4HnGQFUIBtEBV/hpZPJtxfgGSev
Zou5cv1kr5rNzTkCn//TG3O6/WIkTBCe2hahDCtmGDI3kd/cPK4dHbU/q6KpaqIJ
qzZYBXIvCeu2GM8idQoCRrwdMpgu1pBz1gz2sDje1yHH2toI7T6cXHRLQDgx+HPq
LIP7W9GUsoDdXjecvPD51LiW89E6BUxETBh5Ft9r9uzwB5ylQQMcw6Qyu2DiYDUX
PPsHCMiolYV+Ttcy+vj/67KOvKmEaFotssck+RD/xDCF17zKhRkup+YM8kPLHTVZ
TcAUZadbnT6U/s2W6GFwvVbN/P7cc3aif+aNCC/Pl23yagp3pydlSCocYxQgiVR7
/rx48haYDEgu/MJ1X0dOpSO0ErY7zu2OoAlNerW+D9QizwbP+WtZO/CJH8SxQRuN
dQ1xmyNrie+ODgi9tbc4eBrsb+1rioX927TP5MbJcfXt5CTsxDmIqop5XwyYIoQN
ZWWlzC8Ii3P2trAVpBgM2IEbngSxwr6T9Wbf1ScJnPKr/o1rq+pBk49cYstTz3kQ
OwJ8gPwUrkW4R+hlD7L6mL/WcrKzZBQS0Ij1QW2kVSEhRrsKo99psE1/rGehnHu9
KGB0LYYCqGSOHR4zOjg0
=VjfG
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
* tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
coccicheck: return proper error code on fail
Coccinelle: doubletest: reduce side effect false positives
kbuild: remove deprecated host-progs variable
kbuild: make samples really depend on headers_install
um: clean up archheaders recipe
kbuild: add %asm-generic to no-dot-config-targets
um: fix parallel building with O= option
scripts: Add Python 3 support to tracing/draw_functrace.py
builddeb: Add automatic support for sh{3,4}{,eb} architectures
builddeb: Add automatic support for riscv* architectures
builddeb: Add automatic support for m68k architecture
builddeb: Add automatic support for or1k architecture
builddeb: Add automatic support for sparc64 architecture
builddeb: Add automatic support for mips{,64}r6{,el} architectures
builddeb: Add automatic support for mips64el architecture
builddeb: Add automatic support for ppc64 and powerpcspe architectures
builddeb: Introduce functions to simplify kconfig tests in set_debarch
builddeb: Drop check for 32-bit s390
builddeb: Change architecture detection fallback to use dpkg-architecture
builddeb: Skip architecture detection when KBUILD_DEBARCH is set
...
allmodconfig+CONFIG_INTEL_KVM=n results in the following build error.
ERROR: "l1tf_vmx_mitigation" [arch/x86/kvm/kvm.ko] undefined!
Fixes: 5b76a3cff0 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry")
Reported-by: Meelis Roos <mroos@linux.ee>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW3LkCgAKCRCAXGG7T9hj
vtyfAQDTMUqfBlpz9XqFyTBTFRkP3aVtnEeE7BijYec+RXPOxwEAsiXwZPsmW/AN
up+NEHqPvMOcZC8zJZ9THCiBgOxligY=
=F51X
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- add dma-buf functionality to Xen grant table handling
- fix for booting the kernel as Xen PVH dom0
- fix for booting the kernel as a Xen PV guest with
CONFIG_DEBUG_VIRTUAL enabled
- other minor performance and style fixes
* tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: fix balloon initialization for PVH Dom0
xen: don't use privcmd_call() from xen_mc_flush()
xen/pv: Call get_cpu_address_sizes to set x86_virt/phys_bits
xen/biomerge: Use true and false for boolean values
xen/gntdev: don't dereference a null gntdev_dmabuf on allocation failure
xen/spinlock: Don't use pvqspinlock if only 1 vCPU
xen/gntdev: Implement dma-buf import functionality
xen/gntdev: Implement dma-buf export functionality
xen/gntdev: Add initial support for dma-buf UAPI
xen/gntdev: Make private routines/structures accessible
xen/gntdev: Allow mappings for DMA buffers
xen/grant-table: Allow allocating buffers suitable for DMA
xen/balloon: Share common memory reservation routines
xen/grant-table: Make set/clear page private code shared
A bunch of good stuff in here:
- Wire up support for qspinlock, replacing our trusty ticket lock code
- Add an IPI to flush_icache_range() to ensure that stale instructions
fetched into the pipeline are discarded along with the I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to be
constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core code
has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJbbV41AAoJELescNyEwWM0WoEIALhrKtsIn6vqFlSs/w6aDuJL
cMWmFxjTaKLmIq2+cJIdFLOJ3CH80Pu9gB+nEv/k+cZdCTfUVKfRf28HTpmYWsht
bb4AhdHMC7yFW752BHk+mzJspeC8h/2Rm8wMuNVplZ3MkPrwo3vsiuJTofLhVL/y
BihlU3+5sfBvCYIsWnuEZIev+/I/s/qm1ASiqIcKSrFRZP6VTt5f9TC75vFI8seW
7yc3odKb0CArexB8yBjiPNziehctQF42doxQyL45hezLfWw4qdgHOSiwyiOMxEz9
Fwwpp8Tx33SKLNJgqoqYznGW9PhYJ7n2Kslv19uchJrEV+mds82vdDNaWRULld4=
=kQn6
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A bunch of good stuff in here. Worth noting is that we've pulled in
the x86/mm branch from -tip so that we can make use of the core
ioremap changes which allow us to put down huge mappings in the
vmalloc area without screwing up the TLB. Much of the positive
diffstat is because of the rseq selftest for arm64.
Summary:
- Wire up support for qspinlock, replacing our trusty ticket lock
code
- Add an IPI to flush_icache_range() to ensure that stale
instructions fetched into the pipeline are discarded along with the
I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the
selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to
be constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core
code has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
arm64: alternative: Use true and false for boolean values
arm64: kexec: Add comment to explain use of __flush_icache_range()
arm64: sdei: Mark sdei stack helper functions as static
arm64, kaslr: export offset in VMCOREINFO ELF notes
arm64: perf: Add cap_user_time aarch64
efi/libstub: Only disable stackleak plugin for arm64
arm64: drop unused kernel_neon_begin_partial() macro
arm64: kexec: machine_kexec should call __flush_icache_range
arm64: svc: Ensure hardirq tracing is updated before return
arm64: mm: Export __sync_icache_dcache() for xen-privcmd
drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
arm64: Add support for STACKLEAK gcc plugin
arm64: Add stack information to on_accessible_stack
drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
arm64: fix ACPI dependencies
rseq/selftests: Add support for arm64
arm64: acpi: fix alignment fault in accessing ACPI
efi/arm: map UEFI memory map even w/o runtime services enabled
efi/arm: preserve early mapping of UEFI memory map longer for BGRT
drivers: acpi: add dependency of EFI for arm64
...
Commit 5209654a46 (x86/ACPI/cstate: Allow ACPI C1 FFH MWAIT use on
AMD systems) forgot to update the ACPI C1 idle state description and
tools like turbostat display "ACPI FFH INTEL MWAIT 0x0" which is
quite confusing on an AMD system.
Drop the "INTEL" part from the ACPI C1 FFH MWAIT C-state description
to avoid confusion.
Fixes: 5209654a46 (x86/ACPI/cstate: Allow ACPI C1 FFH MWAIT use on AMD systems)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Without program headers for PTI entry trampoline pages, the trampoline
virtual addresses do not map to anything.
Example before:
sudo gdb --quiet vmlinux /proc/kcore
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
(gdb) quit
After:
sudo gdb --quiet vmlinux /proc/kcore
[sudo] password for ahunter:
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: swapgs
0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
0xfffffe000000600a: xchg %ax,%ax
0xfffffe000000600c: mov %cr3,%rsp
0xfffffe000000600f: bts $0x3f,%rsp
0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
0xfffffe000000601b: mov %rsp,%cr3
0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
0xfffffe0000006025: pushq $0x2b
0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
0xfffffe000000602d: push %r11
0xfffffe000000602f: pushq $0x33
0xfffffe0000006031: push %rcx
0xfffffe0000006032: push %rdi
0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
0xfffffe000000603a: callq 0xfffffe0000006046
0xfffffe000000603f: pause
0xfffffe0000006041: lfence
0xfffffe0000006044: jmp 0xfffffe000000603f
0xfffffe0000006046: mov %rdi,(%rsp)
0xfffffe000000604a: retq
(gdb) quit
In addition, entry trampolines all map to the same page. Represent that
by giving the corresponding program headers in kcore the same offset.
This has the benefit that, when perf tools uses /proc/kcore as a source
for kernel object code, samples from different CPU trampolines are
aggregated together. Note, such aggregation is normal for profiling
i.e. people want to profile the object code, not every different virtual
address the object code might be mapped to (across different processes
for example).
Notes by PeterZ:
This also adds the KCORE_REMAP functionality.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently, the addresses of PTI entry trampolines are not exported to
user space. Kernel profiling tools need these addresses to identify the
kernel code, so add a symbol and address for each CPU's PTI entry
trampoline.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-3-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The function has an inline "return false;" definition with CONFIG_SMP=n
but the "real" definition is also visible leading to "redefinition of
‘apic_id_is_primary_thread’" compiler error.
Guard it with #ifdef CONFIG_SMP
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 6a4d2657e0 ("x86/smp: Provide topology_is_primary_thread()")
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal
driver (Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic
scaling governors on systems with many CPUs to avoid
scalability issues with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq
driver (Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes
(from Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures
in the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS
1025C laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in
the devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbcqOqAAoJEILEb/54YlRxOxMP/2ZFvnXU0pey/VX/+TelLMS7
/ROVGQ+s75QP1c9P/3BjvnXc0dsMRLRFPog+7wyoG/2DbEIV25COyAYsmSE0TRni
XUaZO6YAx4/e3pm2AfamYbLCPvjw85eucHg5QJQ4b1mSVRNJOsNv+fUo6lmxwvnm
j9kHvfttFeIhoa/3wa7hbhPKLln46atnpVSxCIceY7L5EFNhkKBvQt6B5yx9geb9
QMY6ohgkyN+bnK9QySXX+trcWpzx1uGX0apI07NkX7n9QGFdU4lCW8lsAf8jMC3g
PPValTsUQsdRONUJJsrgqBioq4tvtgQWibyS2tfRrOGXYvHpJNpGmHVplfsrf/SE
cvlsciR47YbmrXZuqg/r8hql+qefNN16/rnZIZ9VnbcG806VBy2z8IzI5wcdWR7p
vzxhbCqVqOHcEdEwRwvuM2io67MWvkGtKsbCP+33DBh8SubpsECpKN4nIDboa3SE
CJ15RUqXnF6enmmfCKOoHZeu7iXWDz6Pi71XmRzaj9DqbITVV281IerqLgV3rbal
BVa53+202iD0IP+2b7KedGe/5ALlI97ffN0gB+L/eB832853DKSZQKzcvvpRhEN7
Iv2crnUwuQED9ns8P7hzp1Bk9CFCAOLW8UM43YwZRPWnmdeSsPJusJ5lzkAf7bss
wfsFoUE3RaY4msnuHyCh
=kv2M
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add a new framework for CPU idle time injection, to be used by
all of the idle injection code in the kernel in the future, fix some
issues and add a number of relatively small extensions in multiple
places.
Specifics:
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory
CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal driver
(Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic scaling
governors on systems with many CPUs to avoid scalability issues
with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
(Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes (from
Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures in
the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS 1025C
laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in the
devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring)"
* tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
cpufreq: intel_pstate: Ignore turbo active ratio in HWP
cpufreq: Fix a circular lock dependency problem
cpu/hotplug: Add a cpus_read_trylock() function
x86/power/hibernate_64: Remove VLA usage
cpufreq: trace frequency limits change
cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
cpufreq: armada-37xx: Add AVS support
dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
PM / devfreq: Init user limits from OPP limits, not viceversa
PM / devfreq: rk3399_dmc: fix spelling mistakes.
PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
dt-bindings: clock: add rk3399 DDR3 standard speed bins.
...
The introduction of generic_max_swapfile_size and arch-specific versions has
broken linking on x86 with CONFIG_SWAP=n due to undefined reference to
'generic_max_swapfile_size'. Fix it by compiling the x86-specific
max_swapfile_size() only with CONFIG_SWAP=y.
Reported-by: Tomas Pruzina <pruzinat@gmail.com>
Fixes: 377eeaa8e1 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- a series from Robin to fix bus imposed dma limits by adding a separate
mask for them to struct device instead of trying to squeeze a second
meaning out of the existing dma mask as we did before. This has ACKs
from the various other subsystems touched
- a small swiotlb cleanup from Kees (acked by Konrad)
- conversion of nios2 and sh to the new generic dma-noncoherent code.
Various other architecture conversions will come through the
architectures maintainers trees.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAltxLuILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPKMA//WY7UGhTrQ9kSjrfCRYJnSgLPjkqwY1wuYdXg1FSw
rUYtQXCem59W48aX9QqHzGiCXf3uVJDBCO7/ikP8R/hHJTC/ih6tTJAtKMBTaDm5
NEVU4lZCzHfzliF3vhj/3+ifWThI0gSo5QQolf4hD5255a95HMLZgJ9zt0IRVgdj
aKfFfKG824LX+G4iuN27jwmvfOJM4KwzaAFID6Rc4Pt1BXy3CmLkED63p6YvTAk/
pCR9LZpYWPccEuA1YfHoDJrB4onTw7zkVHFBtMZ5xW+ypJwRACSpLp/8IjeG841x
/vA3BsDNu8zrRj2cHpD+fViPriMm/qdMew45keII/0jCeCs640wg9Cy7f6yl2QQh
nsCJA8DEKvY8ebaX12XyWC3GNbLNru1XORrMl9+Lqu0kUHbdNxRDg47fWILLdxGc
QvgassFOxb7wg9GwIsoGpOItagl2hHP3j0vMrHlfWaBNefi5ZtWiZs8aJBgm2rJm
kzjmXGzx9IBy5SWSpou5UDCvf4xF7CkTA11YKPyBt2H3hquFKBH6R+6SzAm0n4rn
WcAlqx4BQX1no/p0hPM4b01yPwgd3OmNmITAaYKWTvDXKpImRkPYWIfvLQBx4/PW
9TyVd1QWO1wMbCziVCaHVwpsNAHFTK/kw41OLjK4H1P+HMorGH0crhimfl5BiaJY
XJk=
=WNwP
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.19' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- a series from Robin to fix bus imposed dma limits by adding a
separate mask for them to struct device instead of trying to squeeze
a second meaning out of the existing dma mask as we did before.
This has ACKs from the various other subsystems touched
- a small swiotlb cleanup from Kees (acked by Konrad)
- conversion of nios2 and sh to the new generic dma-noncoherent code.
Various other architecture conversions will come through the
architectures maintainers trees.
* tag 'dma-mapping-4.19' of git://git.infradead.org/users/hch/dma-mapping:
sh: use generic dma_noncoherent_ops
sh: split arch/sh/mm/consistent.c
sh: use dma_direct_ops for the CONFIG_DMA_COHERENT case
sh: introduce a sh_cacheop_vaddr helper
sh: simplify get_arch_dma_ops
OF: Don't set default coherent DMA mask
ACPI/IORT: Don't set default coherent DMA mask
iommu/dma: Respect bus DMA limit for IOVAs
of/device: Set bus DMA mask as appropriate
ACPI/IORT: Set bus DMA mask as appropriate
dma-mapping: Generalise dma_32bit_limit flag
ACPI/IORT: Support address size limit for root complexes
of/platform: Initialise default DMA masks
nios2: use generic dma_noncoherent_ops
swiotlb: clean up reporting
dma-mapping: relax warning for per-device areas
Always set the 5 upper-most supported physical address bits to 1 for SPTEs
that are marked as non-present or reserved, to make them unusable for
L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
(We do not need to mark PTEs that are completely 0 as physical page 0
is already reserved.)
This allows mitigation of L1TF without disabling hyper-threading by using
shadow paging mode instead of EPT.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Merge changes in the PM core, system-wide PM infrastructure, generic
power domains (genpd) framework, ACPI PM infrastructure and cpuidle
for 4.19.
* pm-core:
driver core: Add flag to autoremove device link on supplier unbind
driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER
* pm-domains:
PM / Domains: Introduce dev_pm_domain_attach_by_name()
PM / Domains: Introduce option to attach a device by name to genpd
PM / Domains: dt: Add a power-domain-names property
* pm-sleep:
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
x86/power/hibernate_64: Remove VLA usage
PM / hibernate: cast PAGE_SIZE to int when comparing with error code
* acpi-pm:
ACPI / PM: save NVS memory for ASUS 1025C laptop
ACPI / PM: Default to s2idle in all machines supporting LP S0
* pm-cpuidle:
ARM: cpuidle: silence error on driver registration failure
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
Pull x86 PTI updates from Thomas Gleixner:
"The Speck brigade sadly provides yet another large set of patches
destroying the perfomance which we carefully built and preserved
- PTI support for 32bit PAE. The missing counter part to the 64bit
PTI code implemented by Joerg.
- A set of fixes for the Global Bit mechanics for non PCID CPUs which
were setting the Global Bit too widely and therefore possibly
exposing interesting memory needlessly.
- Protection against userspace-userspace SpectreRSB
- Support for the upcoming Enhanced IBRS mode, which is preferred
over IBRS. Unfortunately we dont know the performance impact of
this, but it's expected to be less horrible than the IBRS
hammering.
- Cleanups and simplifications"
* 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/mm/pti: Move user W+X check into pti_finalize()
x86/relocs: Add __end_rodata_aligned to S_REL
x86/mm/pti: Clone kernel-image on PTE level for 32 bit
x86/mm/pti: Don't clear permissions in pti_clone_pmd()
x86/mm/pti: Fix 32 bit PCID check
x86/mm/init: Remove freed kernel image areas from alias mapping
x86/mm/init: Add helper for freeing kernel image pages
x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
mm: Allow non-direct-map arguments to free_reserved_area()
x86/mm/pti: Clear Global bit more aggressively
x86/speculation: Support Enhanced IBRS on future CPUs
x86/speculation: Protect against userspace-userspace spectreRSB
x86/kexec: Allocate 8k PGDs for PTI
Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
x86/mm: Remove in_nmi() warning from vmalloc_fault()
x86/entry/32: Check for VM86 mode in slow-path check
perf/core: Make sure the ring-buffer is mapped in all page-tables
x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
x86/entry/32: Add debug code to check entry/exit CR3
...
Pull x86 vdso update from Thomas Gleixner:
"Use LD to link the VDSO libs instead of indirecting trough CC which
causes build failures with Clang"
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: vdso: Use $LD instead of $CC to link
Pull misc x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Provide a declaration for native_save_fl() which unbreaks the
wreckage caused by making it 'extern inline'.
- Fix the failing paravirt patching which is supposed to replace
indirect with direct calls. The wreckage is caused by an incorrect
clobber test"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Fix spectre-v2 mitigations for paravirt guests
x86/irqflags: Provide a declaration for native_save_fl
Pull x86 mm updates from Thomas Gleixner:
- Make lazy TLB mode even lazier to avoid pointless switch_mm()
operations, which reduces CPU load by 1-2% for memcache workloads
- Small cleanups and improvements all over the place
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove redundant check for kmem_cache_create()
arm/asm/tlb.h: Fix build error implicit func declaration
x86/mm/tlb: Make clear_asid_other() static
x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
x86/mm/tlb: Always use lazy TLB mode
x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Restructure switch_mm_irqs_off()
x86/mm/tlb: Leave lazy TLB mode at page table free time
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
x86/mm: Add TLB purge to free pmd/pte page interfaces
ioremap: Update pgtable free interfaces with addr
x86/mm: Disable ioremap free page handling on x86-PAE
Pull x86 platform updates from Thomas Gleixner:
"Trivial cleanups and improvements"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/UV: Remove redundant check of p == q
x86/platform/olpc: Use PTR_ERR_OR_ZERO()
x86/platform/UV: Mark memblock related init code and data correctly
Pull x86 cache QoS (RDT/CAR) updates from Thomas Gleixner:
"Add support for pseudo-locked cache regions.
Cache Allocation Technology (CAT) allows on certain CPUs to isolate a
region of cache and 'lock' it. Cache pseudo-locking builds on the fact
that a CPU can still read and write data pre-allocated outside its
current allocated area on cache hit. With cache pseudo-locking data
can be preloaded into a reserved portion of cache that no application
can fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have a
region of memory with reduced average read latency.
The locking is not perfect and gets totally screwed by WBINDV and
similar mechanisms, but it provides a reasonable enhancement for
certain types of latency sensitive applications.
The implementation extends the current CAT mechanism and provides a
generally useful exclusive CAT mode on which it builds the extra
pseude-locked regions"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/intel_rdt: Disable PMU access
x86/intel_rdt: Fix possible circular lock dependency
x86/intel_rdt: Make CPU information accessible for pseudo-locked regions
x86/intel_rdt: Support restoration of subset of permissions
x86/intel_rdt: Fix cleanup of plr structure on error
x86/intel_rdt: Move pseudo_lock_region_clear()
x86/intel_rdt: Limit C-states dynamically when pseudo-locking active
x86/intel_rdt: Support L3 cache performance event of Broadwell
x86/intel_rdt: More precise L2 hit/miss measurements
x86/intel_rdt: Create character device exposing pseudo-locked region
x86/intel_rdt: Create debugfs files for pseudo-locking testing
x86/intel_rdt: Create resctrl debug area
x86/intel_rdt: Ensure RDT cleanup on exit
x86/intel_rdt: Resctrl files reflect pseudo-locked information
x86/intel_rdt: Support creation/removal of pseudo-locked region
x86/intel_rdt: Pseudo-lock region creation/removal core
x86/intel_rdt: Discover supported platforms via prefetch disable bits
x86/intel_rdt: Add utilities to test pseudo-locked region possibility
x86/intel_rdt: Split resource group removal in two
x86/intel_rdt: Enable entering of pseudo-locksetup mode
...
Pull x86/hyper-v update from Thomas Gleixner:
"Add fast hypercall support for guest running on the Microsoft HyperV(isor)"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyper-v: Fix wrong merge conflict resolution
x86/hyper-v: Check for VP_INVAL in hyperv_flush_tlb_others()
x86/hyper-v: Check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()
x86/hyper-v: Trace PV IPI send
x86/hyper-v: Use cheaper HVCALL_SEND_IPI hypercall when possible
x86/hyper-v: Use 'fast' hypercall for HVCALL_SEND_IPI
x86/hyper-v: Implement hv_do_fast_hypercall16
x86/hyper-v: Use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
Pull x86 dump printing cleanup from Thomas Gleixner:
"Clean up the show_opcodes() printout so nested dumps can be properly
differentiated"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Avoid pr_cont() in show_opcodes()
Pull x86 cpu updates from Thomas Gleixner:
"Two small updates for the CPU code:
- Improve NUMA emulation
- Add the EPT_AD CPU feature bit"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpufeatures: Add EPT_AD feature bit
x86/numa_emulation: Introduce uniform split capability
x86/numa_emulation: Fix emulated-to-physical node mapping
Pull x86 cleanups from Thomas Gleixner:
"Trival cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/iommu: Use NULL instead of 0
x86/platform/pcspeaker: Use PTR_ERR_OR_ZERO() to fix ptr_ret.cocci warning
Pull x86 build cleanup from Thomas Gleixner:
"Remove a stale quirk for a no longer supported GCC version"
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Remove old -funit-at-a-time GCC quirk
Pull x86 asm updates from Thomas Gleixner:
"The lowlevel and ASM code updates for x86:
- Make stack trace unwinding more reliable
- ASM instruction updates for better code generation
- Various cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Add two more instruction suffixes
x86/asm/64: Use 32-bit XOR to zero registers
x86/build/vdso: Simplify 'cmd_vdso2c'
x86/build/vdso: Remove unused vdso-syms.lds
x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder
x86/unwind/orc: Detect the end of the stack
x86/stacktrace: Do not fail for ORC with regs on stack
x86/stacktrace: Clarify the reliable success paths
x86/stacktrace: Remove STACKTRACE_DUMP_ONCE
x86/stacktrace: Do not unwind after user regs
x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation
Pull x86 boot updates from Thomas Gleixner:
"Boot code updates for x86:
- Allow to skip a given amount of huge pages for address layout
randomization on the kernel command line to prevent regressions in
the huge page allocation with small memory sizes
- Various cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Use CC_SET()/CC_OUT() instead of open coding it
x86/boot/KASLR: Make local variable mem_limit static
x86/boot/KASLR: Skip specified number of 1GB huge pages when doing physical randomization (KASLR)
x86/boot/KASLR: Add two new functions for 1GB huge pages handling
Pull x86 apic update from Thomas Gleixner:
"Trivial cleanups of the APIC related code"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Trivial coding style fixes
x86/vector: Merge allocate_vector() into assign_vector_locked()
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...