This patch enables arm64 to be built with vmap'd task and IRQ stacks.
As vmap'd stacks are mapped at page granularity, stacks must be a multiple of
PAGE_SIZE. This means that a 64K page kernel must use stacks of at least 64K in
size.
To minimize the increase in Image size, IRQ stacks are dynamically allocated at
boot time, rather than embedding the boot CPU's IRQ stack in the kernel image.
This patch was co-authored by Ard Biesheuvel and Mark Rutland.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
We allocate our IRQ stacks using a percpu array. This allows us to generate our
IRQ stack pointers with adr_this_cpu, but bloats the kernel Image with the boot
CPU's IRQ stack. Additionally, these are packed with other percpu variables,
and aren't guaranteed to have guard pages.
When we enable VMAP_STACK we'll want to vmap our IRQ stacks also, in order to
provide guard pages and to permit more stringent alignment requirements. Doing
so will require that we use a percpu pointer to each IRQ stack, rather than
allocating a percpu IRQ stack in the kernel image.
This patch updates our IRQ stack code to use a percpu pointer to the base of
each IRQ stack. This will allow us to change the way the stack is allocated
with minimal changes elsewhere. In some cases we may try to backtrace before
the IRQ stack pointers are initialised, so on_irq_stack() is updated to account
for this.
In testing with cyclictest, there was no measureable difference between using
adr_this_cpu (for irq_stack) and ldr_this_cpu (for irq_stack_ptr) in the IRQ
entry path.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
In subsequent patches, we will detect stack overflow in our exception
entry code, by verifying the SP after it has been decremented to make
space for the exception regs.
This verification code is small, and we can minimize its impact by
placing it directly in the vectors. To avoid redundant modification of
the SP, we also need to move the initial decrement of the SP into the
vectors.
As a preparatory step, this patch introduces kernel_ventry, which
performs this decrement, and updates the entry code accordingly.
Subsequent patches will fold SP verification into kernel_ventry.
There should be no functional change as a result of this patch.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[Mark: turn into prep patch, expand commit msg]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Currently we define SEGMENT_ALIGN directly in our vmlinux.lds.S.
This is unfortunate, as the EFI stub currently open-codes the same
number, and in future we'll want to fiddle with this.
This patch moves the definition to our <asm/memory.h>, where it can be
used by both vmlinux.lds.S and the EFI stub code.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Before we add yet another stack to the kernel, it would be nice to
ensure that we consistently organise stack definitions and related
helper functions.
This patch moves the basic IRQ stack defintions to <asm/memory.h> to
live with their task stack counterparts. Helpers used for unwinding are
moved into <asm/stacktrace.h>, where subsequent patches will add helpers
for other stacks. Includes are fixed up accordingly.
This patch is a pure refactoring -- there should be no functional
changes as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
For historical reasons, we leave the top 16 bytes of our task and IRQ
stacks unused, a practice used to ensure that the SP can always be
masked to find the base of the current stack (historically, where
thread_info could be found).
However, this is not necessary, as:
* When an exception is taken from a task stack, we decrement the SP by
S_FRAME_SIZE and stash the exception registers before we compare the
SP against the task stack. In such cases, the SP must be at least
S_FRAME_SIZE below the limit, and can be safely masked to determine
whether the task stack is in use.
* When transitioning to an IRQ stack, we'll place a dummy frame onto the
IRQ stack before enabling asynchronous exceptions, or executing code
we expect to trigger faults. Thus, if an exception is taken from the
IRQ stack, the SP must be at least 16 bytes below the limit.
* We no longer mask the SP to find the thread_info, which is now found
via sp_el0. Note that historically, the offset was critical to ensure
that cpu_switch_to() found the correct stack for new threads that
hadn't yet executed ret_from_fork().
Given that, this initial offset serves no purpose, and can be removed.
This brings us in-line with other architectures (e.g. x86) which do not
rely on this masking.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[Mark: rebase, kill THREAD_START_SP, commit msg additions]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Our __die() implementation tries to dump the stack memory, in addition
to a backtrace, which is problematic.
For contemporary 16K stacks, this can be a lot of data, which can take a
long time to dump, and can push other useful context out of the kernel's
printk ringbuffer (and/or a user's scrollback buffer on an attached
console).
Additionally, the code implicitly assumes that the SP is on the task's
stack, and tries to dump everything between the SP and the highest task
stack address. When the SP points at an IRQ stack (or is corrupted),
this makes the kernel attempt to dump vast amounts of VA space. With
vmap'd stacks, this may result in erroneous accesses to peripherals.
This patch removes the memory dump, leaving us to rely on the backtrace,
and other means of dumping stack memory such as kdump.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
The unwind code sets the sp member of struct stackframe to
'frame pointer + 0x10' unconditionally, without regard for whether
doing so produces a legal value. So let's simply remove it now that
we have stopped using it anyway.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
As it turns out, the unwind code is slightly broken, and probably has
been for a while. The problem is in the dumping of the exception stack,
which is intended to dump the contents of the pt_regs struct at each
level in the call stack where an exception was taken and routed to a
routine marked as __exception (which means its stack frame is right
below the pt_regs struct on the stack).
'Right below the pt_regs struct' is ill defined, though: the unwind
code assigns 'frame pointer + 0x10' to the .sp member of the stackframe
struct at each level, and dump_backtrace() happily dereferences that as
the pt_regs pointer when encountering an __exception routine. However,
the actual size of the stack frame created by this routine (which could
be one of many __exception routines we have in the kernel) is not known,
and so frame.sp is pretty useless to figure out where struct pt_regs
really is.
So it seems the only way to ensure that we can find our struct pt_regs
when walking the stack frames is to put it at a known fixed offset of
the stack frame pointer that is passed to such __exception routines.
The simplest way to do that is to put it inside pt_regs itself, which is
the main change implemented by this patch. As a bonus, doing this allows
us to get rid of a fair amount of cruft related to walking from one stack
to the other, which is especially nice since we intend to introduce yet
another stack for overflow handling once we add support for vmapped
stacks. It also fixes an inconsistency where we only add a stack frame
pointing to ELR_EL1 if we are executing from the IRQ stack but not when
we are executing from the task stack.
To consistly identify exceptions regs even in the presence of exceptions
taken from entry code, we must check whether the next frame was created
by entry text, rather than whether the current frame was crated by
exception text.
To avoid backtracing using PCs that fall in the idmap, or are controlled
by userspace, we must explcitly zero the FP and LR in startup paths, and
must ensure that the frame embedded in pt_regs is zeroed upon entry from
EL0. To avoid these NULL entries showin in the backtrace, unwind_frame()
is updated to avoid them.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[Mark: compare current frame against .entry.text, avoid bogus PCs]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Currently, when unwinding the call stack, we validate the frame pointer
of each frame against frame.sp, whose value is not clearly defined, and
which makes it more difficult to link stack frames together across
different stacks. It is far better to simply check whether the frame
pointer itself points into a valid stack.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Our IRQ_STACK_PTR() and on_irq_stack() helpers both take a cpu argument,
used to generate a percpu address. In all cases, they are passed
{raw_,}smp_processor_id(), so this parameter is redundant.
Since {raw_,}smp_processor_id() use a percpu variable internally, this
approach means we generate a percpu offset to find the current cpu, then
use this to index an array of percpu offsets, which we then use to find
the current CPU's IRQ stack pointer. Thus, most of the work is
redundant.
Instead, we can consistently use raw_cpu_ptr() to generate the CPU's
irq_stack pointer by simply adding the percpu offset to the irq_stack
address, which is simpler in both respects.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Currently, cpu_switch_to and ret_from_fork both live in .entry.text,
though neither form the critical path for an exception entry.
In subsequent patches, we will require that code in .entry.text is part
of the critical path for exception entry, for which we can assume
certain properties (e.g. the presence of exception regs on the stack).
Neither cpu_switch_to nor ret_from_fork will meet these requirements, so
we must move them out of .entry.text. To ensure that neither are kprobed
after being moved out of .entry.text, we must explicitly blacklist them,
requiring a new NOKPROBE() asm helper.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
In most cases, our exception entry assembly branches to C handlers with
a BL instruction, but in cases where we do not expect to return, we use
B instead.
While this is correct today, it means that backtraces for fatal
exceptions miss the entry assembly (as the LR is stale at the point we
call C code), while non-fatal exceptions have the entry assembly in the
LR. In subsequent patches, we will need the LR to be set in these cases
in order to backtrace reliably.
This patch updates these sites to use a BL, ensuring consistency, and
preparing for backtrace rework. An ASM_BUG() is added after each of
these new BLs, which both catches unexpected returns, and ensures that
the LR value doesn't point to another function label.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Now that we have a custom printf format specifier, convert users of
full_name to use %pOF instead. This is preparation to remove storing
of the full path string for each node.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Will Deacon <will.deacon@arm.com>
In current die(), the irq is disabled for __die() handle, not
including the possible panic() handling. Since the log in __die()
can take several hundreds ms, new irq might come and interrupt
current die().
If the process calling die() holds some critical resource, and some
other process scheduled later also needs it, then it would deadlock.
The first panic will not be executed.
So here disable irq for the whole flow of die().
Signed-off-by: Qiao Zhou <qiaozhou@asrmicro.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts pending.
ARM:
- VCPU request overhaul
- allow timer and PMU to have their interrupt number selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
s390:
- initial machine check forwarding
- migration support for the CMMA page hinting information
- cleanups and fixes
x86:
- nested VMX bugfixes and improvements
- more reliable NMI window detection on AMD
- APIC timer optimizations
Generic:
- VCPU request overhaul + documentation of common code patterns
- kvm_stat improvements
There is a small conflict in arch/s390 due to an arch-wide field rename.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJZW4XTAAoJEL/70l94x66DkhMH/izpk54KI17PtyQ9VYI2sYeZ
BWK6Kl886g3ij4pFi3pECqjDJzWaa3ai+vFfzzpJJ8OkCJT5Rv4LxC5ERltVVmR8
A3T1I/MRktSC0VJLv34daPC2z4Lco/6SPipUpPnL4bE2HATKed4vzoOjQ3tOeGTy
dwi7TFjKwoVDiM7kPPDRnTHqCe5G5n13sZ49dBe9WeJ7ttJauWqoxhlYosCGNPEj
g8ZX8+cvcAhVnz5uFL8roqZ8ygNEQq2mgkU18W8ZZKuiuwR0gdsG0gSBFNTdwIMK
NoreRKMrw0+oLXTIB8SZsoieU6Qi7w3xMAMabe8AJsvYtoersugbOmdxGCr1lsA=
=OD7H
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"PPC:
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts pending.
ARM:
- VCPU request overhaul
- allow timer and PMU to have their interrupt number selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
s390:
- initial machine check forwarding
- migration support for the CMMA page hinting information
- cleanups and fixes
x86:
- nested VMX bugfixes and improvements
- more reliable NMI window detection on AMD
- APIC timer optimizations
Generic:
- VCPU request overhaul + documentation of common code patterns
- kvm_stat improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
Update my email address
kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
kvm: x86: mmu: allow A/D bits to be disabled in an mmu
x86: kvm: mmu: make spte mmio mask more explicit
x86: kvm: mmu: dead code thanks to access tracking
KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
KVM: x86: remove ignored type attribute
KVM: LAPIC: Fix lapic timer injection delay
KVM: lapic: reorganize restart_apic_timer
KVM: lapic: reorganize start_hv_timer
kvm: nVMX: Check memory operand to INVVPID
KVM: s390: Inject machine check into the nested guest
KVM: s390: Inject machine check into the guest
tools/kvm_stat: add new interactive command 'b'
tools/kvm_stat: add new command line switch '-i'
tools/kvm_stat: fix error on interactive command 'g'
KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
...
Here is the big driver core update for 4.13-rc1.
The large majority of this is a lot of cleanup of old fields in the
driver core structures and their remaining usages in random drivers.
All of those fixes have been reviewed by the various subsystem
maintainers. There's also some small firmware updates in here, a new
kobject uevent api interface that makes userspace interaction easier,
and a few other minor things.
All of these have been in linux-next for a long while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWVpX4A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymobgCfd0d13IfpZoq1N41wc6z2Z0xD7cwAnRMeH1/p
kEeISGpHPYP9f8PBh9FO
=Hfqt
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big driver core update for 4.13-rc1.
The large majority of this is a lot of cleanup of old fields in the
driver core structures and their remaining usages in random drivers.
All of those fixes have been reviewed by the various subsystem
maintainers. There's also some small firmware updates in here, a new
kobject uevent api interface that makes userspace interaction easier,
and a few other minor things.
All of these have been in linux-next for a long while with no reported
issues"
* tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (56 commits)
arm: mach-rpc: ecard: fix build error
zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()
driver-core: remove struct bus_type.dev_attrs
powerpc: vio_cmo: use dev_groups and not dev_attrs for bus_type
powerpc: vio: use dev_groups and not dev_attrs for bus_type
USB: usbip: convert to use DRIVER_ATTR_RW
s390: drivers: convert to use DRIVER_ATTR_RO/WO
platform: thinkpad_acpi: convert to use DRIVER_ATTR_RO/RW
pcmcia: ds: convert to use DRIVER_ATTR_RO
wireless: ipw2x00: convert to use DRIVER_ATTR_RW
net: ehea: convert to use DRIVER_ATTR_RO
net: caif: convert to use DRIVER_ATTR_RO
TTY: hvc: convert to use DRIVER_ATTR_RW
PCI: pci-driver: convert to use DRIVER_ATTR_WO
IB: nes: convert to use DRIVER_ATTR_RW
HID: hid-core: convert to use DRIVER_ATTR_RO and drv_groups
arm: ecard: fix dev_groups patch typo
tty: serdev: use dev_groups and not dev_attrs for bus_type
sparc: vio: use dev_groups and not dev_attrs for bus_type
hid: intel-ish-hid: use dev_groups and not dev_attrs for bus_type
...
Pull SMP hotplug updates from Thomas Gleixner:
"This update is primarily a cleanup of the CPU hotplug locking code.
The hotplug locking mechanism is an open coded RWSEM, which allows
recursive locking. The main problem with that is the recursive nature
as it evades the full lockdep coverage and hides potential deadlocks.
The rework replaces the open coded RWSEM with a percpu RWSEM and
establishes full lockdep coverage that way.
The bulk of the changes fix up recursive locking issues and address
the now fully reported potential deadlocks all over the place. Some of
these deadlocks have been observed in the RT tree, but on mainline the
probability was low enough to hide them away."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
cpu/hotplug: Constify attribute_group structures
powerpc: Only obtain cpu_hotplug_lock if called by rtasd
ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
cpu/hotplug: Remove unused check_for_tasks() function
perf/core: Don't release cred_guard_mutex if not taken
cpuhotplug: Link lock stacks for hotplug callbacks
acpi/processor: Prevent cpu hotplug deadlock
sched: Provide is_percpu_thread() helper
cpu/hotplug: Convert hotplug locking to percpu rwsem
s390: Prevent hotplug rwsem recursion
arm: Prevent hotplug rwsem recursion
arm64: Prevent cpu hotplug rwsem recursion
kprobes: Cure hotplug lock ordering issues
jump_label: Reorder hotplug lock and jump_label_lock
perf/tracing/cpuhotplug: Fix locking order
ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
PCI: Replace the racy recursion prevention
PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
x86/perf: Drop EXPORT of perf_check_microcode
...
Pull timer updates from Thomas Gleixner:
"A rather large update for timers/timekeeping:
- compat syscall consolidation (Al Viro)
- Posix timer consolidation (Christoph Helwig / Thomas Gleixner)
- Cleanup of the device tree based initialization for clockevents and
clocksources (Daniel Lezcano)
- Consolidation of the FTTMR010 clocksource/event driver (Linus
Walleij)
- The usual set of small fixes and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (93 commits)
timers: Make the cpu base lock raw
clocksource/drivers/mips-gic-timer: Fix an error code in 'gic_clocksource_of_init()'
clocksource/drivers/fsl_ftm_timer: Unmap region obtained by of_iomap
clocksource/drivers/tcb_clksrc: Make IO endian agnostic
clocksource/drivers/sun4i: Switch to the timer-of common init
clocksource/drivers/timer-of: Fix invalid iomap check
Revert "ktime: Simplify ktime_compare implementation"
clocksource/drivers: Fix uninitialized variable use in timer_of_init
kselftests: timers: Add test for frequency step
kselftests: timers: Fix inconsistency-check to not ignore first timestamp
time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD
time: Clean up CLOCK_MONOTONIC_RAW time handling
posix-cpu-timers: Make timespec to nsec conversion safe
itimer: Make timeval to nsec conversion range limited
timers: Fix parameter description of try_to_del_timer_sync()
ktime: Simplify ktime_compare implementation
clocksource/drivers/fttmr010: Factor out clock read code
clocksource/drivers/fttmr010: Implement delay timer
clocksource/drivers: Add timer-of common init routine
clocksource/drivers/tcb_clksrc: Save timer context on suspend/resume
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Rework the EFI capsule loader to allow for workarounds for
non-compliant firmware (Ard Biesheuvel)
- Implement a capsule loader quirk for Quark X102x (Jan Kiszka)
- Enable SMBIOS/DMI support for the ARM architecture (Ard Biesheuvel)
- Add CONFIG_EFI_PGT_DUMP=y support for x86-32 and kexec (Sai
Praneeth)
- Fixes for EFI support for Xen dom0 guests running under x86-64
hosts (Daniel Kiper)"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen/efi: Initialize only the EFI struct members used by Xen
efi: Process the MEMATTR table only if EFI_MEMMAP is enabled
efi/arm: Enable DMI/SMBIOS
x86/efi: Extend CONFIG_EFI_PGT_DUMP support to x86_32 and kexec as well
efi/efi_test: Use memdup_user() helper
efi/capsule: Add support for Quark security header
efi/capsule-loader: Use page addresses rather than struct page pointers
efi/capsule-loader: Redirect calls to efi_capsule_setup_info() via weak alias
efi/capsule: Remove NULL test on kmap()
efi/capsule-loader: Use a cached copy of the capsule header
efi/capsule: Adjust return type of efi_capsule_setup_info()
efi/capsule: Clean up pr_err/_info() messages
efi/capsule: Remove pr_debug() on ENOMEM or EFAULT
efi/capsule: Fix return code on failing kmap/vmap
With the introduction of struct pci_host_bridge.map_irq pointer it is
possible to assign IRQs for all devices originating from a PCI host bridge
at probe time; this is implemented through pci_assign_irq() that relies on
the struct pci_host_bridge.map_irq pointer to map IRQ for a given device.
The benefits this brings are twofold:
- the IRQ for a device is assigned once at probe time
- the IRQ assignment works also for hotplugged devices
With all DT based PCI host bridges converted to the struct
pci_host_bridge.{map/swizzle}_irq hooks mechanism the DT IRQ allocation in
ARM64 pcibios_alloc_irq() is now redundant and can be removed.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
- vcpu request overhaul
- allow timer and PMU to have their interrupt number
selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAllWCM0VHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDjJ0QAI16x6+trKhH31lTSYekYfqm4hZ2
Fp7IbALW9KNCaY35tZov2Zuh99qGRduxTh7ewqhKpON8kkU+UKj0F7zH22+vfN4m
yas/+uNr8R9VLyvea4ysPsgx8Q8v1Ix9setohHYNZIL9/klVqtaHpYvArHVF/mzq
p2j/NxRS2dlp9r2TtoMRMhA05u6r0wolhUuh+z9v2ipib0gfOBIG24jsqCTEcD9n
5A/cVd+ztYshkrV95h3y9peahwt3zOA4QBGzrQ2K25jp0s54nqhmC7JTNSa8dtar
YGW2MuAMoIFTwCFAlpwCzrwpOJFzF3Q6A8bOxei2fjclzjPMgT1xQxuhOoe4ntFa
lTPxSHalm5W6dFTW90YSo2DBcPe+N7sQkhjR0cCeY3GYsOFhXMLTlOl5Pt1YK1or
+3FAI74tFRKvVmb9mhZeGTvuzhDgRvtf3Qq5rjwlGzKc2BBOEgtMyj/Wgwo4N6Dz
IjOnoRaUGELoBCWoTorMxLpsPBdPVSUxNyJTdAhqZ/ZtT1xqjhFNLZcrVWmOTzDM
1cav+jZkla4sLmJSNDD54aCSvvtPHis0nZn9PRlh12xgOyYiAVx4K++MNuWP0P37
hbh1gbPT+FcoVxPurUsX/pjNlTucPZcBwFytZDQlpwtPBpEFzJiImLYe/PldRb0f
9WQOH1Y1+q14MF+N
=6hNK
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM updates for 4.13
- vcpu request overhaul
- allow timer and PMU to have their interrupt number
selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
Conflicts:
arch/s390/include/asm/kvm_host.h
Now that compat_vfp_get() uses the regset API to copy the FPSCR
value out to userspace, compat_vfp_set() looks inconsistent. In
particular, compat_vfp_set() will fail if called with kbuf != NULL
&& ubuf == NULL (which is valid usage according to the regset API).
This patch fixes compat_vfp_set() to use user_regset_copyin(),
similarly to compat_vfp_get().
This also squashes a sparse warning triggered by the cast that
drops __user when calling get_user().
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
compat_vfp_set() checks for userspace trying to write an excessive
amount of data to the regset. However this check is conspicuous
for its absence from every other _set() in the arm64 ptrace
implementation. In fact, the core ptrace_regset() already clamps
userspace's iov_len to the regset size before the individual regset
.{get,set}() methods get called.
This patch removes the redundant check.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
If get_user() fails when reading the new FPSCR value from userspace
in compat_vfp_get(), then garbage* will be written to the task's
FPSR and FPCR registers.
This patch prevents this by checking the return from get_user()
first.
[*] Actually, zero, due to the behaviour of get_user() on error, but
that's still not what userspace expects.
Fixes: 478fcb2cdb ("arm64: Debugging support")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
get_alt_insn() is used to read and create ARM instructions, which
are always stored in memory in little-endian order. These values
are thus correctly converted to/from native order when processed
but the pointers used to hold the address of these instructions
are declared as for native order values.
Fix this by declaring the pointers as __le32* instead of u32* and
make the few appropriate needed changes like removing the unneeded
cast '(u32*)' in front of __ALT_PTR()'s definition.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
In the flattened device tree format, all integer properties are
in big-endian order.
Here the property "kaslr-seed" is read from the fdt and then
correctly converted to native order (via fdt64_to_cpu()) but the
pointer used for this is not annotated as being for big-endian.
Fix this by declaring the pointer as fdt64_t instead of u64
(fdt64_t being itself typedefed to __be64).
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Here both variables 'cpu_id' and 'entry_point' are read via
read[lq]_relaxed(), from a little-endian annotated pointer
and then used as a native endian value.
This is correct since the read[lq]() family of function
internally do a little-to-native endian conversion.
But in this case, it is wrong to declare these variable as
little-endian since there are native ones.
Fix this by changing the declaration of these variables
as 'u32' or 'u64' instead of '__le32' / '__le64'.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Here the entrypoint, declared as a 64 bit integer, is read from
a pointer to 64bit integer but the read is done via readl_relaxed()
which is for 32bit quantities.
All the high bits will thus be lost which change the meaning
of the test against zero done later.
Fix this by using readq_relaxed() instead as it should be for
64bit quantities.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Here the functions reloc_insn_movw() & reloc_insn_imm() are used
to read, modify and write back ARM instructions, which are always
stored in memory in little-endian order. These values are thus
correctly converted to/from native order but the pointers used to
hold their addresses are declared as for native order values.
Fix this by declaring the pointers as __le32* and remove the
casts that are now unneeded.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
aarch64_insn_write() is used to write an instruction.
As on ARM64 in-memory instructions are always stored
in little-endian order, this function, taking the instruction
opcode in native order, correctly convert it to little-endian
before sending it to an helper function __aarch64_insn_write()
which will do the effective write.
This is all good, but the variable and argument holding the
converted value are not annotated for a little-endian value
but left for native values.
Fix this by adjusting the prototype of the helper and
directly using the result of cpu_to_le32() without passing
by an intermediate variable (which was not a distinct one
but the same as the one holding the native value).
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The function arch64_insn_read() is used to read an instruction.
On AM64 instructions are always stored in little-endian order
and thus the function correctly do a little-to-native endian
conversion to the value just read.
However, the variable used to hold the value before the conversion
is not declared for a little-endian value but for a native one.
Fix this by using the correct type for the declaration: __le32
Note: This only works because the function reading the value,
probe_kernel_read((), takes a void pointer and void pointers
are endian-agnostic. Otherwise probe_kernel_read() should
also be properly annotated (or worse, need to be specialized).
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Here we're reading thumb or ARM instructions, which are always
stored in memory in little-endian order. These values are thus
correctly converted to native order but the intermediate value
should be annotated as for little-endian values.
Fix this by declaring the intermediate var as __le32 or __le16.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Here we're reading thumb or ARM instructions, which are always
stored in memory in little-endian order. These values are thus
correctly converted to native order but the intermediate value
should be annotated as for little-endian values.
Fix this by declaring the intermediate var as __le32 or __le16.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When a kernel is built without CONFIG_ARM64_MODULE_PLTS, we don't
generate the expected branch instruction in ftrace_make_nop(). This
means we pass zero (rather than a valid branch) to ftrace_modify_code()
as the expected instruction to validate. This causes us to return
-EINVAL to the core ftrace code for a valid case, resulting in a splat
at boot time.
This was an unintended effect of commit:
687644209a ("arm64: ftrace: fix building without CONFIG_MODULES")
... which incorrectly moved the generation of the branch instruction
into the ifdef for CONFIG_ARM64_MODULE_PLTS.
This patch fixes the issue by moving the ifdef inside of the relevant
if-else case, and always checking that the branch is in range,
regardless of CONFIG_ARM64_MODULE_PLTS. This ensures that we generate
the expected branch instruction, and also improves our sanity checks.
For consistency, both ftrace_make_nop() and ftrace_make_call() are
updated with this pattern.
Fixes: 687644209a ("arm64: ftrace: fix building without CONFIG_MODULES")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This patch defines an extra_context signal frame record that can be
used to describe an expanded signal frame, and modifies the context
block allocator and signal frame setup and parsing code to create,
populate, parse and decode this block as necessary.
To avoid abuse by userspace, parse_user_sigframe() attempts to
ensure that:
* no more than one extra_context is accepted;
* the extra context data is a sensible size, and properly placed
and aligned.
The extra_context data is required to start at the first 16-byte
aligned address immediately after the dummy terminator record
following extra_context in rt_sigframe.__reserved[] (as ensured
during signal delivery). This serves as a sanity-check that the
signal frame has not been moved or copied without taking the extra
data into account.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
[will: add __force annotation when casting extra_datap to __user pointer]
Signed-off-by: Will Deacon <will.deacon@arm.com>
When debugging a kernel panic(), it can be useful to know which CPU
features have been detected by the kernel, as some code paths can depend
on these (and may have been patched at runtime).
This patch adds a notifier to dump the detected CPU caps (as a hex
string) at panic(), when we log other information useful for debugging.
On a Juno R1 system running v4.12-rc5, this looks like:
[ 615.431249] Kernel panic - not syncing: Fatal exception in interrupt
[ 615.437609] SMP: stopping secondary CPUs
[ 615.441872] Kernel Offset: disabled
[ 615.445372] CPU features: 0x02086
[ 615.448522] Memory Limit: none
A developer can decode this by looking at the corresponding
<asm/cpucaps.h> bits. For example, the above decodes as:
* bit 1: ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE
* bit 2: ARM64_WORKAROUND_845719
* bit 7: ARM64_WORKAROUND_834220
* bit 13: ARM64_HAS_32BIT_EL0
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When reading current's user-writable TLS register (which occurs
when dumping core for native tasks), it is possible that userspace
has modified it since the time the task was last scheduled out.
The new TLS register value is not guaranteed to have been written
immediately back to thread_struct in this case.
As a result, a coredump can capture stale data for this register.
Reading the register for a stopped task via ptrace is unaffected.
For native tasks, this patch explicitly flushes the TPIDR_EL0
register back to thread_struct before dumping when operating on
current, thus ensuring that coredump contents are up to date. For
compat tasks, the TLS register is not user-writable and so cannot
be out of sync, so no flush is required in compat_tls_get().
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When reading the FPSIMD state of current (which occurs when dumping
core), it is possible that userspace has modified the FPSIMD
registers since the time the task was last scheduled out. Such
changes are not guaranteed to be reflected immedately in
thread_struct.
As a result, a coredump can contain stale values for these
registers. Reading the registers of a stopped task via ptrace is
unaffected.
This patch explicitly flushes the CPU state back to thread_struct
before dumping when operating on current, thus ensuring that
coredump contents are up to date.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Currently, VFP registers are omitted from coredumps for compat
processes, due to a bug in the REGSET_COMPAT_VFP regset
implementation.
compat_vfp_get() needs to transfer non-contiguous data from
thread_struct.fpsimd_state, and uses put_user() to handle the
offending trailing word (FPSCR). This fails when copying to a
kernel address (i.e., kbuf && !ubuf), which is what happens when
dumping core. As a result, the ELF coredump core code silently
omits the NT_ARM_VFP note from the dump.
It would be possible to work around this with additional special
case code for the put_user(), but since user_regset_copyout() is
explicitly designed to handle this scenario it is cleaner to port
the put_user() to a user_regset_copyout() call, which this patch
does.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Merge time(keeping) updates from John Stultz:
"Just a small set of changes, the biggest changes being the MONOTONIC_RAW
handling cleanup, and a new kselftest from Miroslav. Also a a clear
warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc
and ia64."
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
remove the duplicitive tk->raw_time.tv_nsec, which can be
stored in tk->tkr_raw.xtime_nsec (similarly to how its handled
for monotonic time).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
This patch factors out the allocator for signal frame optional
records into a separate function, to ensure consistency and
facilitate later expansion.
No overrun checking is currently done, because the allocation is in
user memory and anyway the kernel never tries to allocate enough
space in the signal frame yet for an overrun to occur. This
behaviour will be refined in future patches.
The approach taken in this patch to allocation of the terminator
record is not very clean: this will also be replaced in subsequent
patches.
For future extension, a comment is added in sigcontext.h
documenting the current static allocations in __reserved[]. This
will be important for determining under what circumstances
userspace may or may not see an expanded signal frame.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>