alloc_contig_range() initiates compaction and eventual migration for the
purpose of either CMA or HugeTLB allocations. At present, the reason
code remains the same MR_CMA for either of these cases. Let's make it
MR_CONTIG_RANGE which will appropriately reflect the reason code in both
these cases.
Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Sync dtc to upstream version v1.4.6-9-gaadd0b65c987. This adds a bunch
more warnings (hidden behind W=1).
- Build dtc lexer and parser files instead of using shipped versions.
- Rework overlay apply API to take an FDT as input and apply overlays in
a single step.
- Add a phandle lookup cache. This improves boot time by hundreds of
msec on systems with large DT.
- Add trivial mcp4017/18/19 potentiometers bindings.
- Remove VLA stack usage in DT code.
-----BEGIN PGP SIGNATURE-----
iQItBAABCAAXBQJaxiUdEBxyb2JoQGtlcm5lbC5vcmcACgkQ+vtdtY28YcM0+w/+
L7nkug1Hz2476eRrsn5bm6oOO0vCrhQcDTJ/AlvU1YO8XBVgGEetLDs8drmvD0/O
FQDcpumX6G0eFoHTnTNWD7keM+0nY5jZBIAqKQNa9a0HKkjYc4HO5Ot9E02XG8W8
759vvCcGeJpysoCls9u8OplzqiDyNVQJd1a0fLivtafdKypuE/Ywh15wrzckPO+F
bxqWQd+uwm98ZVz8/o3vfYtAOJmA06A+hsyVLXYu7iKQcXYVxi+ZNbRV44MQ50NI
1w5m8GgtWe4A2lpXjmeXk1VmLPO3eEgQKnBoH7gcJmCHaVg/SVfMgBscuGSQZRQa
rQvaYRUNGJ0Mtji8EZpZb5Vip4ZCDtZCQBB3snN24CvGXI6WuIIg/8ncXt0AfLqn
pxFmC32ZcwvJR2NCpPVfTgILm6foT9IzJWKl6SQLVtqqVp9nPFua7T3l8AQak7FB
2MMaaqh7L0l0za0ZgArZZo/IWUHRb0MwZdXAkqBZlQ6f3IBqGQeKCnkclAeH8qYr
OorCOmC2OlKXLPHoz8XHeBzPRdnv1dQ//gEkKXBJ2igLU03hRWv9dxnGju/45sun
Ifo79uBAUc9s3F4Kjd/zs2iLztuPrYCSICHtJh9LPeOxoV1ZUNt+6Cm23yQ014Uo
/GsFW+lzh7c9wB1eETjPHd1WuYXiSrmE4zvbdykyLCk=
=ZWpa
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree updates from Rob Herring:
- Sync dtc to upstream version v1.4.6-9-gaadd0b65c987. This adds a
bunch more warnings (hidden behind W=1).
- Build dtc lexer and parser files instead of using shipped versions.
- Rework overlay apply API to take an FDT as input and apply overlays
in a single step.
- Add a phandle lookup cache. This improves boot time by hundreds of
msec on systems with large DT.
- Add trivial mcp4017/18/19 potentiometers bindings.
- Remove VLA stack usage in DT code.
* tag 'devicetree-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (26 commits)
of: unittest: fix an error code in of_unittest_apply_overlay()
of: unittest: move misplaced function declaration
of: unittest: Remove VLA stack usage
of: overlay: Fix forgotten reference to of_overlay_apply()
of: Documentation: Fix forgotten reference to of_overlay_apply()
of: unittest: local return value variable related cleanups
of: unittest: remove unneeded local return value variables
dt-bindings: trivial: add various mcp4017/18/19 potentiometers
of: unittest: fix an error test in of_unittest_overlay_8()
of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
dt-bindings: rockchip-dw-mshc: use consistent clock names
MAINTAINERS: Add linux/of_*.h headers to appropriate subsystems
scripts: turn off some new dtc warnings by default
scripts/dtc: Update to upstream version v1.4.6-9-gaadd0b65c987
scripts/dtc: generate lexer and parser during build instead of shipping
powerpc: boot: add strrchr function
of: overlay: do not include path in full_name of added nodes
of: unittest: clean up changeset test
arm64/efi: Make strrchr() available to the EFI namespace
ARM: boot: add strrchr function
...
This is mostly updates of the usual drivers: arcmsr, qla2xx, lpfc,
ufs, mpt3sas, hisi_sas. In addition we have removed several really
old drivers: sym53c416, NCR53c406a, fdomain, fdomain_cs and removed
the old scsi_module.c initialization from all remaining drivers. Plus
an assortment of bug fixes, initialization errors and other minor
fixes.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWsVSnSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishbvbAP9ErpTZ
OR5iJ5HIz4W3Bd8aTfEpJrDyeYwSUC+sra5SKQD/ZWyVB3fYFSg+ZROyT26pmtmd
SdImhG7hLaHgVvF5qRQ=
=SQ/n
-----END PGP SIGNATURE-----
Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual drivers: arcmsr, qla2xx, lpfc,
ufs, mpt3sas, hisi_sas.
In addition we have removed several really old drivers: sym53c416,
NCR53c406a, fdomain, fdomain_cs and removed the old scsi_module.c
initialization from all remaining drivers.
Plus an assortment of bug fixes, initialization errors and other minor
fixes"
* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (168 commits)
scsi: ufs: Add support for Auto-Hibernate Idle Timer
scsi: ufs: sysfs: reworking of the rpm_lvl and spm_lvl entries
scsi: qla2xxx: fx00 copypaste typo
scsi: qla2xxx: fix error message on <qla2400
scsi: smartpqi: update driver version
scsi: smartpqi: workaround fw bug for oq deletion
scsi: arcmsr: Change driver version to v1.40.00.05-20180309
scsi: arcmsr: Sleep to avoid CPU stuck too long for waiting adapter ready
scsi: arcmsr: Handle adapter removed due to thunderbolt cable disconnection.
scsi: arcmsr: Rename ACB_F_BUS_HANG_ON to ACB_F_ADAPTER_REMOVED for adapter hot-plug
scsi: qla2xxx: Update driver version to 10.00.00.06-k
scsi: qla2xxx: Fix Async GPN_FT for FCP and FC-NVMe scan
scsi: qla2xxx: Cleanup code to improve FC-NVMe error handling
scsi: qla2xxx: Fix FC-NVMe IO abort during driver reset
scsi: qla2xxx: Fix retry for PRLI RJT with reason of BUSY
scsi: qla2xxx: Remove nvme_done_list
scsi: qla2xxx: Return busy if rport going away
scsi: qla2xxx: Fix n2n_ae flag to prevent dev_loss on PDB change
scsi: qla2xxx: Add FC-NVMe abort processing
scsi: qla2xxx: Add changes for devloss timeout in driver
...
POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
because it does not go through the subcore restore.
Have POWER9 restore it in core restore.
Fixes: ee97b6b99f ("powerpc/mm/radix: Setup AMOR in HV mode to allow key 0")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
above (which is what the dt_cpu_ftrs setup does). Fix cputable for
DD2.2 to match.
This came about due to patches b5af4f2793 ("powerpc: Add CPU feature
bits for TM bug workarounds on POWER9 v2.2"), and 9e9626ed3a
("powerpc/64s: Fix POWER9 DD2.2 and above in DT CPU features") being
in-flight at once. The latter patch fixed dt_cpu_ftrs like this one
does. The former changed cputable to match dt_cpu_ftrs.
Fixes: b5af4f2793 ("powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pkey code added a CPU_FTR_PKEY bit, but did not add it to the
dt_cpu_ftrs feature set. Although capability is supported by all
processors in the base dt_cpu_ftrs set for 64s, it's a significant
and sufficiently well defined feature to make it optional. So add
it as a quirk for now, which can be versioned out then controlled
by the firmware (once dt_cpu_ftrs gains versioning support).
Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Presently the dt_cpu_ftrs restore_cpu will only add bits to the LPCR
for secondaries, but some bits must be removed (e.g., UPRT for HPT).
Not clearing these bits on secondaries causes checkstops when booting
with disable_radix.
restore_cpu can not just set LPCR, because it is also called by the
idle wakeup code which relies on opal_slw_set_reg to restore the value
of LPCR, at least on P8 which does not save LPCR to stack in the idle
code.
Fix this by including a mask of bits to clear from LPCR as well, which
is used by restore_cpu.
This is a little messy now, but it's a minimal fix that can be
backported. Longer term, the idle SPR save/restore code can be
reworked to completely avoid calls to restore_cpu, then restore_cpu
would be able to unconditionally set LPCR to match boot processor
environment.
Fixes: 5a61ef74f2 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As described in that commit:
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt). So
all the save/restore handling can be avoided completely.
This is true, except in the case of an NMI interrupt (sreset or
machine check) interrupting the instruction. In that case, the NMI
gets an "interrupt occurred while the processor was in power-saving
mode" indication. The power-save wakeup code uses that bit to decide
whether to restore some registers (e.g., LR). Because these are no
longer saved, this causes random register corruption.
It may be possible to restore this optimisation by detecting the case
of no register loss on the wakeup side, and avoid restoring in that
case, but that's not a minor fix because the wakeup code itself uses
some registers that would be live (e.g., LR).
Fixes: b9ee31e100 ("powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions will be introduced into the generic iomap.c so they
can deal with PIO accesses in hi-lo/lo-hi variants. Thus, the powerpc
version of iomap.c will need to provide the same functions even
though, in this arch, they are identical to the regular
io{read|write}64 functions.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Horia Geantă <horia.geanta@nxp.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Subsequent patches in this series makes use of the readq and writeq
defines in iomap.h. However, as is, they get missed on the powerpc
platform seeing the include comes before the define. This patch moves
the include down to fix this.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We get the below warning if we try to use kexec on P9:
kexec_core: Starting new kernel
WARNING: CPU: 0 PID: 1223 at arch/powerpc/kernel/process.c:826 __set_breakpoint+0xb4/0x140
[snip]
NIP __set_breakpoint+0xb4/0x140
LR kexec_prepare_cpus_wait+0x58/0x150
Call Trace:
0xc0000000ee70fb20 (unreliable)
0xc0000000ee70fb20
default_machine_kexec+0x234/0x2c0
machine_kexec+0x84/0x90
kernel_kexec+0xd8/0xe0
SyS_reboot+0x214/0x2c0
system_call+0x58/0x6c
This happens since we are trying to clear hw breakpoint on POWER9,
though we don't have CPU_FTR_DAWR enabled. Guard __set_breakpoint()
within hw_breakpoint_disable() with ppc_breakpoint_available() to
address this.
Fixes: 9654153158 ("powerpc: Disable DAWR in the base POWER9 CPU features")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kernel parameter disable_radix takes different options
disable_radix=yes|no|1|0 or just disable_radix.
prom_init parsing is not supporting these options.
Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kernel parameter disable_radix takes different options
disable_radix=yes|no|1|0 or just disable_radix. When using the later
format we get below error.
`Malformed early option 'disable_radix'`
Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With 64k page size, we have hugetlb pte entries at the pmd and pud level for
book3s64. We don't need to create a separate page table cache for that. With 4k
we need to make sure hugepd page table cache for 16M is placed at PUD level
and 16G at the PGD level.
Simplify all these by not using HUGEPD_PD_SHIFT which is confusing for book3s64.
Without this patch, with 64k page size we create pagetable caches with shift
value 10 and 7 which are not used at all.
Fixes: 419df06eea ("powerpc: Reduce the PTE_INDEX_SIZE")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With split PTL (page table lock) config, we allocate the level
4 (leaf) page table using pte fragment framework instead of slab cache
like other levels. This was done to enable us to have split page table
lock at the level 4 of the page table. We use page->plt backing the
all the level 4 pte fragment for the lock.
Currently with Radix, we use only 16 fragments out of the allocated
page. In radix each fragment is 256 bytes which means we use only 4k
out of the allocated 64K page wasting 60k of the allocated memory.
This was done earlier to keep it closer to hash.
This patch update the pte fragment count to 256, thereby using the
full 64K page and reducing the memory usage. Performance tests shows
really low impact even with THP disabled. With THP disabled we will be
contenting further less on level 4 ptl and hence the impact should be
further low.
256 threads:
without patch (10 runs of ./ebizzy -m -n 1000 -s 131072 -S 100)
median = 15678.5
stdev = 42.1209
with patch:
median = 15354
stdev = 194.743
This is with THP disabled. With THP enabled the impact of the patch
will be less.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Adds more code comments. We also remove an unnecessary pkey check
after we check for pkey error in this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt). So all
the save/restore handling can be avoided completely. In particular NV
GPRs do not have to be saved, and MSR does not have to be switched
back to kernel MSR.
So move the test for EC=ESL=0 sleep states out to power9_idle_stop,
and return directly to the caller after stop in that case.
This improves performance for ping-pong benchmark with the stop0_lite
idle state by 2.54% for 2 threads in the same core, and 2.57% for
different cores. Performance increase with HV_POSSIBLE defined will be
improved further by avoiding the hwsync.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate
idle stop function for hotplug") that added power9_offline_stop() was
written before commit 7672691a08 ("powerpc/powernv: Provide a way to
force a core into SMT4 mode").
When merging the former I failed to notice that it caused us to skip
the force-SMT4 logic for offline CPUs. The result is that offlined
CPUs will not correctly participate in the force-SMT4 logic, which
presumably will result in badness (not tested).
Reconcile the two commits by making power9_offline_stop() a pre-cursor
to power9_idle_stop(), so that they share the force-SMT4 logic.
This is based on an original commit from Nick, all breakage is my own.
Fixes: 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
- add a shell script to get Clang version
- improve portability of build scripts
- drop always-enabled CONFIG_THIN_ARCHIVE and remove unused code
- rename built-in.o which is now thin archive to built-in.a
- process clean/build targets one by one to get along with -j option
- simplify ld-option
- improve building with CONFIG_TRIM_UNUSED_KSYMS
- define KBUILD_MODNAME even for objects shared among multiple modules
- avoid linking multiple instances of same objects from composite objects
- move <linux/compiler_types.h> to c_flags to include it only for C files
- clean-up various Makefiles
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaw6eWAAoJED2LAQed4NsGrK8QAJmbYg83TTNoOgQRK/7Lg+sj
KL1+RGFxmdHRVOqG5n18L7Q4LmTD19tUFNQImrQTTrKrbH2vbMSTF2PfzdmDRwMz
R5vW5+wsagfhSttOce/GR4p9+bM9XEclzEa3liqNVQxijOFXmkV14pn0x5anYfeB
ABthxFFHcVn3exP/q3lmq048x1yNE71wUU5WQIWf6V/ZKf+++wQU8r7HpnATWYeO
vtf8gZq+xyLLjhxoJF6n6olSZXI7Yhz4jV2G68/VroS312AUFWPogK+cSshWGlSw
zGixM1q55oj9CXjZ37nR6pTzQhSZLf/uHX5beatlpeoJ4Hho6HlIABvx2oEnat7b
o5RW64RB0gVJqlYZdKxL29HNrovr9tlWPTaIPGFRvWDpl3c1w+rMKXE+5hwu8jMJ
2jgxd5FZCgBaDsAKojmeQR7PAo2ffAdUO0Dj/SuAaMOpuHWHcnJk9kIN2PUrC+Sf
d/H2soT9x+60KbQmtCEo5VfEN8bvNP3+ZSnadEG/gRN2IIa5FZAUQykU+i50gAvj
tuKAokdRGZHvXM+buYFBfN6RbhVCXzBF/fAG7r37QVR2u1zaUszmgFOUqERhTQfm
RNnyeAs9G9rljtna/AD7cIOdKTg8oETPISxt8Y6EzNMpI8PhF0aGoxso3yD19oH1
M+fq55RigsR48Kic40hY
=N5BL
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- add a shell script to get Clang version
- improve portability of build scripts
- drop always-enabled CONFIG_THIN_ARCHIVE and remove unused code
- rename built-in.o which is now thin archive to built-in.a
- process clean/build targets one by one to get along with -j option
- simplify ld-option
- improve building with CONFIG_TRIM_UNUSED_KSYMS
- define KBUILD_MODNAME even for objects shared among multiple modules
- avoid linking multiple instances of same objects from composite
objects
- move <linux/compiler_types.h> to c_flags to include it only for C
files
- clean-up various Makefiles
* tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (29 commits)
kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>
kbuild: clean up link rule of composite modules
kbuild: clean up archive rule of built-in.a
kbuild: remove partial section mismatch detection for built-in.a
net: liquidio: clean up Makefile for simpler composite object handling
lib: zstd: clean up Makefile for simpler composite object handling
kbuild: link $(real-obj-y) instead of $(obj-y) into built-in.a
kbuild: rename real-objs-y/m to real-obj-y/m
kbuild: move modname and modname-multi close to modname_flags
kbuild: simplify modname calculation
kbuild: fix modname for composite modules
kbuild: define KBUILD_MODNAME even if multiple modules share objects
kbuild: remove unnecessary $(subst $(obj)/, , ...) in modname-multi
kbuild: Use ls(1) instead of stat(1) to obtain file size
kbuild: link vmlinux only once for CONFIG_TRIM_UNUSED_KSYMS
kbuild: move include/config/ksym/* to include/ksym/*
kbuild: move CONFIG_TRIM_UNUSED_KSYMS code unneeded for external module
kbuild: restore autoksyms.h touch to the top Makefile
kbuild: move 'scripts' target below
kbuild: remove wrong 'touch' in adjust_autoksyms.sh
...
Pull sparc updates from David Miller:
1) Add support for ADI (Application Data Integrity) found in more
recent sparc64 cpus. Essentially this is keyed based access to
virtual memory, and if the key encoded in the virual address is
wrong you get a trap.
The mm changes were reviewed by Andrew Morton and others.
Work by Khalid Aziz.
2) Validate DAX completion index range properly, from Rob Gardner.
3) Add proper Kconfig deps for DAX driver. From Guenter Roeck.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
sparc64: Make atomic_xchg() an inline function rather than a macro.
sparc64: Properly range check DAX completion index
sparc: Make auxiliary vectors for ADI available on 32-bit as well
sparc64: Oracle DAX driver depends on SPARC64
sparc64: Update signal delivery to use new helper functions
sparc64: Add support for ADI (Application Data Integrity)
mm: Allow arch code to override copy_highpage()
mm: Clear arch specific VM flags on protection change
mm: Add address parameter to arch_validate_prot()
sparc64: Add auxiliary vectors to report platform ADI properties
sparc64: Add handler for "Memory Corruption Detected" trap
sparc64: Add HV fault type handlers for ADI related faults
sparc64: Add support for ADI register fields, ASIs and traps
mm, swap: Add infrastructure for saving page metadata on swap
signals, sparc: Add signal codes for ADI violations
Currently powernv reboot and shutdown requests just leave secondaries
to do their own things. This is undesirable because they can trigger
any number of watchdogs while waiting for reboot, but also we don't
know what else they might be doing -- they might be causing trouble,
trampling memory, etc.
The opal scheduled flash update code already ran into watchdog problems
due to flashing taking a long time, and it was fixed with 2196c6f1ed
("powerpc/powernv: Return secondary CPUs to firmware before FW update"),
which returns secondaries to opal. It's been found that regular reboots
can take over 10 seconds, which can result in the hard lockup watchdog
firing,
reboot: Restarting system
[ 360.038896709,5] OPAL: Reboot request...
Watchdog CPU:0 Hard LOCKUP
Watchdog CPU:44 detected Hard LOCKUP other CPUS:16
Watchdog CPU:16 Hard LOCKUP
watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0]
This patch removes the special case for flash update, and calls
smp_send_stop in all cases before calling reboot/shutdown.
smp_send_stop could return CPUs to OPAL, the main reason not to is
that the request could come from a NMI that interrupts OPAL code,
so re-entry to OPAL can cause a number of problems. Putting
secondaries into simple spin loops improves the chances of a
successful reboot.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hard lockup watchdog can fire under local_irq_disable
on platforms with irq soft masking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.
This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and minimizing their
potential to cause trouble.
Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PSSCR value is not stored to PACA_REQ_PSSCR if the CPU does not
have the XER[SO] bug.
Fix this by storing up-front, outside the workaround code. The initial
test is not required because it is a slow path.
The workaround is made to depend on CONFIG_KVM_BOOK3S_HV_POSSIBLE, to
match pnv_power9_force_smt4_catch() where it is used. Drop the comment
on pnv_power9_force_smt4_catch() as it's no longer true.
Fixes: 7672691a08 ("powerpc/powernv: Provide a way to force a core into SMT4 mode")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
After migration the security feature flags might have changed (e.g.,
destination system with unpatched firmware), but some flags are not
set/clear again in init_cpu_char_feature_flags() because it assumes
the security flags to be the defaults.
Additionally, if the H_GET_CPU_CHARACTERISTICS hypercall fails then
init_cpu_char_feature_flags() does not run again, which potentially
might leave the system in an insecure or sub-optimal configuration.
So, just restore the security feature flags to the defaults assumed
by init_cpu_char_feature_flags() so it can set/clear them correctly,
and to ensure safe settings are in place in case the hypercall fail.
Fixes: f636c14790 ("powerpc/pseries: Set or clear security feature flags")
Depends-on: 19887d6a28e2 ("powerpc: Move default security feature flags")
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the definition of the default security feature flags
(i.e., enabled by default) closer to the security feature flags.
This can be used to restore current flags to the default flags.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
flush_thread() calls __set_breakpoint() via set_debug_reg_defaults()
without checking ppc_breakpoint_available(). On Power8 or later CPUs
which have the DAWR feature disabled that will cause a write to the
DABR which is incorrect as those CPUs don't have a DABR.
Fix it two ways, by checking ppc_breakpoint_available() in
set_debug_reg_defaults(), and also by reworking __set_breakpoint() to
only write to DABR on Power7 or earlier.
Fixes: 9654153158 ("powerpc: Disable DAWR in the base POWER9 CPU features")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rework the logic in __set_breakpoint()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_h_set_mode’:
arch/powerpc/kvm/book3s_hv.c:745:8: error: implicit declaration of function ‘ppc_breakpoint_available’
if (!ppc_breakpoint_available())
^~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 398e712c00 ("KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are
not virtualized") removed allocation of lppaca on bare metal
platforms. But with CONFIG_PPC_SPLPAR enabled, we still access the
lppaca on bare metal in some code paths.
Fix this but adding runtime checks for SPLPAR (shared processor LPAR).
Fixes: 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
Pull x86 dma mapping updates from Ingo Molnar:
"This tree, by Christoph Hellwig, switches over the x86 architecture to
the generic dma-direct and swiotlb code, and also unifies more of the
dma-direct code between architectures. The now unused x86-only
primitives are removed"
* 'x86-dma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
dma-mapping: Don't clear GFP_ZERO in dma_alloc_attrs
swiotlb: Make swiotlb_{alloc,free}_buffer depend on CONFIG_DMA_DIRECT_OPS
dma/swiotlb: Remove swiotlb_{alloc,free}_coherent()
dma/direct: Handle force decryption for DMA coherent buffers in common code
dma/direct: Handle the memory encryption bit in common code
dma/swiotlb: Remove swiotlb_set_mem_attributes()
set_memory.h: Provide set_memory_{en,de}crypted() stubs
x86/dma: Remove dma_alloc_coherent_gfp_flags()
iommu/intel-iommu: Enable CONFIG_DMA_DIRECT_OPS=y and clean up intel_{alloc,free}_coherent()
iommu/amd_iommu: Use CONFIG_DMA_DIRECT_OPS=y and dma_direct_{alloc,free}()
x86/dma/amd_gart: Use dma_direct_{alloc,free}()
x86/dma/amd_gart: Look at dev->coherent_dma_mask instead of GFP_DMA
x86/dma: Use generic swiotlb_ops
x86/dma: Use DMA-direct (CONFIG_DMA_DIRECT_OPS=y)
x86/dma: Remove dma_alloc_coherent_mask()
Using this helper allows us to avoid the in-kernel calls to the
sys_readahead() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_readahead().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_mmap_pgoff().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fadvise64_64() helper allows us to avoid the in-kernel
calls to the sys_fadvise64_64() syscall. The ksys_ prefix denotes that
this function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as ksys_fadvise64_64().
Some compat stubs called sys_fadvise64(), which then just passed through
the arguments to sys_fadvise64_64(). Get rid of this indirection, and call
ksys_fadvise64_64() directly.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fallocate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_p{read,write}64() wrappers allows us to get rid of
in-kernel calls to the sys_pread64() and sys_pwrite64() syscalls.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_p{read,write}64().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_truncate() wrapper allows us to get rid of in-kernel
calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_truncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_sync_file_range().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_ftruncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Pull perf updates from Ingo Molnar:
"The main kernel side changes were:
- Modernize the kprobe and uprobe creation/destruction tooling ABIs:
The existing text based APIs (kprobe_events and uprobe_events in
tracefs), are naive, limited ABIs in that they require user-space
to clean up after themselves, which is both difficult and fragile
if the tool is buggy or exits unexpectedly. In other words they are
not really suited for modern, robust tooling.
So introduce a modern, file descriptor based ABI that does not have
these limitations: introduce the 'perf_kprobe' and 'perf_uprobe'
PMUs and extend the perf_event_open() syscall to create events with
a kprobe/uprobe attached to them. These [k,u]probe are associated
with this file descriptor, so they are not available in tracefs.
(Song Liu)
- Intel Cannon Lake CPU support (Harry Pan)
- Intel PT cleanups (Alexander Shishkin)
- Improve the performance of pinned/flexible event groups by using RB
trees (Alexey Budankov)
- Add PERF_EVENT_IOC_MODIFY_ATTRIBUTES which allows the modification
of hardware breakpoints, which new ABI variant massively speeds up
existing tooling that uses hardware breakpoints to instrument (and
debug) memory usage.
(Milind Chabbi, Jiri Olsa)
- Various Intel PEBS handling fixes and improvements, and other Intel
PMU improvements (Kan Liang)
- Various perf core improvements and optimizations (Peter Zijlstra)
- ... misc cleanups, fixes and updates.
There's over 200 tooling commits, here's an (imperfect) list of
highlights:
- 'perf annotate' improvements:
* Recognize and handle jumps to other functions as calls, which
improves the navigation along jumps and back. (Arnaldo Carvalho
de Melo)
* Add the 'P' hotkey in TUI annotation to dump annotation output
into a file, to ease e-mail reporting of annotation details.
(Arnaldo Carvalho de Melo)
* Add an IPC/cycles column to the TUI (Jin Yao)
* Improve s390 assembly annotation (Thomas Richter)
* Refactor the output formatting logic to better separate it into
interactive and non-interactive features and add the --stdio2
output variant to demonstrate this. (Arnaldo Carvalho de Melo)
- 'perf script' improvements:
* Add Python 3 support (Jaroslav Škarvada)
* Add --show-round-event (Jiri Olsa)
- 'perf c2c' improvements:
* Add NUMA analysis support (Jiri Olsa)
- 'perf trace' improvements:
* Improve PowerPC support (Ravi Bangoria)
- 'perf inject' improvements:
* Integrate ARM CoreSight traces (Robert Walker)
- 'perf stat' improvements:
* Add the --interval-count option (yuzhoujian)
* Add the --timeout option (yuzhoujian)
- 'perf sched' improvements (Changbin Du)
- Vendor events improvements :
* Add IBM s390 vendor events (Thomas Richter)
* Add and improve arm64 vendor events (John Garry, Ganapatrao
Kulkarni)
* Update POWER9 vendor events (Sukadev Bhattiprolu)
- Intel PT tooling improvements (Adrian Hunter)
- PMU handling improvements (Agustin Vega-Frias)
- Record machine topology in perf.data (Jiri Olsa)
- Various overwrite related cleanups (Kan Liang)
- Add arm64 dwarf post unwind support (Kim Phillips, Jean Pihet)
- ... and lots of other changes, cleanups and fixes, see the shortlog
and Git history for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (262 commits)
perf/x86/intel: Enable C-state residency events for Cannon Lake
perf/x86/intel: Add Cannon Lake support for RAPL profiling
perf/x86/pt, coresight: Clean up address filter structure
perf vendor events s390: Add JSON files for IBM z14
perf vendor events s390: Add JSON files for IBM z13
perf vendor events s390: Add JSON files for IBM zEC12 zBC12
perf vendor events s390: Add JSON files for IBM z196
perf vendor events s390: Add JSON files for IBM z10EC z10BC
perf mmap: Be consistent when checking for an unmaped ring buffer
perf mmap: Fix accessing unmapped mmap in perf_mmap__read_done()
perf build: Fix check-headers.sh opts assignment
perf/x86: Update rdpmc_always_available static key to the modern API
perf annotate: Use absolute addresses to calculate jump target offsets
perf annotate: Defer searching for comma in raw line till it is needed
perf annotate: Support jumping from one function to another
perf annotate: Add "_local" to jump/offset validation routines
perf python: Reference Py_None before returning it
perf annotate: Mark jumps to outher functions with the call arrow
perf annotate: Pass function descriptor to its instruction parsing routines
perf annotate: No need to calculate notes->start twice
...
In commit 9690c15742 ("powerpc/mm/radix: Fix always false comparison
against MMU_NO_CONTEXT") an issue was discovered where `mm->context.id` was
being truncated to an `unsigned int`, while the PID is actually an
`unsigned long`. Update the earlier patch by fixing one remaining
occurrence. Discovered during a compilation with W=1:
arch/powerpc/mm/tlb-radix.c:702:19: error: comparison is always false due to limited range of data type [-Werror=type-limits]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When using SIG_DBG_BRANCH_TRACING, MSR.BE is left enabled in the
user context when single_step_exception() prepares the SIGTRAP
delivery. The resulting branch-trap-within-the-SIGTRAP-handler
isn't healthy.
Commit 2538c2d08f broke this, by
replacing an MSR mask operation of ~(MSR_SE | MSR_BE) with a call
to clear_single_step() which only clears MSR_SE.
This patch adds a new helper, clear_br_trace(), which clears the
debug trap before invoking the signal handler. This helper is a
NOP for BookE as SIG_DBG_BRANCH_TRACING isn't supported on BookE.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reduces vmlinux text size by 1kB and data by 1.5kB with a small
build!
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add the recently added CPU_FTRS_POWER9_DD2_2 to the little
endian possible mask as noticed by Nick.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add GENERIC_CPU support for little-endian rather than using POWER8
specific selection for POWER9 and above.
Restrict GENERIC_CPU to POWER8 and above on little endian.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Duplicate GENERIC_CPU to avoid a kbuild warning about the prompt
being redefined. Spell out that GENERIC means >= POWER4 for BE.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER4 has been broken since at least the change 49d09bf2a6
("powerpc/64s: Optimise MSR handling in exception handling"), which
requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
POWER4 supports ISA v2.00.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The last usage was removed in c17b98cf60 ("KVM: PPC: Book3S HV:
Remove code for PPC970 processors") (Dec 2014).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
above (which is what the cputable setup does). Fix DT CPU features
quirk setup to match.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Merge with upstream changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than override the machine type in .S code (which can hide wrong
or ambiguous code generation for the target), set the type to power4
for all assembly.
This also means we need to be careful not to build power4-only code
when we're not building for Book3S, such as the "power7" versions of
copyuser/page/memcpy.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ALTIVEC and VSX features are not added by to default to the POWERx CPU
feature sets because they are intended to be enabled by firmware.
Currently they end up in CPU_FTRS_POSSIBLE due to their inclusion in
other the set for other CPUs, eg. PPC970.
But they should be added individually to the CPU_FTRS_POSSIBLE set,
because if we reduce the set of CPUs that are built-for they may
disappear from the possible mask.
It already contains CPU_FTR_VSX, so add ALTIVEC. The _COMP features
should be used because they won't be present if compiled out.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add detail to change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's not a bug to have features missing in CPU_FTR_ALWAYS, but it is a
missed opportunity for optimisation.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When building a uImage or zImage using ppc6xx_defconfig and some other
defconfigs, the following error occurs with GCC 4.5.1:
/arch/powerpc/boot/libfdt_env.h:10:13: error: redefinition of typedef 'uint32_t'
/arch/powerpc/boot/types.h:21:13: note: previous declaration of 'uint32_t' was here
/arch/powerpc/boot/libfdt_env.h:11:13: error: redefinition of typedef 'uint64_t'
/arch/powerpc/boot/types.h:22:13: note: previous declaration of 'uint64_t' was here
The problem is that commit 656ad58ef1 (powerpc/boot: Add OPAL
console to epapr wrappers) adds typedefs for uint32_t and uint64_t to
type.h but doesn't remove the pre-existing (and now duplicate)
typedefs from libfdt_env.h.
Fix the error by removing the duplicate typedefs from libfdt_env.h
Signed-off-by: Mark Greer <mgreer@animalcreek.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When waking from a CPU idle instruction (e.g., nap or stop), the sync
for ordering the KVM secondary thread state can be avoided if there
wakeup is coming from a kernel context rather than KVM context.
This improves performance for ping-pong benchmark with the stop0 idle
state by 0.46% for 2 threads in the same core, and 1.02% for different
cores.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement a new function to invoke stop, power9_offline_stop, which is
like power9_idle_stop but used by the cpu hotplug code.
Move KVM secondary state manipulation code to the offline case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
system_reset_exception does most of its own crash handling now,
invoking the debugger or crash dumps if they are registered. If not,
then it goes through to die() to print stack traces, and then is
supposed to panic (according to comments).
However after die() prints oopses, it does its own handling which
doesn't allow system_reset_exception to panic (e.g., it may just
kill the current process). This patch causes sreset exceptions to
return from die after it prints messages but before acting.
This also stops die from invoking the debugger on 0x100 crashes.
system_reset_exception similarly calls the debugger. It had been
thought this was harmless (because if the debugger was disabled,
neither call would fire, and if it was enabled the first call
would return). However in some cases like xmon 'X' command, the
debugger returns 0, which currently causes it to be entered
again (first in system_reset_exception, then in die), which is
confusing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
System Reset, being an NMI, must return more carefully than other
interrupts. It has traditionally returned via the nromal return
from exception path, but that has a number of problems.
- r13 does not get restored if returning to kernel. This is for
interrupts which may cause a context switch, which sreset will
never do. Interrupting OPAL (which uses a different r13) is one
place where this causes breakage.
- It may cause several other problems returning to kernel with
preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
It's safer just to have a simple restore and return, like machine
check which is the other NMI.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current EEH callbacks can race with a driver unbind. This can
result in a backtraces like this:
EEH: Frozen PHB#0-PE#1fc detected
EEH: PE location: S000009, PHB location: N/A
CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
Workqueue: nvme-wq nvme_reset_work [nvme]
Call Trace:
dump_stack+0x9c/0xd0 (unreliable)
eeh_dev_check_failure+0x420/0x470
eeh_check_failure+0xa0/0xa4
nvme_reset_work+0x138/0x1414 [nvme]
process_one_work+0x1ec/0x328
worker_thread+0x2e4/0x3a8
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
nvme nvme1: Removing after probe failure status: -19
<snip>
cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
lr: c000000000026564: eeh_report_error+0xe0/0x110
sp: c000000ff50f3a80
msr: 9000000000009033
dar: 400
dsisr: 40000000
current = 0xc000000ff507c000
paca = 0xc00000000fdc9d80 softe: 0 irq_happened: 0x01
pid = 782, comm = eehd
Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SM P Tue Feb 27 12:33:27 PST 2018
enter ? for help
eeh_report_error+0xe0/0x110
eeh_pe_dev_traverse+0xc0/0xdc
eeh_handle_normal_event+0x184/0x4c4
eeh_handle_event+0x30/0x288
eeh_event_handler+0x124/0x170
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
The first part is an EEH (on boot), the second half is the resulting
crash. nvme probe starts the nvme_reset_work() worker thread. This
worker thread starts touching the device which see a device error
(EEH) and hence queues up an event in the powerpc EEH worker
thread. nvme_reset_work() then continues and runs
nvme_remove_dead_ctrl_work() which results in unbinding the driver
from the device and hence releases all resources. At the same time,
the EEH worker thread starts doing the EEH .error_detected() driver
callback, which no longer works since the resources have been freed.
This fixes the problem in the same way the generic PCIe AER code (in
drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
the device_lock() while performing the driver EEH callbacks and
associated code. This ensures either the callbacks are no longer
register, or if they are registered the driver will not be removed
from underneath us.
This has been broken forever. The EEH call backs were first introduced
in 2005 (in 77bd741561) but it's not clear if a lock was needed back
then.
Fixes: 77bd741561 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
Cc: stable@vger.kernel.org # v2.6.16+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file_load() on powerpc doesn't support kdump kernels yet, so it
returns -ENOTSUPP in that case.
I've recently learned that this errno is internal to the kernel and
isn't supposed to be exposed to userspace. Therefore, change to
-EOPNOTSUPP which is defined in an uapi header.
This does indeed make kexec-tools happier. Before the patch, on
ppc64le:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Unknown error 524
After the patch:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Operation not supported
Fixes: a0458284f0 ("powerpc: Add support code for kexec_file_load()")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This hack, introduced in commit c5df7f7751 ("powerpc: allow ioremap
within reserved memory regions") is now unnecessary.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Because the two memory blocks (usually called MEM1 and MEM2) are not
merged anymore, __request_region in kernel/resource.c will correctly
allow reserving regions in the physical address space between MEM1 and
MEM2, where many important peripherals are (GPIO, MMC, USB, ...).
A previous change to __ioremap_caller in arch/powerpc/mm/pgtable_32.c
ensures that multiple memblocks are properly considered in ioremap; this
makes it unnecessary to set __allow_ioremap_reserved.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On systems where there is MMIO space between different blocks of RAM in
the physical address space, __ioremap_caller did not allow mapping these
MMIO areas, because they were below the end RAM and thus considered RAM
as well. Use the memblock-based page_is_ram function, which returns
false for such MMIO holes.
v2:
Keep the check for p < virt_to_phys(high_memory). On 32-bit systems
with high memory (memory above physical address 4GiB), the high memory
is expected to be available though ioremap. The high_memory variable
marks the end of low memory; comparing against it means that only
ioremap requests for low RAM will be denied.
Reported by Michael Ellerman.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To support accurate checking for different blocks of memory on PPC32,
use the same memblock-based approach that's already used on PPC64 also
on PPC32.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of open-coding the search in page_is_ram, call memblock_is_memory.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Wii has a blue LED in the disk drive slot, which is controlled via a
GPIO line. Add this LED to wii.dts, and mark it as a panic-indicator.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These are the GPIO line names on a Nintendo Wii, as documented in:
https://wiibrew.org/wiki/Hardware/Hollywood_GPIOs
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Hollywood GPIO controller supports 32 GPIOs, but on the Wii, only 24
are used.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Hollywood chipset's GPIO controller has two sets of registers: One
for access by the PowerPC CPU, and one for access by the ARM coprocessor
(but both are accessible from the PPC because the memory firewall
(AHBPROT) is usually disabled when booting Linux, today).
The wii_power_off function currently assumes that the poweroff GPIO pin
is configured for use via the ARM side, but the upcoming GPIO driver
configures all pins for use via the PPC side, breaking poweroff.
Configure the owner register explicitly in wii_power_off to make
wii_power_off work with and without the new GPIO driver.
I think the Wii can be switched to the generic gpio-poweroff driver,
after the GPIO driver is merged.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Previously, wii_device_probe would only initialize devices under the
/hollywood node. After this patch, platform devices placed outside of
/hollywood will also be initialized.
The intended usecase for this are devices located outside of the
Hollywood chip, such as GPIO LEDs and GPIO buttons.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On 64-bit Book3E systems, in setup_tlb_core_data() we reference other
CPUs pacas. But in commit 59f577743d ("powerpc/64: Defer paca
allocation until memory topology is discovered") the allocation of
non-boot-CPU pacas was deferred until later in boot.
This leads to an oops:
CPU maps initialized for 1 thread per core
Unable to handle kernel paging request for data at address 0x8888888888888918
Faulting instruction address: 0xc000000000e2f0d0
Oops: Kernel access of bad area, sig: 11 [#1]
NIP .setup_tlb_core_data+0xdc/0x160
Call Trace:
.setup_tlb_core_data+0x5c/0x160 (unreliable)
.setup_arch+0x80/0x348
.start_kernel+0x7c/0x598
start_here_common+0x1c/0x40
Luckily setup_tlb_core_data() is called immediately prior to
smp_setup_pacas(). So simply switching their order is sufficient to
fix the oops and seems unlikely to have any other unwanted side
effects.
Fixes: 59f577743d ("powerpc/64: Defer paca allocation until memory topology is discovered")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
SLOF checks for 'sc 1' (hypercall) support by issuing a hcall with
H_SET_DABR. Since the recent commit e8ebedbf31 ("KVM: PPC: Book3S
HV: Return error from h_set_dabr() on POWER9") changed H_SET_DABR to
return H_UNSUPPORTED on Power9, we see guest boot failures, the
symptom is the boot seems to just stop in SLOF, eg:
SLOF ***************************************************************
QEMU Starting
Build Date = Sep 24 2017 12:23:07
FW Version = buildd@ release 20170724
<no further output>
SLOF can cope if H_SET_DABR returns H_HARDWARE. So wwitch the return
value to H_HARDWARE instead of H_UNSUPPORTED so that we don't break
the guest boot.
That does mean we return a different error to PowerVM in this case,
but that's probably not a big concern.
Fixes: e8ebedbf31 ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.
This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
PPC:
- Fix a bug causing occasional machine check exceptions on POWER8 hosts
(introduced in 4.16-rc1)
x86:
- Fix a guest crashing regression with nested VMX and restricted guest
(introduced in 4.16-rc1)
- Fix dependency check for pv tlb flush (The wrong dependency that
effectively disabled the feature was added in 4.16-rc4, the original
feature in 4.16-rc1, so it got decent testing.)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJavUt5AAoJEED/6hsPKofo8uQH/RuijrsAIUnymkYY+6BYFXlh
Ri8qhG8VB+C3SpWEtsqcqNVkjJTepCD2Ej5BJTL4Gc9BSTWy7Ht6kqskEgwcnzu2
xRfkg0q0vTj1+GDd+UiTZfxiinoHtB9x3fiXali5UNTCd1fweLxdidETfO+GqMMq
KDhTR+S8dXE5VG7r+iJ80LZPtHQJ94f0fh9XpQk3X2ExTG5RBxag1U2nCfiKRAZk
xRv1CNAxNaBxS38CgYfHzg31NJx38fnq/qREsIdOx0Ju9WQkglBFkhLAGUb4vL0I
nn8YX/oV9cW2G8tyPWjC245AouABOLbzu0xyj5KgCY/z1leA9tdLFX/ET6Zye+E=
=++uZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"PPC:
- Fix a bug causing occasional machine check exceptions on POWER8
hosts (introduced in 4.16-rc1)
x86:
- Fix a guest crashing regression with nested VMX and restricted
guest (introduced in 4.16-rc1)
- Fix dependency check for pv tlb flush (the wrong dependency that
effectively disabled the feature was added in 4.16-rc4, the
original feature in 4.16-rc1, so it got decent testing)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix pv tlb flush dependencies
KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
KVM: PPC: Book3S HV: Fix duplication of host SLB entries
We need to zero-out pgd table only if we share the slab cache with
pud/pmd level caches. With the support of 4PB, we don't share the slab
cache anymore. Instead of removing the code completely hide it within
an #ifdef. We don't need to do this with any other page table level,
because they all allocate table of double the size and we take of
initializing the first half corrrectly during page table zap.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Consolidate multiple #if / #ifdef into one]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch increases the max virtual (effective) address value to 4PB.
With 4K page size config we continue to limit ourself to 64TB.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Keep the H_PGTABLE_RANGE test, update it to work]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.
The mmu_context_t is also updated to track the new extended_ids. To
support upto 4PB we need a total 8 contexts.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
in get_ea_context().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In a following patch, on finding a free area we will need to do
allocatinon of extra contexts as needed. Consolidating the return path
for slice_get_unmapped_area() will make that easier.
Split into a separate patch to make review easy.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Memory keys are supported only with hash translation mode. Instead of
using #ifdef in generic code move the key related pte bits to
respective headers
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
asm/barrier.h is not always included after asm/synch.h, which meant
it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would
be eieio when it should be lwsync. kernel/time/hrtimer.c is one case.
__SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in
to where it's used. Previously with my small simulator config, 377
instances of eieio in the tree. After this patch there are 55.
Fixes: 46d075be58 ("powerpc: Optimise smp_wmb")
Cc: stable@vger.kernel.org # v2.6.29+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
[mpe: Add missing ';' to make it compile]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
thread_pkey_regs_init() initializes the pkey related registers
instead of initializing the fields in the task structures. Fortunately
those key related registers are re-set to zero when the task
gets scheduled on the cpu. However its good to fix this glaringly
visible error.
Fixes: 06bb53b338 ("powerpc: store and restore the pkey state across context switches")
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman reported the following call trace when running
ftracetest:
BUG: using __this_cpu_write() in preemptible [00000000] code: ftracetest/6178
caller is opt_pre_handler+0xc4/0x110
CPU: 1 PID: 6178 Comm: ftracetest Not tainted 4.15.0-rc7-gcc6x-gb2cd1df #1
Call Trace:
[c0000000f9ec39c0] [c000000000ac4304] dump_stack+0xb4/0x100 (unreliable)
[c0000000f9ec3a00] [c00000000061159c] check_preemption_disabled+0x15c/0x170
[c0000000f9ec3a90] [c000000000217e84] opt_pre_handler+0xc4/0x110
[c0000000f9ec3af0] [c00000000004cf68] optimized_callback+0x148/0x170
[c0000000f9ec3b40] [c00000000004d954] optinsn_slot+0xec/0x10000
[c0000000f9ec3e30] [c00000000004bae0] kretprobe_trampoline+0x0/0x10
This is showing up since OPTPROBES is now enabled with CONFIG_PREEMPT.
trampoline_probe_handler() considers itself to be a special kprobe
handler for kretprobes. In doing so, it expects to be called from
kprobe_handler() on a trap, and re-enables preemption before returning a
non-zero return value so as to suppress any subsequent processing of the
trap by the kprobe_handler().
However, with optprobes, we don't deal with special handlers (we ignore
the return code) and just try to re-enable preemption causing the above
trace.
To address this, modify trampoline_probe_handler() to not be special.
The only additional processing done in kprobe_handler() is to emulate
the instruction (in this case, a 'nop'). We adjust the value of
regs->nip for the purpose and delegate the job of re-enabling
preemption and resetting current kprobe to the probe handlers
(kprobe_handler() or optimized_callback()).
Fixes: 8a2d71a3f2 ("powerpc/kprobes: Disable preemption before invoking probe handler for optprobes")
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
opal_nvram_write currently just assumes success if it encounters an
error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO
on other errors instead.
Fixes: 628daa8d5a ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
Cc: stable@vger.kernel.org # v3.2+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The H_CPU_BEHAV_* flags should be checked for in the 'behaviour' field
of 'struct h_cpu_char_result' -- 'character' is for H_CPU_CHAR_*
flags.
Found by playing around with QEMU's implementation of the hypercall:
H_CPU_CHAR=0xf000000000000000
H_CPU_BEHAV=0x0000000000000000
This clears H_CPU_BEHAV_FAVOUR_SECURITY and H_CPU_BEHAV_L1D_FLUSH_PR
so pseries_setup_rfi_flush() disables 'rfi_flush'; and it also
clears H_CPU_CHAR_L1D_THREAD_PRIV flag. So there is no RFI flush
mitigation at all for cpu_show_meltdown() to report; but currently
it does:
Original kernel:
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Mitigation: RFI Flush
Patched kernel:
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Not affected
H_CPU_CHAR=0x0000000000000000
H_CPU_BEHAV=0xf000000000000000
This sets H_CPU_BEHAV_BNDS_CHK_SPEC_BAR so cpu_show_spectre_v1() should
report vulnerable; but currently it doesn't:
Original kernel:
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
Not affected
Patched kernel:
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
Vulnerable
Brown-paper-bag-by: Michael Ellerman <mpe@ellerman.id.au>
Fixes: f636c14790 ("powerpc/pseries: Set or clear security feature flags")
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Try to allocate kernel page tables for direct mapping and vmemmap
according to the node of the memory they will map. The node is not
available for the linear map in early boot, so use range allocation
to allocate the page tables from the region they map, which is
effectively node-local.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix build error in radix__create_section_mapping()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Per-node allocations are possible on 64s with radix that does
not have the bolted SLB limitation.
Hash would be able to do the same if all CPUs had the bottom of
their node-local memory bolted as well. This is left as an
exercise for the reader.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add dummy definition of boot_cpuid for !SMP]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rename the dummy allocate_pacas() to fix 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Build an array that finds hardware CPU number from logical CPU
number in firmware CPU discovery. Use that rather than setting
paca of other CPUs directly, to begin with. Subsequent patch will
not have pacas allocated at this point.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix SMP=n build by adding #ifdef in arch_match_cpu_phys_id()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move this into the early setup code, and don't iterate over CPU masks.
We don't want to call into sysfs so early from setup, and a future patch
won't initialize CPU masks by the time this is called.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold in incremental fix from Nick for DSCR handling]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Split sparsemem initialisation from basic numa topology discovery.
Move the parsing earlier in boot, before pacas are allocated.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
slb_shadow structures are avoided for radix environment.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure, and enforces the PAPR
alignment requirements at allocation time. We can not reduce the 1kB
allocation size however, due to existing KVM hypervisors.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Change the paca array into an array of pointers to pacas. Allocate
pacas individually.
This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocate all possible ones up-front then freeing unused.
This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "lppaca" is a structure registered with the hypervisor. This is
unnecessary when running on non-virtualised platforms. One field from
the lppaca (pmcregs_in_use) is also used by the host, so move the host
part out into the paca (lppaca field is still updated in
guest mode).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix non-pseries build with some #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In mpic_physmask() we loop over all CPUs up to 32, then get the hard
SMP processor id of that CPU.
Currently that's possibly walking off the end of the paca array, but
in a future patch we will change the paca array to be an array of
pointers, and in that case we will get a NULL for missing CPUs and
oops. eg:
Unable to handle kernel paging request for data at address 0x88888888888888b8
Faulting instruction address: 0xc00000000004e380
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP .mpic_set_affinity+0x60/0x1a0
LR .irq_do_set_affinity+0x48/0x100
Fix it by checking the CPU is possible, this also fixes the code if
there are gaps in the CPU numbering which probably never happens on
mpic systems but who knows.
Debugged-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Improvements for the radix page fault handler for HV KVM on POWER9.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJavHlMAAoJEJ2a6ncsY3GfWTAIANbY6b8xvhUY6c3WM1dt6o78
6MWJW1vCcsDYuM5yyIjYTds2xjY6vm7oWbo9thDdLIE9sWP2uKBDbdem8KxZb6JL
t/tdzuLOYkB60BfwQL0z77UmLlHSQYF5RfJjbYVe5oXt7OU5TCe43udkHsT9QtLH
HTpxvl7ebf2TIoex1+XqrD0eJ93tSOVFWB7Ay7WRUQu08CMEQRcHaszyMqdNfHfs
LHoPwLgxyWf+7/zt2T4++ebfysQDFpQgsEuEBugXkaHkw6pSGi6R+BOrZwVVmcGm
jiHq5+mdho3wL+B47Rt7UTjkpsMLyFbWR6TrhMD7y84/CislizKhEdnEYymKwH4=
=0igD
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
KVM PPC update for 4.17
- Improvements for the radix page fault handler for HV KVM on POWER9.
These are actually all fixes for pre-4.16 code, or new hardware workarounds.
Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface
for describing CPU features.
Fix lost pending interrupts due to a race in our interrupt soft-masking code.
A workaround for a nest MMU bug with TLB invalidations on Power9.
A workaround for broadcast TLB invalidations on Power9.
Fix a bug in our instruction SLB miss handler, when handling bad addresses
(eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs.
Thanks to:
Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJau3wfExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCz
dA/+JnB5iKCXCCebnqoaX4AFTqMfxT3nr/+JkfchovZLV0PBVzKME5JtL61udmDe
j1JZU8UASLqN/8/j652s87XuuRi6xPjSPjMNXmU1LFQ7DjS9yA6FOAsbE4c1Xg4D
jSded2BSnMRtA/yw8AupvdYr4w72zKMQYzo8/Or3eUQAAge+oX3d1SQiRkD3DOUg
EdpHnOScSwz6GL9amfaQBhXwvik+4crTQ/wZ/SsTpQrfJkVzHXLn/DnHEP1qO+ky
v/Y0ix5TxpH132XsVM7UaUvy1ZcZSyEmT2qGOisGm0fj4jesVn9dQMzP+97W4QeW
ghfHj2fvzx6IsPM3PhNKITknQi/GTrukjSuzYNuj7MyvKY15HUP1MPXNeJUl5thw
kI5uYWuTvyI3daQKFXRQa7V6H0auuYeEV6/RvIlJ2YtUfqmvyECviNM/+mDC0+Jk
bgqz47qqeEz2cwIUu/vQm2phVpq+15cLPwmdA37IdyT6GvYgGmsW4HWVIsyxLR2z
fo9ghX+1oMhmMNhgVYtL2P9BfCzQenK2R+uAmUOHdNyc0LBlGKN+RPAQqQkBhKGp
BB1L2F13kpeNBNTOsPU4yH3DpPaJFtfnaeL7jd5SanwsxNnoKApFglf0nE73bvbw
AwRF/vWokbd3WzuPmOtldtluWUHQhaLECU24odVGB/r3XCI=
=qP8V
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.16. Apologies if this is a bit big at
rc7, but they're all reasonably important fixes. None are actually for
new code, so they aren't indicative of 4.16 being in bad shape from
our point of view.
- Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new
firmware interface for describing CPU features.
- Fix lost pending interrupts due to a race in our interrupt
soft-masking code.
- A workaround for a nest MMU bug with TLB invalidations on Power9.
- A workaround for broadcast TLB invalidations on Power9.
- Fix a bug in our instruction SLB miss handler, when handling bad
addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user
GPRs.
Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt,
Nicholas Piggin"
* tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs
powerpc/mm: Fixup tlbie vs store ordering issue on POWER9
powerpc/mm/radix: Move the functions that does the actual tlbie closer
powerpc/mm/radix: Remove unused code
powerpc/mm: Workaround Nest MMU bug with TLB invalidations
powerpc/mm: Add tracking of the number of coprocessors using a context
powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features