Add pci_controller_ops.dma_bus_setup, shadowing ppc_md.pci_dma_bus_setup.
Add a shim, and changes the callsites to use the shim.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduces the pci_controller_ops structure.
Add pci_controller_ops.dma_dev_setup, shadowing ppc_md.pci_dma_dev_setup.
Add a shim, and change the callsites to use the shim.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pcibios_enable_device_hook returned an int. Every implementation
returned either -EINVAL or 0. The return value wasn't propagated by
the caller: any non-zero return value caused pcibios_enable_device
to return -EINVAL itself. Therefore, make the hook return a bool.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Previously, find_and_init_phbs() was used in both PowerNV and pSeries
setup. However, since RTAS support has been dropped from PowerNV, we
can move it into a platform-specific file.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
smp_ops->probe() is currently supposed to return the number of cpus in
the system.
The last actual usage of the value was removed in May 2007 in e147ec8f18
"[POWERPC] Simplify smp_space_timers". We still passed the value around
until June 2010 when even that was finally removed in c1aa687d49
"powerpc: Clean up obsolete code relating to decrementer and timebase".
So drop that requirement, probe() now returns void, and update all
implementations.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have a powerpc specific global called mem_init_done which is "set on
boot once kmalloc can be called".
But that's not *quite* true. We set it at the bottom of mem_init(), and
rely on the fact that mm_init() calls kmem_cache_init() immediately
after that, and nothing is running in parallel.
So replace it with the generic and 100% correct slab_is_available().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The celleb code has seen no actual development for ~7 years.
We (maintainers) have no access to test hardware, and it is highly
likely the code has bit-rotted.
As far as we're aware the hardware was never widely available, and is
certainly no longer available, and no one on the list has shown any
interest in it over the years.
So remove it. If anyone has one and cares please speak up.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Freescale updates from Scott:
"Highlights include BMan device tree nodes, an MSI erratum workaround, a
couple minor performance improvements, config updates, and misc
fixes/cleanup."
All the cache line size of the current book3e 64bit SoCs are 64 bytes.
So we should use this size to align the member of paca_struct.
This only change the paca_struct's members which are private to book3e
CPUs, and should not have any effect to book3s ones. With this, we save
192 bytes. Also change it to __aligned(size) since it is preferred over
__attribute__((aligned(size))).
Before:
/* size: 1920, cachelines: 30, members: 46 */
/* sum members: 1667, holes: 6, sum holes: 141 */
/* padding: 112 */
After:
/* size: 1728, cachelines: 27, members: 46 */
/* sum members: 1667, holes: 4, sum holes: 13 */
/* padding: 48 */
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
OPAL has its own list of return codes. The patch provides a translation
of such codes in errnos for the opal_sensor_read call, and possibly
others if needed.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the normal return values for bool functions
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently have a "special" syscall for switching endianness. This is
syscall number 0x1ebe, which is handled explicitly in the 64-bit syscall
exception entry.
That has a few problems, firstly the syscall number is outside of the
usual range, which confuses various tools. For example strace doesn't
recognise the syscall at all.
Secondly it's handled explicitly as a special case in the syscall
exception entry, which is complicated enough without it.
As a first step toward removing the special syscall, we need to add a
regular syscall that implements the same functionality.
The logic is simple, it simply toggles the MSR_LE bit in the userspace
MSR. This is the same as the special syscall, with the caveat that the
special syscall clobbers fewer registers.
This version clobbers r9-r12, XER, CTR, and CR0-1,5-7.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
During suspend/migration operation we must wait for the VASI state reported
by the hypervisor to become Suspending prior to making the ibm,suspend-me
RTAS call. Calling routines to rtas_ibm_supend_me() pass a vasi_state variable
that exposes the VASI state to the caller. This is unnecessary as the caller
only really cares about the following three conditions; if there is an error
we should bailout, success indicating we have suspended and woken back up so
proceed to device tree update, or we are not suspendable yet so try calling
rtas_ibm_suspend_me again shortly.
This patch removes the extraneous vasi_state variable and simply uses the
return code to communicate how to proceed. We either succeed, fail, or get
-EAGAIN in which case we sleep for a second before trying to call
rtas_ibm_suspend_me again. The behaviour of ppc_rtas() remains the same,
but migrate_store() now returns the propogated error code on failure.
Previously -1 was returned from migrate_store() in the failure case which
equates to -EPERM and was clearly wrong.
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Cc: Nathan Fontenont <nfont@linux.vnet.ibm.com>
Cc: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Provide an unregister interface for the opal message notifiers
to be called when not needed like during driver unload/remove.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The powerpc specific st_le*() and ld_le*() functions in
arch/powerpc/asm/swab.h no longer have any users. They are also
misleadingly named, since they always byteswap, even on a little-endian
host.
This patch removes them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Sometimes the KVM code on powerpc needs to emulate load or store
instructions from the guest, which can include both normal and byte
reversed forms.
We currently (AFAICT) handle this correctly, but some variable names are
very misleading. In particular we use "is_bigendian" in several places to
actually mean "is the IO the same endian as the host", but we now support
little-endian powerpc hosts. This also ties into the misleadingly named
ld_le*() and st_le*() functions, which in fact always byteswap, even on
an LE host.
This patch cleans this up by renaming to more accurate "host_swabbed", and
uses the generic swab*() functions instead of the powerpc specific and
misleadingly named ld_le*() and st_le*() functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch removes struct eeh_dev::dn and the corresponding helper
functions: eeh_dev_to_of_node() and of_node_to_eeh_dev(). Instead,
eeh_dev_to_pdn() and pdn_to_eeh_dev() should be used to get the
pdn, which might contain device_node on PowerNV platform.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are 3 EEH operations whose arguments contain device_node:
read_config(), write_config() and restore_config(). The patch
replaces device_node with pci_dn.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Originally, EEH core probes on device_node or pci_dev to populate
EEH devices and PEs, which conflicts with the fact: SRIOV VFs are
usually enabled and created by PF's driver and they don't have the
corresponding device_nodes. Instead, SRIOV VFs have dynamically
created pci_dn, which can be used for EEH probe.
The patch reworks EEH probe for PowerNV and pSeries platforms to
do probing based on pci_dn, instead of pci_dev or device_node any
more.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch adds function traverse_pci_dn(), which is similar to
traverse_pci_devices() except it takes pci_dn, not device_node
as parameter. The pci_dev.c has been reworked to create eeh_dev
from pci_dn, instead of device_node.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Originally, EEH probes on device_node or pci_dev and populates the
corresponding eeh_dev. In the subsequent patches, EEH will probes
on pci_dn and populates the corresponding eeh_dev. So we have to
cache some information in pci_dn, either from device_node or SRIOV
PF's enablement platform hook, to populate the eeh_dev properly.
The motivation to probe pci_dn, instead of device node or pci_dev,
to populate eeh_dev is SRIOV VFs are dynamically created and we
don't have the corresponding device nodes for them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, the PCI config accessors are implemented based on device node.
Unfortunately, SRIOV VFs won't have the corresponding device nodes. pci_dn
will be used in replacement with device node for SRIOV VFs. So we have to
use pci_dn in PCI config accessors.
The patch refactors pci_dn in following aspects to make it ready to be used
in PCI config accessors as we do in subsequent patch:
* pci_dn is organized as a hierarchy tree. PCI device's pci_dn is
put to the child list of pci_dn of its upstream bridge or PHB. VF's
pci_dn will be put to the child list of pci_dn of PF's bridge.
* For one particular PCI device (VF or not), its pci_dn can be
found from pdev->dev.archdata.pci_data, PCI_DN(devnode), or
parent's list. The fast path (fetching pci_dn through PCI device
instance) is populated during early fixup time.
[bhelgaas: changelog]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
MPIC version is useful information for both mpic_alloc() and mpic_init().
The patch provide an API to get MPIC version for reusing the code.
Also, some other IP block may need MPIC version for their own use.
The API for external use is also provided.
This function had been previously added but was removed by commit
5e86bfde9c ("powerpc/mpic: remove unused functions") due to the
lack of a user. This function will be used by "powerpc/mpic: Add
get_version API both for internal and external use".
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
[scottwood@freescale.com: changelog update]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Replace one line asm-generic include files declared in
arch/powerpc/include/asm/ by generic-y declaration
which creates arch/powerpc/include/generated/asm equivalent.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ppc has special instruction forms to efficiently load and store values
in non-native endianness. These can be accessed via the arch-specific
{ld,st}_le{16,32}() inlines in arch/powerpc/include/asm/swab.h.
However, gcc is perfectly capable of generating the byte-reversing
load/store instructions when using the normal, generic cpu_to_le*() and
le*_to_cpu() functions eaning the arch-specific functions don't have much
point.
Worse the "le" in the names of the arch specific functions is now
misleading, because they always generate byte-reversing forms, but some
ppc machines can now run a little-endian kernel.
To start getting rid of the arch-specific forms, this patch removes them
from all the old Power Macintosh drivers, replacing them with the
generic byteswappers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While we are here, let us make timestamp related code y2038-safe.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With minor checks, we can move most of the code for nvram
under pseries to a common place to be re-used by other
powerpc platforms like powernv. This patch moves such
common code to arch/powerpc/kernel/nvram_64.c file.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
[mpe: Move select of ZLIB_DEFLATE to PPC64 to fix the build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions are only used from one place each. If the cacheable_*
versions really are more efficient, then those changes should be
migrated into the common code instead.
NOTE: The old routines are just flat buggy on kernels that support
hardware with different cacheline sizes.
Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch fixes the comments about ppc_md.pcibios_fixup(), which
should be called after allocating resources.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The flush_tlb hook in cpu_spec was introduced as a generic function hook
to invalidate TLBs. But the current implementation of flush_tlb hook
takes IS (invalidation selector) as an argument which is architecture
dependent. Hence, It is not right to have a generic routine where caller
has to pass non-generic argument.
This patch fixes this and makes flush_tlb hook as high level API.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Drop unused fsl_mpic_primary_get_version(), mpic_set_clk_ratio(),
mpic_set_serial_int().
+ fsl_mpic_primary_get_version() is just a safe wrapper around
fsl_mpic_get_version() for SMP configurations. While the latter is
called explicitly for handling PIC initialization and setting up error
interrupt vector depending on PIC hardware version, the former isn't
used for anything.
+ As for mpic_set_clk_ratio() and mpic_set_serial_int(), they both are
almost nine years old[1] but still have no chance to be called even from
out-of-tree modules because they both are __init and of course aren't
exported.
[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html
Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Cc: hongtao.jia@freescale.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Drop ucc_slow_poll_transmitter_now() which has no users since its
inception in 2007 in commit 9865853851 ("[POWERPC] Add QUICC
Engine (QE) infrastructure").
Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This removes definitions in opal-api.h that are completely unused in
Linux.
For each of these I see three possibilities, 1) we *should* be using
them in Linux and patches will arrive to do that, 2) they are not used
but should stay in the header to document the API for some important
reason, 3) they are not used and needn't be part of the API.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
This commit gets opal-api.h to mostly match the version in Skiboot as of
commit ea7d806ab0ba.
The exceptions are things which are not (currently) used in Linux.
Most of this is just whitespace and a few things moving around. I think
the diff is readable.
Also OpalMessageType became opal_msg_type, requiring a change in the
Linux code.
Finally Skiboot and Linux disagree on CAPI vs CXL, because CAPI means
something else in Linux. To handle that we just point the Linux wrapper,
which is named "cxl" to the OPAL token OPAL_PCI_SET_PHB_CAPI_MODE.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
We'd like to get to the stage where the OPAL API is defined in a header
that is identical between Linux and Skiboot.
As step one, split the bits that actually define the API into
opal-api.h. The Linux specific parts stay in opal.h.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
As our various loops (copy, string, crypto etc) get more complicated,
we want to share implementations between userspace (eg glibc) and
the kernel. We also want to write userspace test harnesses to put
in tools/testing/selftest.
One gratuitous difference between userspace and the kernel is the
VSX register definitions - the kernel uses vsrX whereas gcc uses
vsX.
Change the kernel to match userspace.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As our various loops (copy, string, crypto etc) get more complicated,
we want to share implementations between userspace (eg glibc) and
the kernel. We also want to write userspace test harnesses to put
in tools/testing/selftest.
One gratuitous difference between userspace and the kernel is the
VMX register definitions - the kernel uses vrX whereas both gcc and
glibc use vX.
Change the kernel to match userspace.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
After d905c5df9a ("PPC: POWERNV: move iommu_add_device earlier"), the
refcnt on the kobject backing the IOMMU group for a PCI device is
elevated by each call to pci_dma_dev_setup_pSeriesLP() (via
set_iommu_table_base_and_group). When we go to dlpar a multi-function
PCI device out:
iommu_reconfig_notifier ->
iommu_free_table ->
iommu_group_put
BUG_ON(tbl->it_group)
We trip this BUG_ON, because there are still references on the table, so
it is not freed. Fix this by moving the powernv bus notifier to common
code and calling it for both powernv and pseries.
Fixes: d905c5df9a ("PPC: POWERNV: move iommu_add_device earlier")
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement arch_irq_work_has_interrupt() for powerpc
Commit 9b01f5bf3 introduced a dependency on "IRQ work self-IPIs" for
full dynamic ticks to be enabled, by expecting architectures to
implement a suitable arch_irq_work_has_interrupt() routine.
Several arches have implemented this routine, including x86 (3010279f)
and arm (09f6edd4), but powerpc was omitted.
This patch implements this routine for powerpc.
The symptom, at boot (on powerpc systems) with "nohz_full=<CPU list>"
is displayed:
NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
after this patch:
NO_HZ: Full dynticks CPUs: <CPU list>.
Tested against 3.19.
powerpc implements "IRQ work self-IPIs" by setting the decrementer to 1 in
arch_irq_work_raise(), which causes a decrementer exception on the next
timebase tick. We then handle the work in __timer_interrupt().
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[mpe: Flesh out change log, fix ws & include guards, remove include of processor.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Here's the big char/misc driver update for 3.20-rc1.
Lots of little things in here, all described in the changelog. Nothing
major or unusual, except maybe the binder selinux stuff, which was all
acked by the proper selinux people and they thought it best to come
through this tree.
All of this has been in linux-next with no reported issues for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlTgs80ACgkQMUfUDdst+yn86gCeMLbxANGExVLd+PR46GNsAUQb
SJ4AmgIqrkIz+5LCwZWM02ldbYhPeBVf
=lfmM
-----END PGP SIGNATURE-----
Merge tag 'char-misc-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc patches from Greg KH:
"Here's the big char/misc driver update for 3.20-rc1.
Lots of little things in here, all described in the changelog.
Nothing major or unusual, except maybe the binder selinux stuff, which
was all acked by the proper selinux people and they thought it best to
come through this tree.
All of this has been in linux-next with no reported issues for a while"
* tag 'char-misc-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (90 commits)
coresight: fix function etm_writel_cp14() parameter order
coresight-etm: remove check for unknown Kconfig macro
coresight: fixing CPU hwid lookup in device tree
coresight: remove the unnecessary function coresight_is_bit_set()
coresight: fix the debug AMBA bus name
coresight: remove the extra spaces
coresight: fix the link between orphan connection and newly added device
coresight: remove the unnecessary replicator property
coresight: fix the replicator subtype value
pdfdocs: Fix 'make pdfdocs' failure for 'uio-howto.tmpl'
mcb: Fix error path of mcb_pci_probe
virtio/console: verify device has config space
ti-st: clean up data types (fix harmless memory corruption)
mei: me: release hw from reset only during the reset flow
mei: mask interrupt set bit on clean reset bit
extcon: max77693: Constify struct regmap_config
extcon: adc-jack: Release IIO channel on driver remove
extcon: Remove duplicated include from extcon-class.c
Drivers: hv: vmbus: hv_process_timer_expiration() can be static
Drivers: hv: vmbus: serialize Offer and Rescind offer
...
Common: Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other architectures).
This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes
or TCP_RR netperf tests). This also has to be enabled manually for now,
but the plan is to auto-tune this in the future.
ARM/ARM64: the highlights are support for GICv3 emulation and dirty page
tracking
s390: several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS: Bugfixes.
x86: Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested virtualization
improvements (nested APICv---a nice optimization), usual round of emulation
fixes. There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
ARM has other conflicts where functions are added in the same place
by 3.19-rc and 3.20 patches. These are not large though, and entirely
within KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJU28rkAAoJEL/70l94x66DXqQH/1TDOfJIjW7P2kb0Sw7Fy1wi
cEX1KO/VFxAqc8R0E/0Wb55CXyPjQJM6xBXuFr5cUDaIjQ8ULSktL4pEwXyyv/s5
DBDkN65mriry2w5VuEaRLVcuX9Wy+tqLQXWNkEySfyb4uhZChWWHvKEcgw5SqCyg
NlpeHurYESIoNyov3jWqvBjr4OmaQENyv7t2c6q5ErIgG02V+iCux5QGbphM2IC9
LFtPKxoqhfeB2xFxTOIt8HJiXrZNwflsTejIlCl/NSEiDVLLxxHCxK2tWK/tUXMn
JfLD9ytXBWtNMwInvtFm4fPmDouv2VDyR0xnK2db+/axsJZnbxqjGu1um4Dqbak=
=7gdx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
If an attacker can cause a controlled kernel stack overflow, overwriting
the restart block is a very juicy exploit target. This is because the
restart_block is held in the same memory allocation as the kernel stack.
Moving the restart block to struct task_struct prevents this exploit by
making the restart_block harder to locate.
Note that there are other fields in thread_info that are also easy
targets, at least on some architectures.
It's also a decent simplification, since the restart code is more or less
identical on all architectures.
[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the NUMA PTE bits and associated helpers. As a
side-effect it increases the maximum possible swap space on x86-64.
One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge has problems in
the past due to an different migration mechanism and a dependance on page
lock to serialise migrations and warrants a closer look.
task_work hinting update parallel fault
------------------------ --------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
pmd_none
do_huge_pmd_anonymous_page
read? pmd_lock blocks until hinting complete, fail !pmd_none test
write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
pmd_modify
set_pmd_at
task_work hinting update parallel migration
------------------------ ------------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
do_huge_pmd_numa_page
migrate_misplaced_transhuge_page
pmd_lock waits for updates to complete, recheck pmd_same
pmd_modify
set_pmd_at
Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>