linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 07:55:25 +07:00

Author	SHA1	Message	Date
Alistair Popple	6b3d12a948	powerpc/powernv: Fix TCE kill on NVLink2 Commit `616badd2fb` ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") forced all TCE kills to go via the OPAL call for NVLink2. However the PHB3 implementation of TCE kill was still being called directly from some functions which in some circumstances caused a machine check. This patch adds an equivalent IODA2 version of the function which uses the correct invalidation method depending on PHB model and changes all external callers to use it instead. Fixes: `616badd2fb` ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-05-03 20:45:55 +10:00
Alistair Popple	1ab66d1fba	powerpc/powernv: Introduce address translation services for Nvlink2 Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-04 13:27:26 +10:00
Alistair Popple	616badd2fb	powerpc/powernv: Use OPAL call for TCE kill on NVLink2 Add detection of NPU2 PHBs. NPU2/NVLink2 has a different register layout for the TCE kill register therefore TCE invalidation should be done via the OPAL call rather than using the register directly as it is for PHB3 and NVLink1. This changes TCE invalidation to use the OPAL call in the case of a NPU2 PHB model. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-01-30 20:34:53 +11:00
Krzysztof Kozlowski	00085f1efa	dma-mapping: use unsigned long for dma_attrs The dma-mapping core and the implementations do not change the DMA attributes passed by pointer. Thus the pointer can point to const data. However the attributes do not have to be a bitfield. Instead unsigned long will do fine: 1. This is just simpler. Both in terms of reading the code and setting attributes. Instead of initializing local attributes on the stack and passing pointer to it to dma_set_attr(), just set the bits. 2. It brings safeness and checking for const correctness because the attributes are passed by value. Semantic patches for this change (at least most of them): virtual patch virtual context @r@ identifier f, attrs; @@ f(..., - struct dma_attrs attrs + unsigned long attrs , ...) { ... } @@ identifier r.f; @@ f(..., - NULL + 0 ) and // Options: --all-includes virtual patch virtual context @r@ identifier f, attrs; type t; @@ t f(..., struct dma_attrs attrs); @@ identifier r.f; @@ f(..., - NULL + 0 ) Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> Acked-by: Mark Salter <msalter@redhat.com> [c6x] Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris] Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm] Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Joerg Roedel <jroedel@suse.de> [iommu] Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp] Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core] Acked-by: David Vrabel <david.vrabel@citrix.com> [xen] Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb] Acked-by: Joerg Roedel <jroedel@suse.de> [iommu] Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32] Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc] Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-08-04 08:50:07 -04:00
Benjamin Herrenschmidt	fd141d1a99	powerpc/powernv/pci: Rework accessing the TCE invalidate register It's architected, always in a known place, so there is no need to keep a separate pointer to it, we use the existing "regs", and we complement it with a real mode variant. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> # Conflicts: # arch/powerpc/platforms/powernv/pci-ioda.c # arch/powerpc/platforms/powernv/pci.h Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-07-17 16:42:47 +10:00
Benjamin Herrenschmidt	a34ab7c328	powerpc/powernv/pci: Rename TCE invalidation calls The TCE invalidation functions are fairly implementation specific, and while the IODA specs more/less describe the register, in practice various implementation workarounds may be required. So name the functions after the target PHB. Note today and for the foreseeable future, there's a 1:1 relationship between an IODA version and a PHB implementation. There exist another variant of IODA1 (Torrent) but we never supported in with OPAL and never will. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-07-17 16:42:46 +10:00
Ian Munsie	a2f67d5ee8	cxl: Add support for interrupts on the Mellanox CX4 The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where interrupts are routed from the networking hardware to the XSL using the MSIX table, and from there will be transformed back into an MSIX interrupt using the cxl style interrupts (i.e. using IVTE entries and ranges to map a PE and AFU interrupt number to an MSIX address). We want to hide the implementation details of cxl interrupts as much as possible. To this end, we use a special version of the MSI setup & teardown routines in the PHB while in cxl mode to allocate the cxl interrupts and configure the IVTE entries in the process element. This function does not configure the MSIX table - the CX4 card uses a custom format in that table and it would not be appropriate to fill that out in generic code. The rest of the functionality is similar to the "Full MSI-X mode" described in the CAIA, and this could be easily extended to support other adapters that use that mode in the future. The interrupts will be associated with the default context. If the maximum number of interrupts per context has been limited (e.g. by the mlx5 driver), it will automatically allocate additional kernel contexts to associate extra interrupts as required. These contexts will be started using the same WED that was used to start the default context. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-07-14 20:27:08 +10:00
Ian Munsie	4361b03430	powerpc/powernv: Add support for the cxl kernel api on the real phb This adds support for the peer model of the cxl kernel api to the PowerNV PHB, in which physical function 0 represents the cxl function on the card (an XSL in the case of the CX4), which other physical functions will use for memory access and interrupt services. It is referred to as the peer model as these functions are peers of one another, as opposed to the Virtual PHB model which forms a hierarchy. This patch exports APIs to enable the peer mode, check if a PCI device is attached to a PHB in this mode, and to set and get the peer AFU for this mode. The cxl driver will enable this mode for supported cards by calling pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note that this mode is enabled, and switch out it's controller_ops for the cxl version. The cxl version of the controller_ops struct implements it's own versions of the enable_device_hook and release_device to handle refcounting on the peer AFU and to allocate a default context for the device. Once enabled, the cxl kernel API may not be disabled on a PHB. Currently there is no safe way to disable cxl mode short of a reboot, so until that changes there is no reason to support the disable path. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-07-14 20:26:54 +10:00
Ian Munsie	f456834a6c	powerpc/powernv: Split cxl code out into a separate file The support for using the Mellanox CX4 in cxl mode will require additions to the PHB code. In preparation for this, move the existing cxl code out of pci-ioda.c into a separate pci-cxl.c file to keep things more organised. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-07-14 20:26:31 +10:00
Gavin Shan	c5f7700bbd	powerpc/powernv: Dynamically release PE This supports releasing PEs dynamically. A reference count is introduced to PE representing number of PCI devices associated with the PE. The reference count is increased when PCI device joins the PE and decreased when PCI device leaves the PE in pnv_pci_release_device(). When the count becomes zero, the PE and its consumed resources are released. Note that the count is accessed concurrently. So a counter with "int" type is enough here. In order to release the sources consumed by the PE, couple of helper functions are introduced as below: * pnv_pci_ioda1_unset_window() - Unset IODA1 DMA32 window * pnv_pci_ioda1_release_dma_pe() - Release IODA1 DMA32 segments * pnv_pci_ioda2_release_dma_pe() - Release IODA2 DMA resource * pnv_ioda_release_pe_seg() - Unmap IO/M32/M64 segments Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-06-21 15:30:55 +10:00
Gavin Shan	63803c39c8	powerpc/powernv: Setup PE for root bus There is no parent bridge for root bus, meaning pcibios_setup_bridge() isn't invoked for root bus. The PE for root bus is the ancestor of other PEs in PELTV. It means we need PE for root bus populated before all others. This populates the PE for root bus in pcibios_setup_bridge() path if it's not populated yet. The PE number next to the reserved one is used as the PE# to avoid holes in continuous M64 space. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-06-21 15:30:54 +10:00
Gavin Shan	c127562ae1	powerpc/powernv: Increase PE# capacity Each PHB maintains an array helping to translate 2-bytes Request ID (RID) to PE# with the assumption that PE# takes one byte, meaning that we can't have more than 256 PEs. However, pci_dn->pe_number already had 4-bytes for the PE#. This extends the PE# capacity for every PHB. After that, the PE number is represented by 4-bytes value. Then we can reuse IODA_INVALID_PE to check the PE# in phb->pe_rmap[] is valid or not. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-06-21 15:30:53 +10:00
Alexey Kardashevskiy	b5cb9ab1a0	powerpc/powernv/npu: Enable NVLink pass through IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which also has a couple of fast speed links (NVLink). The interface to links is exposed as an emulated PCI bridge which is included into the same IOMMU group as the corresponding GPU. In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE which behave pretty much as the standard IODA2 PHB except NPU PHB has just a single TVE in the hardware which means it can have either 32bit window or 64bit window or DMA bypass but never two of these. In order to make these links work when GPU is passed to the guest, these bridges need to be passed as well; otherwise performance will degrade. This implements and exports API to manage NPU state in regard to VFIO; it replicates iommu_table_group_ops. This defines a new pnv_pci_ioda2_npu_ops which is assigned to the IODA2 bridge if there are NPUs for a GPU on the bridge. The new callbacks call the default IODA2 callbacks plus new NPU API. This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2 table_group, it is not expected to fail as the helper is only called from the pnv_pci_ioda2_npu_ops. This does not define NPU-specific .release_ownership() so after VFIO is finished, DMA on NPU is disabled which is ok as the nvidia driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU. This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to the GPU group if any found. The helper uses helpers to look for the "ibm,gpu" property in the device tree which is a phandle of the corresponding GPU. This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main loop skips NPU PEs as they do not have 32bit DMA segments. As pnv_npu_set_window() and pnv_npu_unset_window() are started being used by the new IODA2-NPU IOMMU group, this makes the helpers public and adds the DMA window number parameter. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> [mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy	85674868ce	powerpc/powernv/npu: Rework TCE Kill handling The pnv_ioda_pe struct keeps an array of peers. At the moment it is only used to link GPU and NPU for 2 purposes: 1. Access NPU quickly when configuring DMA for GPU - this was addressed in the previos patch by removing use of it as DMA setup is not what the kernel would constantly do. 2. Invalidate TCE cache for NPU when it is invalidated for GPU. GPU and NPU are in different PE. There is already a mechanism to attach multiple iommu_table_group to the same iommu_table (used for VFIO), we can reuse it here so does this patch. This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are not needed anymore. While we are here, add TCE cache invalidation after enabling bypass. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy	7d623e4256	powerpc/powernv/ioda2: Export debug helper pe_level_printk() This exports debugging helper pe_level_printk() and corresponding macroses so they can be used in npu-dma.c. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:30 +10:00
Alexey Kardashevskiy	f9f8345674	powerpc/powernv/npu: Simplify DMA setup NPU devices are emulated in firmware and mainly used for NPU NVLink training; one NPU device is per a hardware link. Their DMA/TCE setup must match the GPU which is connected via PCIe and NVLink so any changes to the DMA/TCE setup on the GPU PCIe device need to be propagated to the NVLink device as this is what device drivers expect and it doesn't make much sense to do anything else. This makes NPU DMA setup explicit. pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda, made static and prints warning as dma_set_mask() should never be called on this function as in any case it will not configure GPU; so we make this explicit. Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will remove), we test every PCI device if there are corresponding NVLink devices. If there are any, we propagate bypass mode to just found NPU devices by calling the setup helper directly (which takes @bypass) and avoid guessing (i.e. calculating from DMA mask) whether we need bypass or not on NPU devices. Since DMA setup happens in very rare occasion, this will not slow down booting or VFIO start/stop much. This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it more clear what the function really does which is programming 32bit table address to the TVT ("disabling bypass" means writing zeroes to the TVT). This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as the DMA configuration on NPU does not matter until dma_set_mask() is called on GPU and that will do the NPU DMA configuration. This removes phb->dma_dev_setup initialization for NPU as pnv_pci_ioda_dma_dev_setup is no-op for it anyway. This stops using npe->tce_bypass_base as it never changes and values other than zero are not supported. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:29 +10:00
Alexey Kardashevskiy	0bbcdb437d	powerpc/powernv/npu: TCE Kill helpers cleanup NPU PHB TCE Kill register is exactly the same as in the rest of POWER8 so let's reuse the existing code for NPU. The only bit missing is a helper to reset the entire TCE cache so this moves such a helper from NPU code and renames it. Since pnv_npu_tce_invalidate() does really invalidate the entire cache, this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU. This adds an explicit comment for workaround for invalidating NPU TCE cache. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:29 +10:00
Gavin Shan	1e9167726c	powerpc/powernv: Use PE instead of number during setup and release In current implementation, the PEs that are allocated or picked from the reserved list are identified by PE number. The PE instance has to be picked according to the PE number eventually. We have same issue when PE is released. For pnv_ioda_pick_m64_pe() and pnv_ioda_alloc_pe(), this returns PE instance so that pnv_ioda_setup_bus_PE() can use the allocated or reserved PE instance directly. Also, pnv_ioda_setup_bus_PE() returns the reserved/allocated PE instance to be used in subsequent patches. On the other hand, pnv_ioda_free_pe() uses PE instance (not number) as its argument. No logical changes introduced. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:23 +10:00
Gavin Shan	2b923ed1bd	powerpc/powernv/ioda1: Improve DMA32 segment track In current implementation, the DMA32 segments required by one specific PE isn't calculated with the information hold in the PE independently. It conflicts with the PCI hotplug design: PE centralized, meaning the PE's DMA32 segments should be calculated from the information hold in the PE independently. This introduces an array (@dma32_segmap) for every PHB to track the DMA32 segmeng usage. Besides, this moves the logic calculating PE's consumed DMA32 segments to pnv_pci_ioda1_setup_dma_pe() so that PE's DMA32 segments are calculated/allocated from the information hold in the PE (DMA32 weight). Also the logic is improved: we try to allocate as much DMA32 segments as we can. It's acceptable that number of DMA32 segments less than the expected number are allocated. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:22 +10:00
Gavin Shan	801846d1de	powerpc/powernv: Remove DMA32 PE list PEs are put into PHB DMA32 list (phb->ioda.pe_dma_list) according to their DMA32 weight. The PEs on the list are iterated to setup their TCE32 tables at system booting time. The list is used for once at boot time and no need to keep it. This moves the logic calculating DMA32 weight of PHB and PE to pnv_ioda_setup_dma() to drop PHB's DMA32 list. Also, every PE traces the consumed DMA32 segment by @tce32_seg and @tce32_segcount are useless and they're removed. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:22 +10:00
Gavin Shan	93289d8c08	powerpc/powernv: Track M64 segment consumption When unplugging PCI devices, their parent PEs might be offline. The consumed M64 resource by the PEs should be released at that time. As we track M32 segment consumption, this introduces an array to the PHB to track the mapping between M64 segment and PE number. Note: M64 mapping isn't covered by pnv_ioda_setup_pe_seg() as IODA2 doesn't support the mapping explicitly while it's supported on IODA1. Until now, no M64 is supported on IODA1 in software. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:19 +10:00
Gavin Shan	689ee8c95f	powerpc/powernv: Data type unsigned int for PE number This changes the data type of PE number from "int" to "unsigned int" in order to match the fact PE number is never negative: * The number of PE to which the specified PCI device is attached. * The PE number map for SRIOV VFs. * The returned PE number from pnv_ioda_alloc_pe(). * The returned PE number from pnv_ioda2_pick_m64_pe(). Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-By: Alistair Popple <alistair@popple.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:17 +10:00
Gavin Shan	92b8f137b3	powerpc/powernv: Rename PE# fields in struct pnv_phb This renames the fields related to PE number in "struct pnv_phb" for better reflecting of their usages as Alexey suggested. No logical changes introduced. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:17 +10:00
Gavin Shan	13ce7598b6	powerpc/powernv: Reorder fields in struct pnv_phb This moves those fields in struct pnv_phb that are related to PE allocation around. No logical change. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:16 +10:00
Gavin Shan	475d92c27f	powerpc/powernv: Drop phb->bdfn_to_pe() The last usage of pnv_phb::bdfn_to_pe() was removed in `ff57b454dd` ("powerpc/eeh: Do probe on pci_dn"), so drop it. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-05-11 21:54:16 +10:00
Michael Ellerman	2527083cb8	powerpc fixes for 4.5 #3 - eeh: Fix partial hotplug criterion from Gavin Shan - mm: Clear the invalid slot information correctly from Aneesh Kumar K.V -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWzXquAAoJEFHr6jzI4aWADHsP/2lbwqz/vS3Ep4zlySHNvStL /DrRN2TN35THZ59FPRxgEfeqPxTCXtbpD6zEXwD0gf6m39I2zArhaQMOHXMtVPvV p0nAtwR0PX/PxlQTJDpHlg074vVAD7s3iuvad6oNQObLcXhoZ7wYtbStZ9Ithm4R YfqZTelzsw+GfMuTYnvAQf5aoRYztUpy7OheaJbbDmSZgMFwF896ZPJnaG9rAOPE xcSsRaSfHiUR2NE2ua1K5yya+1ilZqrZhib7QxXgzGuxoVa2AAiPR7Hpx2kX1Wm+ z0DqPXISzRbVf9zyLgWD3TpJ4OMHI/CYVW+t/Gx/yWCMfNcfavUrh0vPdHRVEPZu zxmIUoI6yv7jQ6bcfdzR5s0Mr5pYWlUj5MZg2r8aGqloYcLPk5DiENg+c0QmKI05 kQPCBoQz2ezzJWAt1BYshkc+mlimv3ODaNWFP34Nc6kcDaSO6a0rhVOecvKuR6dv UBNpeh5np1rKq1wX0ri0yAmnm//yXqe+bK0I8Ctipi0++e73sVJGzfFdVvXwEhhW h+v1BkdgW8WK/xlH+JCPiXd5dfXrUeFI0D65Kgpb7IbFc9hcXDmp2Dv7+8zx/Wcl L2NpuucSDxi+LHkE10QiypgLWSKjn9OSi8PLocqABNXG8uHxIp54jRfyViBNALXF XlPveqTgpt7On3aa0qVh =bk3U -----END PGP SIGNATURE----- Merge tag 'powerpc-4.5-4' into next Pull in our current fixes from 4.5, in particular the "Fix Multi hit ERAT" bug is causing folks some grief when testing next.	2016-02-25 21:52:58 +11:00
Gavin Shan	1bc74f1ccd	powerpc/powernv: Fix stale PE primary bus When PCI bus is unplugged during full hotplug for EEH recovery, the platform PE instance (struct pnv_ioda_pe) isn't released and it dereferences the stale PCI bus that has been released. It leads to kernel crash when referring to the stale PCI bus. This fixes the issue by correcting the PE's primary bus when it's oneline at plugging time, in pnv_pci_dma_bus_setup() which is to be called by pcibios_fixup_bus(). Cc: stable@vger.kernel.org # v4.1+ Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reported-by: Pradipta Ghosh <pradghos@in.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-02-15 21:10:04 +11:00
Russell Currey	2de50e9674	powerpc/powernv: Remove support for p5ioc2 "p5ioc2 is used by approximately 2 machines in the world, and has never ever been a supported configuration." The code for p5ioc2 is essentially unused and complicates what is already a very complicated codebase. Its removal is essentially a "free win" in the effort to simplify the powernv PCI code. In addition, support for p5ioc2 has been dropped from skiboot. There's no reason to keep it around in the kernel. Signed-off-by: Russell Currey <ruscur@russell.cc> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2016-02-08 22:03:37 +11:00
Alistair Popple	5d2aa710e6	powerpc/powernv: Add support for Nvlink NPUs NVLink is a high speed interconnect that is used in conjunction with a PCI-E connection to create an interface between CPU and GPU that provides very high data bandwidth. A PCI-E connection to a GPU is used as the control path to initiate and report status of large data transfers sent via the NVLink. On IBM Power systems the NVLink processing unit (NPU) is similar to the existing PHB3. This patch adds support for a new NPU PHB type. DMA operations on the NPU are not supported as this patch sets the TCE translation tables to be the same as the related GPU PCIe device for each NVLink. Therefore all DMA operations are setup and controlled via the PCIe device. EEH is not presently supported for the NPU devices, although it may be added in future. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-12-17 22:41:00 +11:00
Andrew Donnellan	53522982fc	powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops Simplify the dma_get_required_mask call chain by moving it from pnv_phb to pci_controller_ops, similar to commit `763d2d8df1` ("powerpc/powernv: Move dma_set_mask from pnv_phb to pci_controller_ops"). Previous call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that points to pnv_dma_get_required_mask() (platforms/powernv/setup.c) 2) device is PCI, therefore call pnv_pci_dma_get_required_mask() (platforms/powernv/pci.c) 3) call phb->dma_get_required_mask if it exists 4) it only exists in the ioda case, where it points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) New call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask if it exists 2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) In the p5ioc2 case, the call chain remains the same - dma_get_required_mask() does not find either a ppc_md call or pci_controller_ops call, so it calls __dma_get_required_mask(). Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-08-18 19:32:11 +10:00
Gavin Shan	26ba248d52	powerpc/powernv: Pick M64 PEs based on BARs On PHB3, PE might be reserved in advance to reflect the M64 segments consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI devices included in the PE. The PE is picked based on M64 BARs instead of the bridge's M64 windows, which might include VF BARs. Otherwise, wrong PE could be picked. The patch calculates the used M64 segments and PE numbers according to the M64 BARs, excluding VF BARs, of PCI devices in one particular PE, instead of the bridge's M64 windows. Then the right PE number is picked. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-07-13 16:12:01 +10:00
Gavin Shan	d1203852df	powerpc/powernv: Boolean argument for pnv_ioda_setup_bus_PE() The patch changes the type of last argument of pnv_ioda_setup_bus_PE() and phb::pick_m64_pe() to boolean. No functional change. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-07-13 16:12:00 +10:00
Gavin Shan	96a2f92bf8	powerpc/powernv: Reserve M64 PEs based on BARs On PHB3, some PEs might be reserved in advance to reflect the M64 segments consumed by those PEs. We're reserving PEs based on the M64 window of root port, which might contain VF BAR. The PEs for VFs are allocated dynamically, not reserved based on the consumed M64 segments. So the M64 window of root port isn't reliable for the task. Instead, we go through M64 BARs (VF BARs excluded) of PCI devices under the specified root bus and reserve PEs accordingly, as the patch does. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-07-13 16:12:00 +10:00
Alexey Kardashevskiy	05c6cfb9dc	powerpc/iommu/powernv: Release replaced TCE At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives linear mapping address; and returns a physical address as the clear() does. This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both VFIO ioctl interface in in-kernel TCE acceleration (when it becomes available later). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy	5780fb0426	powerpc/powernv/ioda2: Move TCE kill register address to PE At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property which contains the TCE kill register address. Writing to this register invalidates TCE cache on IODA/IODA2 hub. This moves the register address from iommu_table to pnv_pnb as this register belongs to PHB and invalidates TCE cache for all tables of all attached PEs. This moves the property reading/remapping code to a helper which is called when DMA is being configured for PE and which does DMA setup for both IODA1 and IODA2. This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which invalidates cache for the entire table. It should be called after every call to opal_pci_map_pe_dma_window(). It was not required before because there was just a single TCE table and 64bit DMA was handled via bypass window (which has no table so no cache was used) but this is going to change with Dynamic DMA windows (DDW). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-11 15:16:48 +10:00
Alexey Kardashevskiy	0eaf4defc7	powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group So far one TCE table could only be used by one IOMMU group. However IODA2 hardware allows programming the same TCE table address to multiple PE allowing sharing tables. This replaces a single pointer to a group in a iommu_table struct with a linked list of groups which provides the way of invalidating TCE cache for every PE when an actual TCE table is updated. This adds pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group() helpers to manage the list. However without VFIO, it is still going to be a single IOMMU group per iommu_table. This changes iommu_add_device() to add a device to a first group from the group list of a table as it is only called from the platform init code or PCI bus notifier and at these moments there is only one group per table. This does not change TCE invalidation code to loop through all attached groups in order to simplify this patch and because it is not really needed in most cases. IODA2 is fixed in a later patch. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-11 15:16:15 +10:00
Alexey Kardashevskiy	b348aa6529	powerpc/spapr: vfio: Replace iommu_table with iommu_table_group Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds a iommu_table_group container for TCE tables. Right now just one table is supported. This defines iommu_table_group struct which stores pointers to iommu_group and iommu_table(s). This replaces iommu_table with iommu_table_group where iommu_table was used to identify a group: - iommu_register_group(); - iommudata of generic iommu_group; This removes @data from iommu_table as it_table_group provides same access to pnv_ioda_pe. For IODA, instead of embedding iommu_table, the new iommu_table_group keeps pointers to those. The iommu_table structs are allocated dynamically. For P5IOC2, both iommu_table_group and iommu_table are embedded into PE struct. As there is no EEH and SRIOV support for P5IOC2, iommu_free_table() should not be called on iommu_table struct pointers so we can keep it embedded in pnv_phb::p5ioc2. For pSeries, this replaces multiple calls of kzalloc_node() with a new iommu_pseries_alloc_group() helper and stores the table group struct pointer into the pci_dn struct. For release, a iommu_table_free_group() helper is added. This moves iommu_table struct allocation from SR-IOV code to the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized. This change is here because those lines had to be changed anyway. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-11 15:14:57 +10:00
Alexey Kardashevskiy	da004c3600	powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table This adds a iommu_table_ops struct and puts pointer to it into the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong to. This adds the requirement for @it_ops to be initialized before calling iommu_init_table() to make sure that we do not leave any IOMMU table with iommu_table_ops uninitialized. This is not a parameter of iommu_init_table() though as there will be cases when iommu_init_table() will not be called on TCE tables, for example - VFIO. This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" redundant prefixes. This removes tce_xxx_rm handlers from ppc_md but does not add them to iommu_table_ops as this will be done later if we decide to support TCE hypercalls in real mode. This removes _vm callbacks as only virtual mode is supported by now so this also removes @rm parameter. For pSeries, this always uses tce_buildmulti_pSeriesLP/ tce_buildmulti_pSeriesLP. This changes multi callback to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is we still have to support "multitce=off" boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace "multi" callbacks with single ones. For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2. This makes the callbacks for them public. Later patches will extend callbacks for IODA1/2. No change in behaviour is expected. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-11 15:14:56 +10:00
Michael Neuling	7a8e6bbf85	powerpc/pci: Add shutdown hook to pci_controller_ops Currently pnv_pci_shutdown() calls the PHB shutdown code for all PHBs in the system. It dereferences the private_data assuming it's a powernv PHB, which won't be the case when we have different PHB in the systems (like when we add vPHBs for CXL). This moves the shutdown hook to the pci_controller_ops and fixes the call site to use that instead. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-03 13:27:16 +10:00
Daniel Axtens	763d2d8df1	powerpc/powernv: Move dma_set_mask() from pnv_phb to pci_controller_ops Previously, dma_set_mask() on powernv was convoluted: 0) Call dma_set_mask() (a/p/kernel/dma.c) 1) In dma_set_mask(), ppc_md.dma_set_mask() exists, so call it. 2) On powernv, that function pointer is pnv_dma_set_mask(). In pnv_dma_set_mask(), the device is pci, so call pnv_pci_dma_set_mask(). 3) In pnv_pci_dma_set_mask(), call pnv_phb->set_dma_mask() if it exists. 4) It only exists in the ioda case, where it points to pnv_pci_ioda_dma_set_mask(), which is the final function. So the call chain is: dma_set_mask() -> pnv_dma_set_mask() -> pnv_pci_dma_set_mask() -> pnv_pci_ioda_dma_set_mask() Both ppc_md and pnv_phb function pointers are used. Rip out the ppc_md call, pnv_dma_set_mask() and pnv_pci_dma_set_mask(). Instead: 0) Call dma_set_mask() (a/p/kernel/dma.c) 1) In dma_set_mask(), the device is pci, and pci_controller_ops.dma_set_mask() exists, so call pci_controller_ops.dma_set_mask() 2) In the ioda case, that points to pnv_pci_ioda_dma_set_mask(). The new call chain is dma_set_mask() -> pnv_pci_ioda_dma_set_mask() Now only the pci_controller_ops function pointer is used. The fallback paths for p5ioc2 are the same. Previously, pnv_pci_dma_set_mask() would find no pnv_phb->set_dma_mask() function, to it would call __set_dma_mask(). Now, dma_set_mask() finds no ppc_md call or pci_controller_ops call, so it calls __set_dma_mask(). Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-02 13:18:49 +10:00
Daniel Axtens	92ae035326	powerpc/powernv: Specialise pci_controller_ops for each controller type Remove powernv generic PCI controller operations. Replace it with controller ops for each of the two supported PHBs. As an added bonus, make the two new structs const, which will help guard against bugs such as the one introduced in `65ebf4b63` ("powerpc/powernv: Move controller ops from ppc_md to controller_ops") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2015-06-02 13:18:49 +10:00
Wei Yang	781a868f31	powerpc/powernv: Shift VF resource with an offset On PowerNV platform, resource position in M64 BAR implies the PE# the resource belongs to. In some cases, adjustment of a resource is necessary to locate it to a correct position in M64 BAR . This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address according to an offset. Note: After doing so, there would be a "hole" in the /proc/iomem when offset is a positive value. It looks like the device return some mmio back to the system, which actually no one could use it. [bhelgaas: rework loops, rework overlap check, index resource[] conventionally, remove pci_regs.h include, squashed with next patch] Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-31 13:02:38 +11:00
Wei Yang	9e8d4a19ab	powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe and was embedded in it. The pnv_ioda_pe was assigned to a PE on the bootup stage. Since PEs are based on the hardware layout which is static in the system, they will never get released. This means the iommu_table in the pnv_ioda_pe will never get released either. This no longer works for VF PE. VF PEs are created and released dynamically when VFs are created and released. So we need to assign pnv_ioda_pe to VF PEs respectively when VFs are enabled and clean up those resources for VF PE when VFs are disabled. And iommu_table is one of the resources we need to handle dynamically. Current iommu_table is a static field in pnv_ioda_pe, which will face a problem when freeing it. During the disabling of a VF, pnv_pci_ioda2_release_dma_pe will call iommu_free_table to release the iommu_table for this PE. A static iommu_table will fail in iommu_free_table. According to these requirement, this patch allocates iommu_table dynamically. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-31 13:02:37 +11:00
Gavin Shan	3532a741f8	powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor The PCI config accessors previously relied on device_node. Unfortunately, VFs don't have a corresponding device_node, so change the accessors to use pci_dn instead. [bhelgaas: changelog] Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-24 13:15:50 +11:00
Gavin Shan	2f6cf79448	powerpc/powernv: Remove unused file The patch removes unused file eeh-ioda.c and updates makefile accordingly. Besides, the definition of "struct pnv_eeh_ops" and the instances are all removed. Until now, the chip layer of EEH implementation for PowerNV platform is removed completely. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:20 +11:00
Gavin Shan	cadf364d14	powerpc/powernv: Drop PHB operation reset() The patch drops PHB EEH operation reset() and merges its logic to eeh_ops::reset(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:19 +11:00
Gavin Shan	2a485ad7c8	powerpc/powernv: Drop PHB operation next_error() The patch drops PHB EEH operation next_error() and merges its logic to eeh_ops::next_error(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:19 +11:00
Gavin Shan	40ae5f693f	powerpc/powernv: Drop PHB operation get_state() The patch drops PHB EEH operation get_state() and merges its logic to eeh_ops::get_state(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:19 +11:00
Gavin Shan	7e3e4f8d5e	powerpc/powernv: Drop PHB operation set_option() The patch drops PHB EEH operation set_option() and merges its logic to eeh_ops::set_option(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:19 +11:00
Gavin Shan	bbe170ede1	powerpc/powernv: Drop PHB operation configure_bridge() The patch drops PHB EEH operation configure_bridge() and merges its logic to eeh_ops::configure_bridge(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-17 10:31:19 +11:00

1 2

89 Commits