Commit Graph

55 Commits

Author SHA1 Message Date
Greg Kurz
02c5f53949 powerpc/powernv/npu: Fix reference leak
Since 902bdc5745, get_pci_dev() calls pci_get_domain_bus_and_slot(). This
has the effect of incrementing the reference count of the PCI device, as
explained in drivers/pci/search.c:

 * Given a PCI domain, bus, and slot/function number, the desired PCI
 * device is located in the list of PCI devices. If the device is
 * found, its reference count is increased and this function returns a
 * pointer to its data structure.  The caller must decrement the
 * reference count by calling pci_dev_put().  If no device is found,
 * %NULL is returned.

Nothing was done to call pci_dev_put() and the reference count of GPU and
NPU PCI devices rockets up.

A natural way to fix this would be to teach the callers about the change,
so that they call pci_dev_put() when done with the pointer. This turns
out to be quite intrusive, as it affects many paths in npu-dma.c,
pci-ioda.c and vfio_pci_nvlink2.c. Also, the issue appeared in 4.16 and
some affected code got moved around since then: it would be problematic
to backport the fix to stable releases.

All that code never cared for reference counting anyway. Call pci_dev_put()
from get_pci_dev() to revert to the previous behavior.

Fixes: 902bdc5745 ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn")
Cc: stable@vger.kernel.org # v4.16
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-02 19:39:36 +10:00
Heiner Kallweit
51c51a48de powerpc/powernv/npu: Use pci_dev_id() helper
Use new helper pci_dev_id() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2019-04-29 16:12:28 -05:00
Peter Xu
1b58a975be powerpc/powernv/npu: Remove redundant change_pte() hook
The change_pte() notifier was designed to use as a quick path to
update secondary MMU PTEs on write permission changes or PFN changes.
For KVM, it could reduce the vm-exits when vcpu faults on the pages
that was touched up by KSM. It's not used to do cache invalidations,
for example, if we see the notifier will be called before the real PTE
update after all (please see set_pte_at_notify that set_pte_at was
called later).

All the necessary cache invalidation should all be done in
invalidate_range() already.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22 00:10:14 +11:00
Michael Ellerman
d0055df0c9 Merge branch 'topic/dma' into next
Merge hch's big DMA rework series. This is in a topic branch in case he
wants to merge it to minimise conflicts.
2019-02-21 23:15:10 +11:00
Michael Ellerman
637cfeb9f9 Merge branch 'fixes' into next
There's a few important fixes in our fixes branch, in particular the
pgd/pud_present() one, so merge it now.
2019-02-19 19:56:26 +11:00
Christoph Hellwig
68005b67d1 powerpc/dma: use the generic direct mapping bypass
Now that we've switched all the powerpc nommu and swiotlb methods to
use the generic dma_direct_* calls we can remove these ops vectors
entirely and rely on the common direct mapping bypass that avoids
indirect function calls entirely.  This also allows to remove a whole
lot of boilerplate code related to setting up these operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-18 22:41:04 +11:00
Alexey Kardashevskiy
797eadd9c8 powerpc/powernv/npu: Remove obsolete comment about TCE_KILL_INVAL_ALL
TCE_KILL_INVAL_ALL has moved long ago but the comment was forgotted so
finish the move and remove the comment.

Fixes: 0bbcdb437d "powerpc/powernv/npu: TCE Kill helpers cleanup"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-01-15 11:17:09 +11:00
Dan Carpenter
d7b6cc199b powerpc/powernv/npu: Allocate enough memory in pnv_try_setup_npu_table_group()
There is a typo so we accidentally allocate enough memory for a pointer
when we wanted to allocate enough for a struct.

Fixes: 0bd971676e ("powerpc/powernv/npu: Add compound IOMMU groups")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-01-11 23:44:54 +11:00
Alexey Kardashevskiy
58629c0dc3 powerpc/powernv/npu: Fault user page into the hypervisor's pagetable
When a page fault happens in a GPU, the GPU signals the OS and the GPU
driver calls the fault handler which populated a page table; this allows
the GPU to complete an ATS request.

On the bare metal get_user_pages() is enough as it adds a pte to
the kernel page table but under KVM the partition scope tree does not get
updated so ATS will still fail.

This reads a byte from an effective address which causes HV storage
interrupt and KVM updates the partition scope tree.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
135ef95405 powerpc/powernv/npu: Check mmio_atsd array bounds when populating
A broken device tree might contain more than 8 values and introduce hard
to debug memory corruption bug. This adds the boundary check.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
1b785611e1 powerpc/powernv/npu: Add release_ownership hook
In order to make ATS work and translate addresses for arbitrary
LPID and PID, we need to program an NPU with LPID and allow PID wildcard
matching with a specific MSR mask.

This implements a helper to assign a GPU to LPAR and program the NPU
with a wildcard for PID and a helper to do clean-up. The helper takes
MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for
ATS checkout requests support.

This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
from the VFIO driver.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
0bd971676e powerpc/powernv/npu: Add compound IOMMU groups
At the moment the powernv platform registers an IOMMU group for each
PE. There is an exception though: an NVLink bridge which is attached
to the corresponding GPU's IOMMU group making it a master.

Now we have POWER9 systems with GPUs connected to each other directly
bypassing PCI. At the moment we do not control state of these links so
we have to put such interconnected GPUs to one IOMMU group which means
that the old scheme with one GPU as a master won't work - there will
be up to 3 GPUs in such group.

This introduces a npu_comp struct which represents a compound IOMMU
group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for
NVLink bridges). This converts the existing NVLink1 code to use the
new scheme. >From now on, each PE must have a valid
iommu_table_group_ops which will either be called directly (for a
single PE group) or indirectly from a compound group handlers.

This moves IOMMU group registration for NVLink-connected GPUs to
npu-dma.c. For POWER8, this stores a new compound group pointer in the
PE (so a GPU is still a master); for POWER9 the new group pointer is
stored in an NPU (which is allocated per a PCI host controller).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
83fb8ccf97 powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops
At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
IOMMU groups with several PEs from several different PHB (such as
interconnected GPUs and NPUs) so there will be no single master but
a one big IOMMU group.

This makes a first step and converts an NPU PE with a set of extern
function to a table group.

This should cause no behavioral change. Note that
pnv_npu_release_ownership() has never been implemented.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
b04149c2dd powerpc/powernv/npu: Move single TVE handling to NPU PE
Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only
one which points to one of two tables of the corresponding PCI PE.

So whenever a new DMA window is programmed to PEs, the NPU PE needs to
release old table in order to use the new one.

Commit d41ce7b1bc ("powerpc/powernv/npu: Do not try invalidating 32bit
table when 64bit table is enabled") did just that but in pci-ioda.c
while it actually belongs to npu-dma.c.

This moves the single TVE handling to npu-dma.c. This does not implement
restoring though as it is highly unlikely that we can set the table to
PCI PE and cannot to NPU PE and if that fails, we could only set 32bit
table to NPU PE and this configuration is not really supported or wanted.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
0e759bd752 powerpc/powernv/npu: Move OPAL calls away from context manipulation
When introduced, the NPU context init/destroy helpers called OPAL which
enabled/disabled PID (a userspace memory context ID) filtering in an NPU
per a GPU; this was a requirement for P9 DD1.0. However newer chip
revision added a PID wildcard support so there is no more need to
call OPAL every time a new context is initialized. Also, since the PID
wildcard support was added, skiboot does not clear wildcard entries
in the NPU so these remain in the hardware till the system reboot.

This moves LPID and wildcard programming to the PE setup code which
executes once during the booting process so NPU2 context init/destroy
won't need to do additional configuration.

This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
this is the way to tell if the NPU support is present and configured.

This moves pnv_npu2_init() declaration as pseries should be able to use it.
This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
call that. This exports pnv_npu2_map_lpar_dev() as following patches
will use it from the VFIO driver.

While at it, replace redundant list_for_each_entry_safe() with
a simpler list_for_each_entry().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
46a1449d9e powerpc/powernv: Move npu struct from pnv_phb to pci_controller
The powernv PCI code stores NPU data in the pnv_phb struct. The latter
is referenced by pci_controller::private_data. We are going to have NPU2
support in the pseries platform as well but it does not store any
private_data in in the pci_controller struct; and even if it did,
it would be a different data structure.

This makes npu a pointer and stores it one level higher in
the pci_controller struct.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy
fa1ada7889 powerpc/powernv/npu: Remove unused headers and a macro.
The macro and few headers are not used so remove them.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-20 22:59:03 +11:00
Alistair Popple
3182215dd0 powerpc/powernv/npu: Remove NPU DMA ops
The NPU IOMMU is setup to mirror the parent PCIe device IOMMU
setup. Therefore it does not make sense to call dma operations such as
dma_map_page(), etc. directly on these devices. The existing dma_ops
simply print a warning if they are ever called, however this is
unnecessary and the warnings are likely to go unnoticed.

It is instead simpler to remove these operations and let the generic
DMA code print warnings (eg. via a NULL pointer deref) in cases of
buggy drivers attempting dma operations on NVLink devices.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-11-05 16:05:22 +11:00
Mark Hairgrove
f86ad3e019 powerpc/powernv/npu: Remove atsd_threshold debugfs setting
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:54 +10:00
Mark Hairgrove
3689c37d23 powerpc/powernv/npu: Use size-based ATSD invalidates
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.

The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.

ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.

With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.

To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.

One NPU (1 GPU):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.3        5.6           5%
1M         39.3       57.4          46%
2M         49.7       82.6          66%
4M        286.6      285.7           0%

Two NPUs (6 GPUs):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         6.5        7.4          13%
1M         33.4       67.9         103%
2M         38.7       93.1         141%
4M        356.7      354.6          -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:53 +10:00
Mark Hairgrove
7ead15a144 powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
        - EIEIO
        - Write the launch value to issue the ATSD

First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.

Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
    EIEIO
    for each NPU
        - Write the launch value to issue the ATSD

Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.

With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).

With two NPUs active (on a 6-GPU system) the effect is more noticeable:

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.9        6.5          10%
1M         31.2       33.4           7%
2M         36.3       38.7           7%
4M        322.6      356.7          11%

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:52 +10:00
Reza Arbab
9eab9901b0 powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage
We've encountered a performance issue when multiple processors stress
{get,put}_mmio_atsd_reg(). These functions contend for
mmio_atsd_usage, an unsigned long used as a bitmask.

The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
and clear_bit_unlock(). As implemented, both of these will require
a (successful) stwcx to that same cache line.

What we end up with is thread A, attempting to unlock, being slowed by
other threads repeatedly attempting to lock. A's stwcx instructions
fail and retry because the memory reservation is lost every time a
different thread beats it to the punch.

There may be a long-term way to fix this at a larger scale, but for
now resolve the immediate problem by gating our call to
test_and_set_bit_lock() with one to test_bit(), which is obviously
implemented without using a store.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-03 20:09:01 +10:00
Alistair Popple
99c3ce33a0 powerpc/powernv/npu: Add a debugfs setting to change ATSD threshold
The threshold at which it becomes more efficient to coalesce a range
of ATSDs into a single per-PID ATSD is currently not well understood
due to a lack of real-world work loads. This patch adds a debugfs
parameter allowing the threshold to be altered at runtime in order to
aid future development and refinement of the value.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-19 21:58:10 +10:00
Michael Ellerman
c786cf767b powerpc/powernv: Use __raw_[rm_]writeq_be() in npu-dma.c
This allows us to squash some sparse warnings and also avoids having
to do explicity endian conversions in the code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
2018-05-18 22:00:04 +10:00
Alistair Popple
d0cf9b561c powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range
The NPU has a limited number of address translation shootdown (ATSD)
registers and the GPU has limited bandwidth to process ATSDs. This can
result in contention of ATSD registers leading to soft lockups on some
threads, particularly when invalidating a large address range in
pnv_npu2_mn_invalidate_range().

At some threshold it becomes more efficient to flush the entire GPU
TLB for the given MM context (PID) than individually flushing each
address in the range. This patch will result in ranges greater than
2MB being converted from 32+ ATSDs into a single ATSD which will flush
the TLB for the given PID on each GPU.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-24 09:46:57 +10:00
Alistair Popple
a1409adac7 powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_contex() callback parameters
There is a single npu context per set of callback parameters. Callers
should be prevented from overwriting existing callback values so
instead return an error if different parameters are passed.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-24 09:46:57 +10:00
Alistair Popple
28a5933e8d powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy
The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions
are used to allocate/free contexts to allow address translation and
shootdown by the NPU on a particular GPU. Context initialisation is
implicitly safe as it is protected by the requirement mmap_sem be held
in write mode, however pnv_npu2_destroy_context() does not require
mmap_sem to be held and it is not safe to call with a concurrent
initialisation for a different GPU.

It was assumed the driver would ensure destruction was not called
concurrently with initialisation. However the driver may be simplified
by allowing concurrent initialisation and destruction for different
GPUs. As npu context creation/destruction is not a performance
critical path and the critical section is not large a single spinlock
is used for simplicity.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-24 09:46:56 +10:00
Mark Hairgrove
720c84046c powerpc/npu-dma.c: Fix crash after __mmu_notifier_register failure
pnv_npu2_init_context wasn't checking the return code from
__mmu_notifier_register. If  __mmu_notifier_register failed, the
npu_context was still assigned to the mm and the caller wasn't given any
indication that things went wrong. Later on pnv_npu2_destroy_context would
be called, which in turn called mmu_notifier_unregister and dropped
mm->mm_count without having incremented it in the first place. This led to
various forms of corruption like mm use-after-free and mm double-free.

__mmu_notifier_register can fail with EINTR if a signal is pending, so
this case can be frequent.

This patch calls opal_npu_destroy_context on the failure paths, and makes
sure not to assign mm->context.npu_context until past the failure points.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Acked-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-14 20:04:43 +11:00
Alistair Popple
2b74e2a9b3 powerpc/powernv/npu: Fix deadlock in mmio_invalidate()
When sending TLB invalidates to the NPU we need to send extra flushes due
to a hardware issue. The original implementation would lock the all the
ATSD MMIO registers sequentially before unlocking and relocking each of
them sequentially to do the extra flush.

This introduced a deadlock as it is possible for one thread to hold one
ATSD register whilst waiting for another register to be freed while the
other thread is holding that register waiting for the one in the first
thread to be freed.

For example if there are two threads and two ATSD registers:

  Thread A	Thread B
  ----------------------
  Acquire 1
  Acquire 2
  Release 1	Acquire 1
  Wait 1	Wait 2

Both threads will be stuck waiting to acquire a register resulting in an
RCU stall warning or soft lockup.

This patch solves the deadlock by refactoring the code to ensure registers
are not released between flushes and to ensure all registers are either
acquired or released together and in order.

Fixes: bbd5ff50af ("powerpc/powernv/npu-dma: Add explicit flush when sending an ATSD")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-13 15:50:29 +11:00
Alexey Kardashevskiy
902bdc5745 powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
The pcidev value stored in pci_dn is only used for NPU/NPU2
initialization. We can easily drop the cached pointer and
use an ancient helper - pci_get_domain_bus_and_slot() instead in order
to reduce complexity.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27 20:39:01 +11:00
Frederic Barrat
7f2c39e91f powerpc/powernv: Introduce new PHB type for opencapi links
The NPU was already abstracted by opal as a virtual PHB for nvlink,
but it helps to be able to differentiate between a nvlink or opencapi
PHB, as it's not completely transparent to linux. In particular, PE
assignment differs and we'll also need the information in later
patches.

So rename existing PNV_PHB_NPU type to PNV_PHB_NPU_NVLINK and add a
new type PNV_PHB_NPU_OCAPI.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-24 11:42:56 +11:00
Alistair Popple
1b2c2b1238 powerpc/powernv/npu: Don't explicitly flush nmmu tlb
The nest mmu required an explicit flush as a tlbi would not flush it in the
same way as the core. However an alternate firmware fix exists which should
eliminate the need for this flush, so instead add a device-tree property
(ibm,nmmu-flush) on the NVLink2 PHB to enable it only if required.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-13 08:00:30 +11:00
Alistair Popple
2a31ad093b powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()
With the optimisations introduced by commit a46cc7a908 ("powerpc/mm/radix:
Improve TLB/PWC flushes"), flush_tlb_mm() no longer flushes the page walk
cache with radix. Switch to using flush_all_mm() to ensure the pwc and tlb
are properly flushed on the nmmu.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-13 08:00:29 +11:00
Linus Torvalds
bac65d9d87 powerpc updates for 4.14
Nothing really major this release, despite quite a lot of activity. Just lots of
 things all over the place.
 
 Some things of note include:
 
  - Access via perf to a new type of PMU (IMC) on Power9, which can count both
    core events as well as nest unit events (Memory controller etc).
 
  - Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page
    Walk Cache (PWC) flushes when the structure of the tree is not changing.
 
  - Reworks/cleanups of do_page_fault() to modernise it and bring it closer to
    other architectures where possible.
 
  - Rework of our page table walking so that THP updates only need to send IPIs
    to CPUs where the affected mm has run, rather than all CPUs.
 
  - The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems.
    This avoids problems with the percpu allocator on systems with very sparse
    NUMA layouts.
 
  - STRICT_KERNEL_RWX support on PPC32.
 
  - A new sched domain topology for Power9, to capture the fact that pairs of
    cores may share an L2 cache.
 
  - Power9 support for VAS, which is a new mechanism for accessing coprocessors,
    and initial support for using it with the NX compression accelerator.
 
  - Major work on the instruction emulation support, adding support for many new
    instructions, and reworking it so it can be used to implement the emulation
    needed to fixup alignment faults.
 
  - Support for guests under PowerVM to use the Power9 XIVE interrupt controller.
 
 And probably that many things again that are almost as interesting, but I had to
 keep the list short. Plus the usual fixes and cleanups as always.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju
   T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal,
   Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter,
   Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand,
   Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE
   Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro
   Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot,
   Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica
   Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood,
   Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding,
   Victor Aoqui.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZr83SAAoJEFHr6jzI4aWA6pUP/3CEaj2bSxNzWIwidqyYjuoS
 O1moEsP0oYH7eBEWVHalYxvo0QPIIAhbFPaFyrOrgtfDH01Szwu9LcCALGb8orC5
 Hg3IY8mpNG3Q1T8wEtTa56Ik4b5ZFty35S5+X9qLNSFoDUqSvGlSsLzhPNN7f2tl
 XFm2hWqd8wXCwDsuVSFBCF61M3SAm+g6NMVNJ+VL2KIDCwBrOZLhKDPRoxLTAuMa
 jjSdjVIozWyXjUrBFi8HVcoOWLxcT1HsNF0tRs51LwY/+Mlj2jAtFtsx+a06HZa6
 f2p/Kcp/MEispSTk064Ap9cC1seXWI18zwZKpCUFqu0Ec2yTAiGdjOWDyYQldIp+
 ttVPSHQ01YrVKwDFTtM9CiA0EET6fVPhWgAPkPfvH5TvtKwGkNdy0b+nQLuWrYip
 BUmOXmjdIG3nujCzA9sv6/uNNhjhj2y+HWwuV7Qo002VFkhgZFL67u2SSUQLpYPj
 PxdkY8pPVq+O+in94oDV3c36dYFF6+g6A6505Vn6eKUm/TLpszRFGkS3bKKA5vtn
 74FR+guV/5RwYJcdZbfm04DgAocl7AfUDxpwRxibt6KtAK2VZKQuw4ugUTgYEd7W
 mL2+AMmPKuajWXAMTHjCZPbUp9gFNyYyBQTFfGVX/XLiM8erKBnGfoa1/KzUJkhr
 fVZLYIO/gzl34PiTIfgD
 =UJtt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Nothing really major this release, despite quite a lot of activity.
  Just lots of things all over the place.

  Some things of note include:

   - Access via perf to a new type of PMU (IMC) on Power9, which can
     count both core events as well as nest unit events (Memory
     controller etc).

   - Optimisations to the radix MMU TLB flushing, mostly to avoid
     unnecessary Page Walk Cache (PWC) flushes when the structure of the
     tree is not changing.

   - Reworks/cleanups of do_page_fault() to modernise it and bring it
     closer to other architectures where possible.

   - Rework of our page table walking so that THP updates only need to
     send IPIs to CPUs where the affected mm has run, rather than all
     CPUs.

   - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
     systems. This avoids problems with the percpu allocator on systems
     with very sparse NUMA layouts.

   - STRICT_KERNEL_RWX support on PPC32.

   - A new sched domain topology for Power9, to capture the fact that
     pairs of cores may share an L2 cache.

   - Power9 support for VAS, which is a new mechanism for accessing
     coprocessors, and initial support for using it with the NX
     compression accelerator.

   - Major work on the instruction emulation support, adding support for
     many new instructions, and reworking it so it can be used to
     implement the emulation needed to fixup alignment faults.

   - Support for guests under PowerVM to use the Power9 XIVE interrupt
     controller.

  And probably that many things again that are almost as interesting,
  but I had to keep the list short. Plus the usual fixes and cleanups as
  always.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
  Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
  Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
  Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
  Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
  Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
  LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
  Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
  Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
  Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
  Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
  Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

* tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
  powerpc/xive: Fix section __init warning
  powerpc: Fix kernel crash in emulation of vector loads and stores
  powerpc/xive: improve debugging macros
  powerpc/xive: add XIVE Exploitation Mode to CAS
  powerpc/xive: introduce H_INT_ESB hcall
  powerpc/xive: add the HW IRQ number under xive_irq_data
  powerpc/xive: introduce xive_esb_write()
  powerpc/xive: rename xive_poke_esb() in xive_esb_read()
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: introduce a common routine xive_queue_page_alloc()
  powerpc/sstep: Avoid used uninitialized error
  axonram: Return directly after a failed kzalloc() in axon_ram_probe()
  axonram: Improve a size determination in axon_ram_probe()
  axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
  powerpc/powernv/npu: Move tlb flush before launching ATSD
  powerpc/macintosh: constify wf_sensor_ops structures
  powerpc/iommu: Use permission-specific DEVICE_ATTR variants
  powerpc/eeh: Delete an error out of memory message at init time
  powerpc/mm: Use seq_putc() in two functions
  macintosh: Convert to using %pOF instead of full_name
  ...
2017-09-07 10:15:40 -07:00
Alistair Popple
bab9f954aa powerpc/powernv/npu: Move tlb flush before launching ATSD
The nest MMU tlb flush needs to happen before the GPU translation
shootdown is launched to avoid the GPU refilling its tlb with stale
nmmu translations prior to the nmmu flush completing.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:42:55 +10:00
Jérôme Glisse
d1d5762e47 powerpc/powernv: update to new mmu_notifier semantic
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and now are bracketed by calls to
mmu_notifier_invalidate_range_start()/end()

Remove now useless invalidate_page callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:12:59 -07:00
Alistair Popple
bbd5ff50af powerpc/powernv/npu-dma: Add explicit flush when sending an ATSD
NPU2 requires an extra explicit flush to an active GPU PID when
sending address translation shoot downs (ATSDs) to reliably flush the
GPU TLB. This patch adds just such a flush at the end of each sequence
of ATSDs.

We can safely use PID 0 which is always reserved and active on the
GPU. PID 0 is only used for init_mm which will never be a user mm on
the GPU. To enforce this we add a check in pnv_npu2_init_context()
just in case someone tries to use PID 0 on the GPU.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
[mpe: Use true/false for bool literals]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-22 21:21:08 +10:00
Alistair Popple
377aa6b0ef powerpc/npu-dma: Remove spurious WARN_ON when a PCI device has no of_node
Commit 4c3b89effc ("powerpc/powernv: Add sanity checks to
pnv_pci_get_{gpu|npu}_dev") introduced explicit warnings in
pnv_pci_get_npu_dev() when a PCIe device has no associated device-tree
node. However not all PCIe devices have an of_node and
pnv_pci_get_npu_dev() gets indirectly called at least once for every
PCIe device in the system. This results in spurious WARN_ON()'s so
remove it.

The same situation should not exist for pnv_pci_get_gpu_dev() as any
NPU based PCIe device requires a device-tree node.

Fixes: 4c3b89effc ("powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-14 15:23:19 +10:00
Alistair Popple
415ba3c157 powerpc/powernv/npu-dma.c: Fix opal_npu_destroy_context() call
opal_npu_destroy_context() should be called with the NPU PHB, not the
PCIe PHB.

Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-25 23:07:39 +10:00
Alistair Popple
6b3d12a948 powerpc/powernv: Fix TCE kill on NVLink2
Commit 616badd2fb ("powerpc/powernv: Use OPAL call for TCE kill on
NVLink2") forced all TCE kills to go via the OPAL call for
NVLink2. However the PHB3 implementation of TCE kill was still being
called directly from some functions which in some circumstances caused
a machine check.

This patch adds an equivalent IODA2 version of the function which uses
the correct invalidation method depending on PHB model and changes all
external callers to use it instead.

Fixes: 616badd2fb ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 20:45:55 +10:00
Alistair Popple
1ab66d1fba powerpc/powernv: Introduce address translation services for Nvlink2
Nvlink2 supports address translation services (ATS) allowing devices
to request address translations from an mmu known as the nest MMU
which is setup to walk the CPU page tables.

To access this functionality certain firmware calls are required to
setup and manage hardware context tables in the nvlink processing unit
(NPU). The NPU also manages forwarding of TLB invalidates (known as
address translation shootdowns/ATSDs) to attached devices.

This patch exports several methods to allow device drivers to register
a process id (PASID/PID) in the hardware tables and to receive
notification of when a device should stop issuing address translation
requests (ATRs). It also adds a fault handler to allow device drivers
to demand fault pages in.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
[mpe: Fix up comment formatting, use flush_tlb_mm()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-04 13:27:26 +10:00
Alistair Popple
4c3b89effc powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev
The pnv_pci_get_{gpu|npu}_dev functions are used to find associations
between nvlink PCIe devices and standard PCIe devices. However they
lacked basic sanity checking which results in NULL pointer
dereferencing if they are incorrect called can be harder to spot than
an explicit WARN_ON.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03 23:36:00 +10:00
Bart Van Assche
5299709d0a treewide: Constify most dma_map_ops structures
Most dma_map_ops structures are never modified. Constify these
structures such that these can be write-protected. This patch
has been generated as follows:

git grep -l 'struct dma_map_ops' |
  xargs -d\\n sed -i \
    -e 's/struct dma_map_ops/const struct dma_map_ops/g' \
    -e 's/const struct dma_map_ops {/struct dma_map_ops {/g' \
    -e 's/^const struct dma_map_ops;$/struct dma_map_ops;/' \
    -e 's/const const struct dma_map_ops /const struct dma_map_ops /g';
sed -i -e 's/const \(struct dma_map_ops intel_dma_ops\)/\1/' \
  $(git grep -l 'struct dma_map_ops intel_dma_ops');
sed -i -e 's/const \(struct dma_map_ops dma_iommu_ops\)/\1/' \
  $(git grep -l 'struct dma_map_ops' | grep ^arch/powerpc);
sed -i -e '/^struct vmd_dev {$/,/^};$/ s/const \(struct dma_map_ops[[:blank:]]dma_ops;\)/\1/' \
       -e '/^static void vmd_setup_dma_ops/,/^}$/ s/const \(struct dma_map_ops \*dest\)/\1/' \
       -e 's/const \(struct dma_map_ops \*dest = \&vmd->dma_ops\)/\1/' \
    drivers/pci/host/*.c
sed -i -e '/^void __init pci_iommu_alloc(void)$/,/^}$/ s/dma_ops->/intel_dma_ops./' arch/ia64/kernel/pci-dma.c
sed -i -e 's/static const struct dma_map_ops sn_dma_ops/static struct dma_map_ops sn_dma_ops/' arch/ia64/sn/pci/pci_dma.c
sed -i -e 's/(const struct dma_map_ops \*)//' drivers/misc/mic/bus/vop_bus.c

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Russell King <linux@armlinux.org.uk>
Cc: x86@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 12:23:35 -05:00
Russell Currey
1f52f17614 powerpc/pci: Always print PHB and PE numbers as hexadecimal
PHB, PE (and by association MVE) numbers are printed as a mix of decimal
and hexadecimal throughout the kernel.  This can be misleading, so make
them all hexadecimal.

Standardising on hex instead of dec because:

 - PHB numbers are presented in hex in sysfs/debugfs (and lspci, etc)
 - PE numbers are presented as hex in sysfs and parsed in hex in debugfs

The only place I think this could cause confusing are the messages during
boot, i.e.

	pci 000a:01     : [PE# 000] Secondary bus 1 associated with PE#0

which can be a quick way to check PE numbers.  pe_level_printk() will
only print two characters instead of three, so the above would be

	pci 000a:01     : [PE# 00] Secondary bus 1 associated with PE#0

which gives a hint it's in hex.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-22 11:57:07 +11:00
Daniel Axtens
7c98bd7208 powerpc/sparse: Make a bunch of things static
Squash a bunch of sparse warnings by making things static.

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:35:47 +10:00
Krzysztof Kozlowski
00085f1efa dma-mapping: use unsigned long for dma_attrs
The dma-mapping core and the implementations do not change the DMA
attributes passed by pointer.  Thus the pointer can point to const data.
However the attributes do not have to be a bitfield.  Instead unsigned
long will do fine:

1. This is just simpler.  Both in terms of reading the code and setting
   attributes.  Instead of initializing local attributes on the stack
   and passing pointer to it to dma_set_attr(), just set the bits.

2. It brings safeness and checking for const correctness because the
   attributes are passed by value.

Semantic patches for this change (at least most of them):

    virtual patch
    virtual context

    @r@
    identifier f, attrs;

    @@
    f(...,
    - struct dma_attrs *attrs
    + unsigned long attrs
    , ...)
    {
    ...
    }

    @@
    identifier r.f;
    @@
    f(...,
    - NULL
    + 0
     )

and

    // Options: --all-includes
    virtual patch
    virtual context

    @r@
    identifier f, attrs;
    type t;

    @@
    t f(..., struct dma_attrs *attrs);

    @@
    identifier r.f;
    @@
    f(...,
    - NULL
    + 0
     )

Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Acked-by: Mark Salter <msalter@redhat.com> [c6x]
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 08:50:07 -04:00
Benjamin Herrenschmidt
a34ab7c328 powerpc/powernv/pci: Rename TCE invalidation calls
The TCE invalidation functions are fairly implementation specific,
and while the IODA specs more/less describe the register, in practice
various implementation workarounds may be required. So name the
functions after the target PHB.

Note today and for the foreseeable future, there's a 1:1 relationship
between an IODA version and a PHB implementation. There exist another
variant of IODA1 (Torrent) but we never supported in with OPAL and
never will.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:46 +10:00
Alexey Kardashevskiy
b5cb9ab1a0 powerpc/powernv/npu: Enable NVLink pass through
IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of fast speed links (NVLink). The interface to links
is exposed as an emulated PCI bridge which is included into the same
IOMMU group as the corresponding GPU.

In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behave pretty much as the standard IODA2 PHB except NPU PHB has
just a single TVE in the hardware which means it can have either
32bit window or 64bit window or DMA bypass but never two of these.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.

This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.

This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
table_group, it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops.

This does not define NPU-specific .release_ownership() so after
VFIO is finished, DMA on NPU is disabled which is ok as the nvidia
driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU.

This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any found. The helper uses helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

As pnv_npu_set_window() and pnv_npu_unset_window() are started being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds the DMA window number parameter.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
[mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy
85674868ce powerpc/powernv/npu: Rework TCE Kill handling
The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previos patch by removing use of it as DMA setup is not what
the kernel would constantly do.

2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.

This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.

While we are here, add TCE cache invalidation after enabling bypass.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy
b575c731fe powerpc/powernv/npu: Add set/unset window helpers
The upcoming NVLink passthrough support will require NPU code to cope
with two DMA windows.

This adds a pnv_npu_set_window() helper which programs 32bit window to
the hardware. This also adds multilevel TCE support.

This adds a pnv_npu_unset_window() helper which removes the DMA window
from the hardware. This does not make difference now as the caller -
pnv_npu_dma_set_bypass() - enables bypass in the hardware but the next
patch will use it to manage TCE table lists for TCE Kill handling.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:30 +10:00