The coming nldev needs to iterate over all IB devices in the system
and in order to not expose the ib_devices list outside the devices.c,
it is necessary to provide function iterator.
Current version is written explicitly for nldev callback to avoid
over-engineering at this stage, but it can be easily extended for
other types.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The RDMA netlink client infrastructure was removed and made obsolete.
The old infrastructure defined struct ibnl_client_cbs. Now that all
uses of this have been updated to the new infrastructure, rename the
struct to be compliant with the current stack naming standards:
struct rdma_nl_cbs.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Make ibnl_chk_listeners function to be one line by removing
unneeded comparison.
Rename that function to be complaint to other functions in RDMA netlink.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The pointer to netlink header was not used in the ibnl_multicast
function, so let's remove it and simplify the function
signature.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Netlink message header is not needed for unicast reply, hence remove it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reuse standard macros to cancel the netlink message
in case of error.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Add ability to provide flags to control RDMA netlink callbacks
and convert addr.c and sa_query.c to be first users of such
infrastructure. It allows to move their CAP_NET_ADMIN checks
into netlink core.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The iwcm exports functions which are not used outside of ib_core.
This patch simply removes these EXPORT_SYMBOLS.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chien Tin Tung <chien.tin.tung@intel.com>
RDMA netlink implementation guarantees that supplied
client number is in allowed range.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chien Tin Tung <chien.tin.tung@intel.com>
The standard netlink_rcv_skb function skips messages without
NLM_F_REQUEST flag in it, while SA netlink client issues them.
In commit bc10ed7d3d ("IB/core: Add rdma netlink helper functions")
the local function was introduced to allow such messages.
This led to double pass for every incoming message.
In this patch, we unify that local implementation and netlink_rcv_skb
functions, so there will be no need for double pass anymore.
As a outcome, this combined function gained more strict check
for NLM_F_REQUEST flag and it is now allowed for SA pathquery
client only.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Owner field is not needed to be set because netlink is part of ib_core
which will be unloaded last after all other modules are unloaded.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
RDMA netlink has a complicated infrastructure for dynamically
registering and de-registering netlink clients to the NETLINK_RDMA
group. The complicated portion of this code is not widely used because
2 of the 3 current clients are statically compiled together with
netlink.c. The infrastructure, therefore, is deemed overkill.
Refactor the code to eliminate the dynamically added clients. Now all
clients are pre-registered in a client array at compile time, and at run
time they merely check-in with the infrastructure to pass their callback
table for inclusion in the pre-sized client array.
This also allows for future cleanups and removal of unneeded code in the
iwcm* netlink handler.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chien Tin Tung <chien.tin.tung@intel.com>
Add a wait/retry version of ibnl_unicast, ibnl_unicast_wait,
and modify ibnl_unicast to not wait/retry. This eliminates
the undesirable wait for future users of ibnl_unicast.
Change Portmapper calls originating from kernel to user-space
to use ibnl_unicast_wait and take advantage of the wait/retry
logic in netlink_unicast.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Chien Tin Tung <chien.tin.tung@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Always initiate an offline transition request
when a link down occurs. The firmware will
use this request to confirm that the driver
has seen the link down message. A host version
is set to indicate this driver behavior to the
firmware.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When link interrupts occur, multiple link down requests
could be queued up when only one is needed. This could get
the hfi1 out of sync with its link partner during LNI.
Only allow one link down request to be queued at any one time.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, link down interrupts queue link entries
on a workqueue intended for sending events only.
Create a workqueue for queuing link events.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The error messages when checksum validation of the platform
configuration fields populated into the ASIC scratch registers fails are
ambiguous. Disambiguate them.
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The allocate_ctxt() function adds the context to the fd data structure.
Since the context is not completely initialized, this can cause confusion
as to whether the context is valid or not.
Move the fd reference from allocate_ctxt() to setup_base_ctxt().
Update the necessary functions to be aware of this move.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix issue where a disabled port can be enabled by
inserting a cable. The port should be explicitly
enabled instead.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a window that allows other threads to read state of
'host_link_state' as a new, before the hardware actual state is set.
This patch closes the window by indicating a new state only after
hardware transition is complete.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A copy_to_user() call assumes that two members of a data structure
are sequential. Since this may not always be true, separate the copies
to ensure a safe copy.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a window where the FM can read the buffer control table
and decide not to program buffers. When a port goes down, the code
clears the table and if it is not programmed, posted SDMA descriptors
will never complete due to no buffer credits.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During PCIe initialization some registers' values from
PCI config space are saved in order to restore them later
(i.e. after reset). Restoring those value is done by a
function called restore_pci_variables, while saving them
is put directly into function hfi1_pcie_ddinit.
Move saving values to a separate function in the image
of restoring functionality.
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Bartlomiej Dudek <bartlomiej.dudek@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Loading debug signed firmware fails if started immediately after
failed attempt to load production firmware. A short delay is
required so add about a 100us delay after RSA check failure.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Code structure is not consistent for if/else blocks and break
instructions in set_link_state for case HLS_UP_INIT. Physical
state uses break in case of an error and if/else blocks for
logical use cases. These blocks should be implemented consistently.
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A trap should be sent to the FM until the FM sends a repress message.
This is in line with the IBTA 13.4.9.
Add the ability to resend traps until a repress message is received.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael N. Henry <michael.n.henry@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The hfi1_rcvctrl() function receives an index which it then converts
to an rcd. Since most functions have the rcd, use that instead.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The hfi1_<set|clear>_ctxt_<j|p>key functions take a context index and
look up the context based on that index.
Since the context index is being retrieved from the context, this
doesn't seem optimal.
Pass the context pointer for use, rather than the context index.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The array index for the rcd array is sized several different ways
throughout the code.
Use the user interface size (u16) as the standard size and update the
necessary code to reflect this.
u16 is large enough for the largest amount of supported contexts.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Several data members of the user context have become unused over time.
Cleaning them up.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the error path for context allocation, the file descriptor pointer
should not point to a context when an error occurs.
Clean up the appropriate references on error.
Fixes: Commit 62239fc6e5 ("IB/hfi1: Clean up on context initialization failure")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When an egress resource(SDMA descriptors, pio credits) is not available,
a sending thread will be put on the resource's wait queue. When the
resource becomes available again, up to a fixed number of sending threads
can be awakened sequentially and removed from the wait queue, depending
on the number of waiting threads and the number of free resources. Since
each awakened sending thread will send as many packets as possible, it
is highly likely that the first sending thread will consume all the
egress resources. Subsequently, it will be put back to the end of the wait
queue. Depending on the timing when the later sending threads wake up,
they may not be able to send any packet and be again put back to the end
of the wait queue sequentially, right behind the first sending thread.
This starvation cycle continues until some sending threads exceed their
retry limit and consequently fail.
This patch fixes the issue by two simple approaches:
(1) Any starved sending thread will be put to the head of the wait queue
while a served sending thread will be put to the tail;
(2) The most starved sending thread will be served first.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When the debugpat kernel boot flag is turned on the following
traces are printed:
[ 1884.793168] x86/PAT: Overlap at 0x90000000-0x92000000
[ 1884.803510] x86/PAT: reserve_memtype added [mem 0x91200000-0x9127ffff],
track uncached-minus, req write-combining, ret uncached-minus
[ 1884.818167] hfi1 0000:05:00.0: hfi1_0: WC Remapped RcvArray:
ffffc9000a980000
The ioremap_wc() clearly is not returning a write combining mapping due
to an overlap where the RcvArray is mapped in a uncached mapping prior
to creating the proposed write combining mapping.
The patch replaces the single base register for uncached CSRs that
used to overlap the RcvArray with two mappings. One, kregbase1, from the
bar0 up to the RcvArray and another, kregbase2, from the end of the
RcvArray to the pio send buffer space. A new dd field, base2_start,
is used to convert the zero-based offset in the CSR routines to the
correct kregbase1/kregbase2 mapping. A single direct write of the
RcvArray CSRs is replaced with hfi1_put_tid() to insure correct access
using the new disjoint mapping.
Additionally, the kregend field is deleted since it is only ever written.
patdebug now shows the RcvArray as write combining:
[ 35.688990] x86/PAT: reserve_memtype added [mem 0x91200000-0x9127ffff],
track write-combining, req write-combining, ret write-combining
To insulate from any potential issues with write combining, all
writeq are now flushed in hfi1_put_tid() and rcv_array_wc_fill().
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ensure that return values from kernel PCI config access
API calls in HFI driver are checked and react properly if
they are not expected (i.e. not successful).
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Bartlomiej Dudek <bartlomiej.dudek@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
I ran into this build error on linux-next:
drivers/infiniband/hw/hns/hns_roce_eq.c:477:8: error: unknown type name 'irqreturn_t'
static irqreturn_t hns_roce_msi_x_interrupt(int irq, void *eq_ptr)
^~~~~~~~~~~
drivers/infiniband/hw/hns/hns_roce_eq.c: In function 'hns_roce_msi_x_interrupt':
drivers/infiniband/hw/hns/hns_roce_eq.c:485:9: error: implicit declaration of function 'IRQ_RETVAL'; did you mean 'BPF_RVAL'? [-Werror=implicit-function-declaration]
return IRQ_RETVAL(int_work);
I have bisected this to a seemingly unrelated change that happened
to remove some indirect header inclusions. Simply including the
required header explicitly fixes the build failure.
Fixes: 09c7570480 ("xfrm: remove flow cache")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The 0day build system flags implicit includes as errors. A patch from
Matan Barak to allow hns_roce, an aarch64 specific RDMA driver, to be
built on other arches, but it resulted in build regressions. The
problem is that hns_roce_device.h needs a definition for __raw_writeq
but did not have an include to provide it. Add <linux/io.h> as an
include to resolve the issue.
Signed-off-by: Doug Ledford <dledford@redhat.com>
drivers/infiniband/hw/hns/hns_roce_hw_v1.c:2026:5-8: Unneeded variable: "ret". Return "0" on line 2046
Remove unneeded variable used to store return value.
Generated by: scripts/coccinelle/misc/returnvar.cocci
Fixes: 7d1b6a678e0b ("IB/hns: Support compile test for hns RoCE driver")
CC: Matan Barak <matanb@mellanox.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
drivers/infiniband/hw/hns/hns_roce_qp.c:802:9-10: WARNING: return of 0/1 in function 'hns_roce_wq_overflow' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes: 7d1b6a678e0b ("IB/hns: Support compile test for hns RoCE driver")
CC: Matan Barak <matanb@mellanox.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Compiling the hns RoCE driver requires ARM architecture.
In order to simplify development of IB/core, support
compile test. Add the necessary includes for that too.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The initial patch for changing the stack to use RoCEv2 GIDs by default
set the CMA_PREFERRED_ROCE_GID_TYPE to an incorrect value. Instead of
an absolute value, we needed to set the right bit in a bitmask. Correct
the default setting so we use RoCEv2 by default.
Fixes: 63a5f483af (IB/cma: Set default gid type to RoCEv2)
Signed-off-by: Doug Ledford <dledford@redhat.com>
The number of supported WIDs, if they are supported at all, can be
limited due to resources. Notifying the user space application the
number of available WIDs allows it to utilize them correctly.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Direct Packet Mode support may be disabled, e.g, due to limited
resources. Notifying the user application prevents wasting cycles
on attempting to send these kind of packets.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Trivial fix to spelling mistake in mlx5_ib_dbg message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds qp->s_ack_queue indices and size to qp_stats printout.
This information will provide information about the receiving side.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As a final step to support RSS feature, expose the RSS capabilities in
query device verb.
It includes:
- Max rwq indirection tables.
- Max rwq indirection table size.
- Supported qp types.
Signed-off-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to work with a RSS QP by using an indirection table object
upon QP creation. Other related QP verbs (e.g. modify/destroy/query) were
updated as well for that QP mode.
Notes:
- The RX hash properties are supplied as driver private data.
- The RSS QP port is used on the associated WQs in its indirection
table. Applying different ports during WQ life time is not allowed.
- The expected RSS QP flow is: create, modify(RST->INIT),
modify(RST->RTR), destroy.
Signed-off-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>