For messages sent from the host to fw that solicit a reply from fw,
the c4iw_wr_wait struct pointer is passed in the host->fw message, and
included in the fw->host fw6_msg reply. This allows the sender to wait
until the reply is received, and the code processing the ingress reply
to wake up the sender.
If c4iw_wait_for_reply() times out, however, we need to keep the
c4iw_wr_wait object around in case the reply eventually does arrive.
Otherwise we have touch-after-free bugs in the wake_up paths.
This was hit due to a bad kernel driver that blocked ingress processing
of cxgb4 for a long time, causing iw_cxgb4 timeouts, but eventually
resuming ingress processing and thus hitting the touch-after-free bug.
So I want to fix iw_cxgb4 such that we'll at least keep the wait object
around until the reply comes. If it never comes we leak a small amount of
memory, but if it does come late, we won't potentially crash the system.
So add a kref struct in the c4iw_wr_wait struct, and take a reference
before sending a message to FW that will generate a FW6 reply. And remove
the reference (and potentially free the wait object) when the reply
is processed.
The ep code also uses the wr_wait for non FW6 CPL messages and doesn't
embed the c4iw_wr_wait object in the message sent to firmware. So for
those cases we add c4iw_wake_up_noref().
The mr/mw, cq, and qp object create/destroy paths do need this reference
logic. For these paths, c4iw_ref_send_wait() is introduced to take the
wr_wait reference, send the msg to fw, and then wait for the reply.
So going forward, iw_cxgb4 either uses c4iw_ofld_send(),
c4iw_wait_for_reply() and c4iw_wake_up_noref() like is done in the some
of the endpoint logic, or c4iw_ref_send_wait() and c4iw_wake_up_deref()
(formerly c4iw_wake_up()) when sending messages with the c4iw_wr_wait
object pointer embedded in the message and resulting FW6 reply.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove the local stack allocated c4iw_wr_wait object in preparation for
correctly handling timeouts.
Also refactored some code to simplify it and make errpath unwinding
more readable.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
pr_debug() can be enabled to print function names, So removing the
unwanted __func__ parameters from debug logs.
Realign function parameters.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
which actually doesn't cover pskb_put since there are only three
users overall.
A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Size of pages are held by struct ib_umem in page_size field.
It is better to store it as an exponent, because page size by nature
is always power-of-two and used as a factor, divisor or ilog2's argument.
The conversion of page_size to be page_shift allows to have portable
code and avoid following error while compiling on ARM:
ERROR: "__aeabi_uldivmod" [drivers/infiniband/core/ib_core.ko] undefined!
CC: Selvin Xavier <selvin.xavier@broadcom.com>
CC: Steve Wise <swise@chelsio.com>
CC: Lijun Ou <oulijun@huawei.com>
CC: Shiraz Saleem <shiraz.saleem@intel.com>
CC: Adit Ranadive <aditr@vmware.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Ram Amrani <Ram.Amrani@Cavium.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Ram Amrani <Ram.Amrani@cavium.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enable the use of dsgl by default and determine whether dsgl is
supported from lld info.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Bharat Potnuri <bharat@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use a more typical logging style.
Miscellanea:
o Obsolete the c4iw_debug module parameter
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Convert printks to pr_<level>
Miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Also, rearrange things a bit to have a common c4iw_invalidate_mr()
function used everywhere that we need to invalidate.
Fixes: 49b53a93a6 ("iw_cxgb4: add fast-path for small REG_MR operations")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When processing a REG_MR work request, if fw supports the
FW_RI_NSMR_TPTE_WR work request, and if the page list for this
registration is <= 2 pages, and the current state of the mr is INVALID,
then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
to write. This avoids FW having to do an async read of the TPTE blocking
the SQ until the read completes.
To know if the current MR state is INVALID or not, iw_cxgb4 must track the
state of each fastreg MR. The c4iw_mr struct state is updated as REG_MR
and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
and when RECV completions are processed that include a local invalidation.
This optimization increases small IO IOPS for both iSER and NVMF.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The commit 0f8ab0b6e9 ("RDMA/iw_cxgb4: Low resource fixes for Memory
registration") from Jun 10, 2016, leads to the following static checker
warning:
drivers/infiniband/hw/cxgb4/mem.c:612 c4iw_alloc_mw()
error: use kfree_skb() here instead of kfree(mhp->dereg_skb)
Also fixes skb leak in c4iw_dealloc_mw
Fixes: 0f8ab0b6e9 ("RDMA/iw_cxgb4: Low resource fixes for Memory registration")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pre-allocate buffers for deregistering memory region and memory window
during RDMA connection close, when system is running out of memory.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The FID value in a ULP_MEMIO command needs to be set to an IQ ID of
a queue configured for our PF. The FID/IQ id is used to index into the
PCIE FID table, to find out on which function the DMA needs to be
issued. Essentially, every DMA needs to have the ingress queue. The exact
ingress queue doesn't matter, but it needs to be an ingress queue
associated with the function you want to see the DMA on.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The SRP initiator allows to set max_sectors to a value that exceeds
the largest amount of data that can be mapped at once with an mlx4
HCA using fast registration and a page size of 4 KB. Hence modify
ib_map_mr_sg() such that it can map partial sg-elements. If an
sg-element has been mapped partially, let the caller know
which fraction has been mapped by adjusting *sg_offset.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Passing udata to the vendor's driver in order to pass data from the
user-space driver to the kernel-space driver. This data will be
used in downstream patches.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The max depth of a fastreg mr depends on whether the device supports
DSGL or not. So compute it dynamically based on the device support and
the module use_dsgl option.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We have stopped using phys MRs in the kernel a while ago, so let's
remove all the cruft used to implement them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> [core]
Reviewed-By: Devesh Sharma<devesh.sharma@avagotech.com> [ocrdma]
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No ULP uses it anymore, go ahead and remove it.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Support the new memory registration API by allocating a
private page list array in c4iw_mr and populate it when
c4iw_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating build_fastreg just take the needed information
from different places:
- page_size, iova, length (ib_mr)
- page array (c4iw_mr)
- key, access flags (ib_reg_wr)
The IB_WR_FAST_REG_MR handlers will be removed later when
all the ULPs will be converted.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Casting a pointer to __be64 produces a warning on 32-bit architectures:
drivers/infiniband/hw/cxgb4/mem.c:147:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
req->wr.wr_lo = (__force __be64)&wr_wait;
This was fixed at least twice for this driver in different places,
and accidentally reverted once more. This puts the correct version
back in place.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 6198dd8d7a ("iw_cxgb4: 32b platform fixes")
Signed-off-by: Doug Ledford <dledford@redhat.com>
- get_dma_mr() was using ~0UL which is should be ~0ULL. This causes the
DMA MR to get setup incorrectly in hardware.
- wr_log_show() needed a 64b divide function div64_u64() instead of
doing
division directly.
- fixed warnings about recasting a pointer to a u64
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
set_filter_wr is requesting __GFP_NOFAIL allocation although it can return
ENOMEM without any problems obviously (t4_l2t_set_switching does that
already). So the non-failing requirement is too strong without any
obvious reason. Drop __GFP_NOFAIL and reorganize the code to have the
failure paths easier.
The same applies to _c4iw_write_mem_dma_aligned which uses __GFP_NOFAIL
and then checks the return value and returns -ENOMEM on failure. This
doesn't make any sense what so ever. Either the allocation cannot fail or
it can.
del_filter_wr seems to be safe as well because the filter entry is not
marked as pending and the return value is propagated up the stack up to
c4iw_destroy_listen.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup all the MACROS that are defined in t4fw_ri_api.h and affected files
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleanups all other macros/register define related to
CPL messages that are defined in t4_msg.h and the affected files
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
T4/T5 hardware can't handle MRs >= 8GB due to a hardware bug. So limit
registrations to < 8GB for thse devices.
Based on original work by Steve Wise <swise@opengridcomputing.com>.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
0B MRs need some tweaks to work correctly with HW. When writing the
TPTE, if the MR length is zero we now:
1) turn off all permissions
2) set the length to -1
While functionality/capabilities of the MR are the same with these
changes, it resolves a dapltest 0B RDMA Read test failure. Based on
original work by Steve Wise <swise@opengridcomputing.com>.
Signed-off-by: Pramod Kumar <pramod@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Refactored all macros used in cxgb4i as part of previously started cxgb4 macro
names cleanup. Makes them more uniform and avoids namespace collision.
Minor changes in other drivers where required as some of these macros are used
by multiple drivers, affected drivers are iw_cxgb4, cxgb4(vf) & csiostor
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various patches have ended up changing the style of the symbolic macros/register
defines to different style.
As a result, the current kernel.org files are a mix of different macro styles.
Since this macro/register defines is used by different drivers a
few patch series have ended up adding duplicate macro/register define entries
with different styles. This makes these register define/macro files a complete
mess and we want to make them clean and consistent. This patch cleans up a part
of it.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current hardware doesn't correctly support DSGL.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We cannot save the mapped length using the rdma max_page_list_len field
of the ib_fast_reg_page_list struct because the core code uses it. This
results in an incorrect unmap of the page list in c4iw_free_fastreg_pbl().
I found this with dma mapping debugging enabled in the kernel. The
fix is to save the length in the c4iw_fr_page_list struct.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch refactors the IB core umem code and vendor drivers to use a
linear (chained) SG table instead of chunk list. With this change the
relevant code becomes clearer—no need for nested loops to build and
use umem.
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Building mem.o for 32 bits x86 triggers a GCC warning:
drivers/infiniband/hw/cxgb4/mem.c: In function '_c4iw_write_mem_dma_aligned':
drivers/infiniband/hw/cxgb4/mem.c:79:25: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Silence that warning by casting "&wr_wait" to unsigned long before
casting it to __be64. That's what _c4iw_write_mem_inline() already does.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch marks the function _c4iw_write_mem_dma() as static
because it is not used outside this file, which fixes the warning:
drivers/infiniband/hw/cxgb4/mem.c:176:5: warning: no previous prototype for ‘_c4iw_write_mem_dma’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
It enables direct DMA by HW to memory region PBL arrays and fast register PBL
arrays from host memory, vs the T4 way of passing these arrays in the WR itself.
The result is lower latency for memory registration, and larger PBL array
support for fast register operations.
This patch also updates ULP_TX_MEM_WRITE command fields for T5. Ordering bit of
ULP_TX_MEM_WRITE is at bit position 22 in T5 and at 23 in T4.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enhances the IB core support for Memory Windows (MWs).
MWs allow an application to have better/flexible control over remote
access to memory.
Two types of MWs are supported, with the second type having two flavors:
Type 1 - associated with PD only
Type 2A - associated with QPN only
Type 2B - associated with PD and QPN
Applications can allocate a MW once, and then repeatedly bind the MW
to different ranges in MRs that are associated to the same PD. Type 1
windows are bound through a verb, while type 2 windows are bound by
posting a work request.
The 32-bit memory key is composed of a 24-bit index and an 8-bit
key. The key is changed with each bind, thus allowing more control
over the peer's use of the memory key.
The changes introduced are the following:
* add memory window type enum and a corresponding parameter to ib_alloc_mw.
* type 2 memory window bind work request support.
* create a struct that contains the common part of the bind verb struct
ibv_mw_bind and the bind work request into a single struct.
* add the ib_inc_rkey helper function to advance the tag part of an rkey.
Consumer interface details:
* new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
for these features.
Devices can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
IB_DEVICE_MEM_WINDOW_TYPE_2B if it supports type 2A or type 2B
memory windows. It can set neither to indicate it doesn't support
type 2 windows at all.
* modify existing provides and consumers code to the new param of
ib_alloc_mw and the ib_mw_bind_info structure
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the error path of registering memory when there's a failure to
allocate a chunk from the memory pool, we try to free the same chunk
we just failed to allocate, which will BUG().
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Using kfifos for ID management was limiting the number of QPs and
preventing NP384 MPI jobs. So replace it with a simple bitmap
allocator.
Remove IDs from the IDR tables before deallocating them. This bug was
causing the BUG_ON() in insert_handle() to fire because the ID was
getting reused before being removed from the IDR table.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove left-over code from T3 that limited MR sizes to 32b.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>