Provide the device interface when resolving route information to
ensure that the correct outbound device is used. This will also
simplify processing of sin6_scope_id for IPv6 support.
Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthrope@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export rdma_set_ib_paths to user space to allow applications to
manually set the IB path used for connections. This allows
alternative ways for a user space application or library to obtain
path record information, including retrieving path information
from cached data, avoiding direct interaction with the IB SA.
The IB SA is a single, centralized entity that can limit scaling
on large clusters running MPI applications.
Future changes to the rdma cm can expand on this framework to
support the full range of features allowed by the IB CM, such as
separate forward and reverse paths and APM.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits)
ixgbe: Allow Priority Flow Control settings to survive a device reset
net: core: remove unneeded include in net/core/utils.c.
e1000e: update version number
e1000e: fix close interrupt race
e1000e: fix loss of multicast packets
e1000e: commonize tx cleanup routine to match e1000 & igb
netfilter: fix nf_logger name in ebt_ulog.
netfilter: fix warning in ebt_ulog init function.
netfilter: fix warning about invalid const usage
e1000: fix close race with interrupt
e1000: cleanup clean_tx_irq routine so that it completely cleans ring
e1000: fix tx hang detect logic and address dma mapping issues
bridge: bad error handling when adding invalid ether address
bonding: select current active slave when enslaving device for mode tlb and alb
gianfar: reallocate skb when headroom is not enough for fcb
Bump release date to 25Mar2009 and version to 0.22
r6040: Fix second PHY address
qeth: fix wait_event_timeout handling
qeth: check for completion of a running recovery
qeth: unregister MAC addresses during recovery.
...
Manually fixed up conflicts in:
drivers/infiniband/hw/cxgb3/cxio_hal.h
drivers/infiniband/hw/nes/nes_nic.c
Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime. In
the 8-bit field of the RMPP header, the first 5 bits are RRespTime and
next 3 bits are RMPPFlags. Hence to retain the first 5 bits, the mask
should be 0xF8 instead of 0xF1.
ack_recv()-->format_ack() calls ib_set_rmpp_flags() and due to the
incorrect ANDing with 0xF1, RRespTime got changed incorrectly and RMPP
Acks sent back always had a RRespTime of 0x1E (30), which caused the
other end to consider the timeouts to be approximately 4297 seconds
(i.e. on the order of 4*2^30 microseconds) instead of the usual ~4
seconds (on the order of 4*2^20 microseconds).
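The effect of the two masks can be checked with a small standalone sketch. The helper names below are hypothetical and only mirror the masking step described above; they are not the kernel's ib_set_rmpp_flags() itself.

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for the masking step described above.
 * The byte holds RRespTime in bits 7..3 and RMPPFlags in bits 2..0.
 */
static inline uint8_t set_flags_buggy(uint8_t rtime_flags, uint8_t flags)
{
	/* 0xF1 clears bit 3 (the low RRespTime bit) and keeps a stale flag bit 0 */
	return (uint8_t)((rtime_flags & 0xF1) | (flags & 0x7));
}

static inline uint8_t set_flags_fixed(uint8_t rtime_flags, uint8_t flags)
{
	/* 0xF8 keeps all five RRespTime bits and replaces the three flag bits */
	return (uint8_t)((rtime_flags & 0xF8) | (flags & 0x7));
}

/* Extract the 5-bit RRespTime field from the combined byte. */
static inline uint8_t get_resp_time(uint8_t rtime_flags)
{
	return rtime_flags >> 3;
}
```

With an incoming RRespTime of 0x1F, the 0xF1 mask yields 0x1E, while the 0xF8 mask preserves 0x1F.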
Signed-off-by: Ramachandra K <ramachandra.kuchimanchi@qlogic.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Base versions handle constant folding now. For headers exposed to
userspace, we must only expose the __ prefixed versions.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The base versions handle constant folding just fine, use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There are a few places where the RDMA CM code handles IPv6 by doing
    struct sockaddr addr;
    u8 pad[sizeof(struct sockaddr_in6) -
           sizeof(struct sockaddr)];
This is fragile and ugly; handle this in a better way with just
    struct sockaddr_storage addr;
[ Also roll in patch from Aleksey Senin <alekseys@voltaire.com> to
switch to struct sockaddr_storage and get rid of padding arrays in
struct rdma_addr. ]
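As a userspace analogy (not the RDMA CM code itself), the sketch below shows why struct sockaddr_storage removes the need for a padding array. The addr_holder type and holder_set_in6() helper are hypothetical names used only for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical holder: struct sockaddr_storage is large enough for
 * any address family, so no manual padding array is needed.
 */
struct addr_holder {
	struct sockaddr_storage addr;
};

/* Store an IPv6 address and port (host order); returns the family stored. */
static inline int holder_set_in6(struct addr_holder *h,
				 const struct in6_addr *ip,
				 unsigned short port)
{
	struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)&h->addr;

	memset(&h->addr, 0, sizeof(h->addr));
	sin6->sin6_family = AF_INET6;
	sin6->sin6_port = htons(port);
	sin6->sin6_addr = *ip;
	return h->addr.ss_family;
}
```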
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
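The lookup rule described above can be sketched as follows. The structures are simplified, hypothetical stand-ins for the kernel types, and get_dma_ops() is an illustrative name, not necessarily the helper used by the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct dma_mapping_ops {
	const char *name;
};

struct dev_archdata {
	struct dma_mapping_ops *dma_ops;	/* NULL means "use the global ops" */
};

struct device {
	struct dev_archdata archdata;
};

/* System-wide default, used when a device has no ops of its own. */
static struct dma_mapping_ops default_dma_ops = { "system-wide" };
static struct dma_mapping_ops *dma_ops = &default_dma_ops;

/* Pick the per-device ops if set, else fall back to the global pointer. */
static inline struct dma_mapping_ops *get_dma_ops(const struct device *dev)
{
	if (!dev || !dev->archdata.dma_ops)
		return dma_ops;
	return dev->archdata.dma_ops;
}
```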
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architectures.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use different
dma_mapping_error functions, so x86 needs the device argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consumers that want to re-use their QPs in new connections need to
know when the QP has exited the timewait state. Report the timewait
event through the rdma_cm.
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an RDMA_CM_EVENT_ADDR_CHANGE event that can be used by rdma-cm
consumers that wish to have their RDMA sessions always use the same
links (eg <hca/port>) as the IP stack does. In the current code, this
does not happen when bonding is used and a fail-over has happened but
the IB link used by an already existing session is operating fine.
Use the netevent notification for sensing that a change has happened
in the IP stack, then scan the rdma-cm ID list to see if there is an
ID that is "misaligned" with respect to the IP stack, and deliver
RDMA_CM_EVENT_ADDR_CHANGE for this ID. The consumer can act on the
event or just ignore it.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Keep a pointer to the local (src) netdevice in struct rdma_dev_addr,
and copy it in as part of rdma_copy_addr(). Use rdma_translate_ip()
in cma_new_conn_id() to reduce some code duplication and also make
sure the src_dev member gets set.
In a high-availability configuration the netdevice pointer can be used
by the RDMA CM to align RDMA sessions to use the same links as the IP
stack does under fail-over and route change cases.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
STag support and IB HCAs to indicate reserved L_Key support.
- Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
this in with the appropriate local DMA L_Key (if they support it).
- Fix up the drivers using this flag.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch also adds a creation flag for QPs,
IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK, which when set means that
multicast sends from the QP to a group that the QP is attached to will
not be looped back to the QP's receive queue. This can be used to
save receive resources when a consumer does not want a local copy of
multicast traffic; for example IPoIB must waste CPU time throwing away
such local copies of multicast traffic.
This patch also adds a device capability flag that shows whether a
device supports this feature or not.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds a sysfs attribute group called "proto_stats" under
/sys/class/infiniband/$device/ and populates this group with protocol
statistics if they exist for a given device. Currently, only iWARP
stats are defined, but the code is designed to allow InfiniBand
protocol stats if they become available. These stats are per-device
and more importantly -not- per port.
Details:
- Add union rdma_protocol_stats in ib_verbs.h. This union allows
defining transport-specific stats. Currently only iwarp stats are
defined.
- Add struct iw_protocol_stats to define the current set of iwarp
protocol stats.
- Add new ib_device method called get_proto_stats() to return protocol
statistics.
- Add logic in core/sysfs.c to create iwarp protocol stats attributes
if the device is an RNIC and has a get_proto_stats() method.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds support for the IB "base memory management extension"
(BMME) and the equivalent iWARP operations (which the iWARP verbs
mandate all devices must implement). The new operations are:
- Allocate an ib_mr for use in fast register work requests.
- Allocate/free physical buffer lists for use in fast register work
requests. This allows device drivers to allocate this memory as
needed for use in posting send requests (eg via dma_alloc_coherent).
- New send queue work requests:
* send with remote invalidate
* fast register memory region
* local invalidate memory region
* RDMA read with invalidate local memory region (iWARP only)
Consumer interface details:
- A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
to indicate device support for these features.
- New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
IB_WR_RDMA_READ_WITH_INV are added.
- A new consumer API function, ib_alloc_mr() is added to allocate
fast register memory regions.
- New consumer API functions, ib_alloc_fast_reg_page_list() and
ib_free_fast_reg_page_list() are added to allocate and free
device-specific memory for fast registration page lists.
- A new consumer API function, ib_update_fast_reg_key(), is added to
allow the key portion of the R_Key and L_Key of a fast registration
MR to be updated. Consumers call this if desired before posting
an IB_WR_FAST_REG_MR work request.
Consumers can use this as follows:
- MR is allocated with ib_alloc_mr().
- Page list memory is allocated with ib_alloc_fast_reg_page_list().
- MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
invalidate operation.
- MR is deallocated with ib_dereg_mr().
- Page lists are deallocated with ib_free_fast_reg_page_list().
Applications can allocate a fast register MR once, and then can
repeatedly bind the MR to different physical block lists (PBLs) via
posting work requests to a send queue (SQ). For each outstanding
MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
allocated (the fast_reg_page_list is owned by the low-level driver
from the consumer posting a work request until the request completes).
Thus pipelining can be achieved while still allowing device-specific
page_list processing.
The 32-bit fast register memory key/STag is composed of a 24-bit index
and an 8-bit key. The application can change the key each time it
fast registers thus allowing more control over the peer's use of the
key/STag (ie it can effectively be changed each time the rkey is
rebound to a page list).
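The key update can be sketched as a simple bit operation. This is a minimal illustration that assumes the 8-bit key occupies the low byte and the 24-bit index the upper bits; the source text does not state the bit positions, and update_fast_reg_key() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Replace the 8-bit key portion of a 32-bit L_Key/R_Key/STag.
 * Assumption: the key is the low byte and the index the upper 24 bits.
 */
static inline uint32_t update_fast_reg_key(uint32_t stag, uint8_t newkey)
{
	return (stag & 0xffffff00u) | newkey;
}
```

Under that assumption the index is unchanged across updates, so only the key byte differs between successive registrations of the same MR.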
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove subversion $Id lines and improve readability by fixing other
coding style problems pointed out by checkpatch.pl.
Signed-off-by: Dotan Barak <dotanba@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The license text for several files references a third software license
that was inadvertently copied in. Update the license to what was
intended. This update was based on a request from HP.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In 2.6.26, we added some support for send with invalidate work
requests, including a device capability flag to indicate whether a
device supports such requests. However, the support was incomplete:
the completion structure was not extended with a field for the key
contained in incoming send with invalidate requests.
Full support for memory management extensions (send with invalidate,
local invalidate, fast register through a send queue, etc) is planned
for 2.6.27. Since send with invalidate is not very useful by itself,
just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
which makes things simpler for applications, since they will not have
quite as confusing an array of fine-grained bits to check.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add support for modifying CQ parameters for controlling event
generation moderation.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The value chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
LSO (large send offload) allows the networking stack to pass SKBs with
data size larger than the MTU to the IPoIB driver and have the HCA HW
fragment the data to multiple MSS-sized packets. Add a device
capability flag IB_DEVICE_UD_TSO for devices that can perform TCP
segmentation offload, a new send work request opcode IB_WR_LSO,
header, hlen and mss fields for the work request structure, and a new
IB_WC_LSO completion type.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IDR IDs are signed, so struct ib_uobject.id should be signed. This
avoids some sparse pointer signedness warnings.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a device capability to show when it can handle checksum offload.
Also add a send flag for inserting checksums and a csum_ok field to
the completion record.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is based on user feedback from Doug Ledford at RedHat:
Events that occur on an rdma_cm_id are reported to userspace through an
event channel. Connection request events are reported on the event
channel associated with the listen. When the connection is accepted, a
new rdma_cm_id is created and automatically uses the listen event
channel. This is suboptimal where the user only wants listen events on
that channel.
Additionally, it may be desirable to have events related to connection
establishment use a different event channel than those related to
already established connections.
Allow the user to migrate an rdma_cm_id between event channels. All
pending events associated with the rdma_cm_id are moved to the new event
channel.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To allow ULPs to tune timeout values and capture retry statistics,
report the number of times that a mad send operation was retried.
For RMPP mads, report the total number of times that any portion
(send window) of the send operation was retried.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <mshefty@ichips.intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Code that is not architecture-specific should not
#include <asm/scatterlist.h>. This patch therefore either replaces such
includes with #include <linux/scatterlist.h> or simply removes them if
they were unused.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The IB CM provides a message received acknowledged (MRA) message that
can be sent to indicate that a REQ or REP message has been received, but
will require more time to process than the timeout specified by those
messages. In many cases, the application may not know how long it will
take to respond to a CM message, but the majority of the time, it will
usually respond before a retry has been sent. Rather than sending an
MRA in response to all messages just to handle the case where a longer
timeout is needed, it is more efficient to queue the MRA for sending in
case a duplicate message is received.
This avoids sending an MRA when it is not needed, but limits the number
of times that a REQ or REP will be resent. It also provides for a
simpler implementation than generating the MRA based on a timer event.
(That is, trying to send the MRA after receiving the first REQ or REP if
a response has not been generated, so that it is received at the remote
side before a duplicate REQ or REP has been received)
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The declaration of struct ib_user_mad_reg_req.method_mask[] exported
to userspace was an array of __u32, but the kernel internally treated
it as a bitmap made up of longs. This makes a difference for 64-bit
big-endian kernels, where numbering the bits in an array of __u32 gives:
|31.....0|63....32|95....64|127...96|
while numbering the bits in an array of longs gives:
|63..............0|127............64|
64-bit userspace can handle this by just treating method_mask[] as an
array of longs, but 32-bit userspace is really stuck: the meaning of
the bits in method_mask[] depends on whether the kernel is 32-bit or
64-bit, and there's no sane way for userspace to know that.
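The mismatch can be made concrete with a small arithmetic sketch. The helpers below are hypothetical and only model the two layouts; they compute which 32-bit word (in memory order) holds a given bit number, without depending on the host's endianness.

```c
#include <stdint.h>

/* Word index of bit n when method_mask[] is read as an array of __u32. */
static inline unsigned word_as_u32_array(unsigned n)
{
	return n / 32;
}

/*
 * Word index (in memory order, counted in 32-bit units) of bit n when
 * the same memory is a bitmap of 64-bit big-endian longs: the high half
 * of each long sits at the lower address.
 */
static inline unsigned word_as_be64_longs(unsigned n)
{
	unsigned long_idx = n / 64;
	unsigned bit_in_long = n % 64;

	return 2 * long_idx + (bit_in_long < 32 ? 1 : 0);
}
```

For example, bit 0 lives in word 0 under the __u32 view but in word 1 under the 64-bit big-endian long view.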
Fix this by updating <rdma/ib_user_mad.h> to make it clear that
method_mask[] is an array of longs, and using a compat_ioctl method to
convert to an array of 64-bit longs to handle the 32-on-64 problem.
This fixes the interface description to match existing behavior (so
working binaries continue to work) in almost all situations, and gives
consistent semantics in the case of 32-bit userspace that can run on
either a 32-bit or 64-bit kernel, so that the same binary can work for
both 32-on-32 and 32-on-64 systems.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for setting the P_Key index of sent MADs and getting the
P_Key index of received MADs. This requires a change to the layout of
the ABI structure struct ib_user_mad_hdr, so to avoid breaking
compatibility, we default to the old (unchanged) ABI and add a new
ioctl IB_USER_MAD_ENABLE_PKEY that allows applications that are aware
of the new ABI to opt into using it.
We plan on switching to the new ABI by default in a year or so, and
this patch adds a warning that is printed when an application uses the
old ABI, to push people towards converting to the new ABI.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Hal Rosenstock <hal@xsigo.com>
During ib_umem_get(), determine whether all pages from the memory
region are hugetlb pages and report this in the "hugetlb" member.
Low-level drivers can use this information if they need it.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export the ability to set the type of service to user space. Model
the interface after setsockopt.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Provide support to specify a type of service for a communication
identifier. A new function call is used when dealing with IPv4
addresses. For IPv6 addresses, the ToS is specified through the
traffic class field in the sockaddr_in6 structure.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
[ The comments Eitan Zahavi and myself have made over the v1 post at
<http://lists.openfabrics.org/pipermail/general/2007-August/039247.html>
were fully addressed. ]
Reviewed-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The QoS annex defines new fields for path records. Add them to the
ib_sa for consumers that want to use them.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After moving the definition of struct ib_umem_chunk from ib_verbs.h to
ib_umem.h there isn't any reason for the macro IB_UMEM_MAX_PAGE_CHUNK
to stay in ib_verbs.h. Move the macro to umem.c, the only place where
it is used.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_verbs.h uses struct list_head and rw_semaphore, so while the files
<linux/list.h> and <linux/rwsem.h> seem to be pulled in indirectly by
the other header files it includes, the right thing is to include
those files directly.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_mad.h uses struct list_head, so while linux/list.h seems to be
pulled in indirectly by one of the headers it includes, the right
thing is to include linux/list.h directly.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IB CM should include the HCA ACK delay when calculating the local
ACK timeout value to use for RC QPs. If the HCA ACK delay is large
enough relative to the packet life time, then if it is not taken into
account, the calculated timeout value ends up being too small, which
can result in "retry exceeded" errors.
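As a rough illustration of the arithmetic (not the formula used by the IB CM code), the sketch below assumes that times are encoded as 4.096 usec * 2^n and that the timeout should cover a round trip (twice the packet life time) plus the HCA ACK delay. The function name and the input limit of 31 are assumptions made for this example.

```c
#include <stdint.h>

/*
 * Smallest exponent n (capped at 31) such that 2^n covers
 * 2 * 2^packet_life_time + 2^ca_ack_delay, all in 4.096 usec units.
 * Assumption: both inputs are at most 31.
 */
static inline uint8_t ack_timeout_exp(uint8_t packet_life_time,
				      uint8_t ca_ack_delay)
{
	uint64_t total = (2ULL << packet_life_time) + (1ULL << ca_ack_delay);
	uint8_t n = 0;

	while (n < 31 && (1ULL << n) < total)
		n++;
	return n;
}
```

With a packet life time exponent of 14 and an ACK delay exponent of 20, ignoring the delay would give an exponent of 15, while including it gives 21.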
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MADs sent to the SA should use the default P_Key (0x7fff/0xffff).
There's no requirement that the default P_Key is stored at index 0 in
the local P_Key table, so add code to the sa_query module to look up
the index of the default P_Key when creating an address handle for the
SA (which is done any time the P_Key table might change), and use this
index for all SA queries.
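The lookup can be sketched as a linear search over a P_Key table. This is a minimal standalone illustration with a hypothetical function name; it is not the sa_query code, which obtains the table from the device.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Return the index of the default P_Key in a P_Key table, or -1.
 * The top bit is the membership bit, so 0x7fff and 0xffff both match.
 */
static inline int find_default_pkey_index(const uint16_t *table, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if ((table[i] & 0x7fff) == 0x7fff)
			return (int)i;
	return -1;
}
```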
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
IB/cm: Improve local id allocation
IPoIB/cm: Fix SRQ WR leak
IB/ipoib: Fix typos in error messages
IB/mlx4: Check if SRQ is full when posting receive
IB/mlx4: Pass send queue sizes from userspace to kernel
IB/mlx4: Fix check of opcode in mlx4_ib_post_send()
mlx4_core: Fix array overrun in dump_dev_cap_flags()
IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions
IB/mthca: Fix RESET to ERROR transition
IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
IB/mthca: Set GRH:HopLimit when building MLX headers
IB/mlx4: Fix check of max_qp_dest_rdma in modify QP
IB/mthca: Fix use-after-free on device restart
IB/ehca: Return proper error code if register_mr fails
IPoIB: Handle P_Key table reordering
IB/core: Use start_port() and end_port()
IB/core: Add helpers for uncached GID and P_Key searches
IB/ipath: Fix potential deadlock with multicast spinlocks
IB/core: Free umem when mm is already gone
The first thing mm.h does is include sched.h, solely for the can_do_mlock()
inline function, which dereferences "current". By dealing with can_do_mlock(),
mm.h can be detached from sched.h, which is good. See below for why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add ib_find_gid() and ib_find_pkey() functions that use uncached device
queries. The calls might block but the returns are always up-to-date.
Cache P_Key and GID table lengths in core to avoid extra port info queries.
Signed-off-by: Yosef Etigin <yosefe@voltaire.com>
Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When memory pinned with ib_umem_get() is released, ib_umem_release()
needs to subtract the amount of memory being unpinned from
mm->locked_vm. However, ib_umem_release() may be called with
mm->mmap_sem already held for writing if the memory is being released
as part of an munmap() call, so it is sometimes necessary to defer
this accounting into a workqueue.
However, the work struct used to defer this accounting is dynamically
allocated before it is queued, so there is the possibility of failing
that allocation. If the allocation fails, then ib_umem_release has no
choice except to bail out and leave the process with a permanently
elevated locked_vm.
Fix this by allocating the structure to defer accounting as part of
the original struct ib_umem, so there's no possibility of failing a
later allocation if creating the struct ib_umem and pinning memory
succeeds.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export ib_umem_get()/ib_umem_release() and put low-level drivers in
control of when to call ib_umem_get() to pin and DMA map userspace,
rather than always calling it in ib_uverbs_reg_mr() before calling the
low-level driver's reg_user_mr method.
Also move these functions to be in the ib_core module instead of
ib_uverbs, so that driver modules using them do not depend on
ib_uverbs.
This has a number of advantages:
- It is better design from the standpoint of making generic code a
library that can be used or overridden by device-specific code as
the details of specific devices dictate.
- Drivers that do not need to pin userspace memory regions do not
need to take the performance hit of calling ib_umem_get(). For
example, although I have not tried to implement it in this patch,
the ipath driver should be able to avoid pinning memory and just
use copy_{to,from}_user() to access userspace memory regions.
- Buffers that need special mapping treatment can be identified by
the low-level driver. For example, it may be possible to solve
some Altix-specific memory ordering issues with mthca CQs in
userspace by mapping CQ buffers with extra flags.
- Drivers that need to pin and DMA map userspace memory for things
other than memory regions can use ib_umem_get() directly, instead
of hacks using extra parameters to their reg_phys_mr method. For
example, the mlx4 driver that is pending being merged needs to pin
and DMA map QP and CQ buffers, but it does not need to create a
memory key for these buffers. So the cleanest solution is for mlx4
to call ib_umem_get() in the create_qp and create_cq methods.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
IPoIB: Convert to NAPI
IB: Return "maybe missed event" hint from ib_req_notify_cq()
IB: Add CQ comp_vector support
IB/ipath: Fix a race condition when generating ACKs
IB/ipath: Fix two more spin lock problems
IB/fmr_pool: Add prefix to all printks
IB/srp: Set proc_name
IB/srp: Add orig_dgid sysfs attribute to scsi_host
IPoIB/cm: Don't crash if remote side uses one QP for both directions
RDMA/cxgb3: Support for new abort logic
RDMA/cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message
RDMA/cxgb3: Fail qp creation if the requested max_inline is too large
RDMA/cxgb3: Fix TERM codes
IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
IB/ipath: Don't corrupt pending mmap list when unmapped objects are freed
IB/mthca: Work around kernel QP starvation
IB/ipath: Don't put QP in timeout queue if waiting to send
IB/ipath: Don't call spin_lock_irq() from interrupt context
The semantics defined by the InfiniBand specification say that
completion events are only generated when a completion is added to a
completion queue (CQ) after completion notification is requested. In
other words, this means that the following race is possible:
    while (CQ is not empty)
        ib_poll_cq(CQ);
    // new completion is added after while loop is exited
    ib_req_notify_cq(CQ);
    // no event is generated for the existing completion
To close this race, the IB spec recommends doing another poll of the
CQ after requesting notification.
However, it is not always possible to arrange code this way (for
example, we have found that NAPI for IPoIB cannot poll after
requesting notification). Also, some hardware (eg Mellanox HCAs)
actually will generate an event for completions added before the call
to ib_req_notify_cq() -- which is allowed by the spec, since there's
no way for any upper-layer consumer to know exactly when a completion
was really added -- so the extra poll of the CQ is just a waste.
Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
ib_req_notify_cq() so that it can return a hint about whether a
completion may have been added before the request for notification.
The return value of ib_req_notify_cq() is extended so:
    < 0   means an error occurred while requesting notification
    == 0  means notification was requested successfully, and if
          IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
          events were missed and it is safe to wait for another
          event.
    > 0   is only returned if IB_CQ_REPORT_MISSED_EVENTS was
          passed in. It means that the consumer must poll the
          CQ again to make sure it is empty to avoid the race
          described above.
We add a flag to enable this behavior rather than turning it on
unconditionally, because checking for missed events may incur
significant overhead for some low-level drivers, and consumers that
don't care about the results of this test shouldn't be forced to pay
for the test.
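
As a rough illustration of how a consumer might use the hint, here is a
userspace sketch built on a mock CQ. Everything prefixed mock_ or MOCK_ is
a hypothetical stand-in written for this example, not the kernel API; only
the return value convention (< 0, == 0, > 0) is taken from the description
above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a completion queue; not struct ib_cq. */
struct mock_cq {
	int pending;	/* completions waiting to be polled */
	bool armed;	/* notification has been requested */
};

/* Stand-in for IB_CQ_REPORT_MISSED_EVENTS. */
#define MOCK_REPORT_MISSED_EVENTS 0x1

/* Poll one completion: returns 1 if one was consumed, 0 if the CQ is empty. */
static int mock_poll_cq(struct mock_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Arm the CQ; with the flag set, return > 0 if completions are already queued. */
static int mock_req_notify_cq(struct mock_cq *cq, int flags)
{
	cq->armed = true;
	if ((flags & MOCK_REPORT_MISSED_EVENTS) && cq->pending > 0)
		return 1;
	return 0;
}

/*
 * Drain the CQ, then arm it.  'late' completions are injected between the
 * last poll and the notify request to model the race.  Returns the number
 * of completions handled.
 */
static int drain_and_arm(struct mock_cq *cq, int late)
{
	int handled = 0;
	int ret;

	for (;;) {
		while (mock_poll_cq(cq))
			handled++;

		cq->pending += late;	/* completions racing with the arm */
		late = 0;

		ret = mock_req_notify_cq(cq, MOCK_REPORT_MISSED_EVENTS);
		if (ret <= 0)
			break;		/* error, or armed with nothing missed */
		/* ret > 0: something may have been missed, so poll again */
	}
	return handled;
}
```

In this sketch the extra poll only happens when the notify call reports
that something may have been missed, which is the point of the flag.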
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a num_comp_vectors member to struct ib_device and extend
ib_create_cq() to pass in a comp_vector parameter -- this parallels
the userspace libibverbs API. Update all hardware drivers to set
num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
value. Pass the value of num_comp_vectors to userspace rather than
hard-coding a value of 1.
We want multiple CQ event vector support (via MSI-X or similar for
adapters that can generate multiple interrupts), but it's not clear
how many vectors we want, or how we want to deal with policy issues
such as how to decide which vector to use or how to set up interrupt
affinity. This patch is useful for experimenting, since no core
changes will be necessary when updating a driver to support multiple
vectors, and we know that we want to make at least these changes
anyway.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
I noticed that many source files include <linux/pci.h> while they do
not appear to need it. Here is an attempt to clean it all up.
In order to find all possibly affected files, I searched for all
files including <linux/pci.h> but without any other occurrence of "pci"
or "PCI". I removed the include statement from all of these, then I
compiled an allmodconfig kernel on both i386 and x86_64 and fixed the
false positives manually.
My tests covered 66% of the affected files, so there could be false
positives remaining. Untested files are:
arch/alpha/kernel/err_common.c
arch/alpha/kernel/err_ev6.c
arch/alpha/kernel/err_ev7.c
arch/ia64/sn/kernel/huberror.c
arch/ia64/sn/kernel/xpnet.c
arch/m68knommu/kernel/dma.c
arch/mips/lib/iomap.c
arch/powerpc/platforms/pseries/ras.c
arch/ppc/8260_io/enet.c
arch/ppc/8260_io/fcc_enet.c
arch/ppc/8xx_io/enet.c
arch/ppc/syslib/ppc4xx_sgdma.c
arch/sh64/mach-cayman/iomap.c
arch/xtensa/kernel/xtensa_ksyms.c
arch/xtensa/platform-iss/setup.c
drivers/i2c/busses/i2c-at91.c
drivers/i2c/busses/i2c-mpc.c
drivers/media/video/saa711x.c
drivers/misc/hdpuftrs/hdpu_cpustate.c
drivers/misc/hdpuftrs/hdpu_nexus.c
drivers/net/au1000_eth.c
drivers/net/fec_8xx/fec_main.c
drivers/net/fec_8xx/fec_mii.c
drivers/net/fs_enet/fs_enet-main.c
drivers/net/fs_enet/mac-fcc.c
drivers/net/fs_enet/mac-fec.c
drivers/net/fs_enet/mac-scc.c
drivers/net/fs_enet/mii-bitbang.c
drivers/net/fs_enet/mii-fec.c
drivers/net/ibm_emac/ibm_emac_core.c
drivers/net/lasi_82596.c
drivers/parisc/hppb.c
drivers/sbus/sbus.c
drivers/video/g364fb.c
drivers/video/platinumfb.c
drivers/video/stifb.c
drivers/video/valkyriefb.c
include/asm-arm/arch-ixp4xx/dma.h
sound/oss/au1550_ac97.c
I would welcome test reports for these files. I am fine with removing
the untested files from the patch if the general opinion is that these
changes aren't safe. The tested part would still be nice to have.
Note that this patch depends on another header fixup patch I submitted
to LKML yesterday:
[PATCH] scatterlist.h needs types.h
http://lkml.org/lkml/2007/3/01/141
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Extend rdma_cm to support multicast communication. Multicast support
is added to the existing RDMA_PS_UDP port space, as well as a new
RDMA_PS_IPOIB port space. The latter port space allows joining the
multicast groups used by IPoIB, which enables offloading IPoIB traffic
to a separate QP. The port space determines the signature used in the
MGID when joining the group. The newly added RDMA_PS_IPOIB also
allows for unicast operations, similar to RDMA_PS_UDP.
Supporting the RDMA_PS_IPOIB requires changing how UD QPs are initialized,
since we can no longer assume that the qkey is constant. This requires
saving the Q_Key to use when attaching to a device, so that it is
available when creating the QP. The Q_Key information is exported to
the user through the existing rdma_init_qp_attr() interface.
Multicast support is also exported to userspace through the rdma_ucm.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IB SA tracks multicast join/leave requests on a per port basis and
does not do any reference counting: if two users of the same port join
the same group, and one leaves that group, then the SA will remove the
port from the group even though one user who wants to remain a
member is left. Therefore, in order to support multiple users of the
same multicast group from the same port, we need to perform reference
counting locally.
To do this, add a multicast submodule to ib_sa to perform reference
counting of multicast join/leave operations. Modify ib_ipoib (the
only in-kernel user of multicast) to use the new interface.
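
The idea of local reference counting can be sketched as follows. This is a
simplified model, not the ib_sa multicast code: struct mock_group and the
mock_join()/mock_leave() helpers are hypothetical, and the counters only
record how many requests would actually reach the SA.

```c
/* Hypothetical per-port group record; the real module is more involved. */
struct mock_group {
	int refcount;	/* local users of this group on this port */
	int sa_joins;	/* join requests that would be sent to the SA */
	int sa_leaves;	/* leave requests that would be sent to the SA */
};

/* Only the first local user triggers a join at the SA. */
static void mock_join(struct mock_group *g)
{
	if (g->refcount++ == 0)
		g->sa_joins++;
}

/* Only the last local user triggers a leave at the SA. */
static void mock_leave(struct mock_group *g)
{
	if (g->refcount > 0 && --g->refcount == 0)
		g->sa_leaves++;
}
```

With two local users, the first leave changes only the local count, so the
port stays in the group until the second user leaves as well.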
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make the untyped data region in ib_user_mad have type u64 so that it
gets aligned properly. This avoids alignment faults in ib_umad when
casting the data field to an rmpp_mad and accessing the 64-bit tid
field on architectures like ia64.
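
The effect of the type change on layout can be seen in a small sketch. The
12-byte header below is hypothetical and chosen only to make the difference
visible; it is not the real ib_user_mad header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte header; not the real struct ib_user_mad_hdr. */
struct mock_hdr {
	uint32_t id;
	uint32_t status;
	uint32_t length;
};

/* Byte-typed payload: data starts right after the header, at offset 12. */
struct mock_mad_bytes {
	struct mock_hdr hdr;
	uint8_t data[8];
};

/*
 * 64-bit-typed payload: the compiler places data at an offset suitable
 * for uint64_t and raises the alignment of the whole struct to match.
 */
struct mock_mad_u64 {
	struct mock_hdr hdr;
	uint64_t data[1];
};
```

Casting the byte-typed data field to a structure with a 64-bit member is
what can fault on strict-alignment machines; with the u64 field the
compiler guarantees suitable alignment.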
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_wc currently only includes the local QP number: this matches
the IB spec, but seems mostly useless. The following patch replaces
this with the pointer to qp itself, and updates all low level drivers
and all users.
This has the following advantages:
- Ability to get a per-qp context through wc->qp->qp_context
- Existing drivers already have the qp pointer ready in poll cq, so
this change actually saves a tiny bit (extra memory read) on data path
(for ehca it would actually be expensive to find the QP pointer when
polling a CQ, but ehca does not support SRQ so we can leave wc->qp as
NULL for ehca)
- Users that need the QP number can still get it through wc->qp->qp_num
Use case:
In IPoIB connected mode code, I have a common CQ shared by multiple
QPs. To track connection usage, I need a way to get at some per-QP
context upon the completion, and I would like to avoid allocating
context object per work request just to stick a QP pointer into it.
With this code, I can just use wc->qp->qp_context.
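
A minimal sketch of that use case, with simplified stand-in types: the
mock_ structures below carry only the fields needed for the example, and
struct mock_conn is a hypothetical per-connection state.

```c
#include <stddef.h>

/* Simplified stand-ins; the kernel structures have many more fields. */
struct mock_qp {
	unsigned int qp_num;
	void *qp_context;
};

struct mock_wc {
	struct mock_qp *qp;
};

/* Hypothetical per-connection state hung off qp_context. */
struct mock_conn {
	int completions;
};

/* Handle completions from one CQ that is shared by several QPs. */
static void handle_completions(const struct mock_wc *wc, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* No per-work-request context object is needed here. */
		struct mock_conn *conn = wc[i].qp->qp_context;

		conn->completions++;
	}
}
```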
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
<rdma/ib_verbs.h> uses struct kref, so it should include <linux/kref.h>
explicitly to avoid hidden include dependencies.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ib_dma_alloc_coherent() wrapper uses a u64* for the dma_handle
parameter, unlike dma_alloc_coherent, which uses dma_addr_t*. This
means that we need a temporary variable to handle the case when
ib_dma_alloc_coherent() just falls through directly to
dma_alloc_coherent() on architectures where sizeof u64 != sizeof
dma_addr_t.
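
The shape of the fix can be sketched in userspace. The example assumes a
platform with a 32-bit dma_addr_t; mock_dma_alloc_coherent() is a
hypothetical stand-in that returns a made-up bus address, and the wrapper
only mirrors the fall-through case described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption for illustration: dma_addr_t is 32 bits on this platform. */
typedef uint32_t mock_dma_addr_t;

/* Stand-in for dma_alloc_coherent(); the bus address is made up. */
static void *mock_dma_alloc_coherent(size_t size, mock_dma_addr_t *handle)
{
	void *cpu_addr = malloc(size);

	*handle = cpu_addr ? 0x1000u : 0;
	return cpu_addr;
}

/* Wrapper that exposes a 64-bit handle to its callers. */
static void *mock_ib_dma_alloc_coherent(size_t size, uint64_t *dma_handle)
{
	mock_dma_addr_t handle;	/* temporary of the type the callee expects */
	void *ret = mock_dma_alloc_coherent(size, &handle);

	/*
	 * Passing dma_handle straight through would hand a uint64_t * to a
	 * function that writes only 32 bits, leaving the rest undefined.
	 */
	*dma_handle = handle;
	return ret;
}
```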
Signed-off-by: Roland Dreier <rolandd@cisco.com>
At least on PPC, the "op ? op : dma" construct causes a compile failure
because the dma_* is a do{}while(0) macro.
This turns all of them into proper if/else to avoid this problem.
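
A small sketch of the problem, using a hypothetical statement-style macro
in place of the architecture's dma_* helper. The names are made up for the
example.

```c
#include <stddef.h>

/* Hypothetical statement-style macro, shaped like the failing helpers. */
#define mock_dma_sync(counter) do { (*(counter))++; } while (0)

/* Hypothetical per-device override table. */
struct mock_ops {
	void (*sync)(int *counter);
};

static void mock_ops_sync(int *counter)
{
	*counter += 10;
}

static void mock_ib_dma_sync(const struct mock_ops *ops, int *counter)
{
	/*
	 * "ops ? ops->sync(counter) : mock_dma_sync(counter);" does not
	 * compile: do { } while (0) is a statement, not an expression, so
	 * it cannot be an operand of the conditional operator.
	 */
	if (ops)
		ops->sync(counter);
	else
		mock_dma_sync(counter);
}
```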
Signed-off-by: Ben Collins <bcollins@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA.
This patch allows a verbs device driver to interpose on DMA mapping
function calls in order to avoid relying on bus_to_virt() and
phys_to_virt() to undo the mappings created by dma_map_single(),
dma_map_sg(), etc.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export the rdma cm interfaces to userspace via a misc device.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Allow the use of UD QPs through the rdma_cm, in order to provide
address translation services for resolving IB addresses for datagram
messages using SIDR.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During connection establishment, the passive side of a connection can
receive messages from the active side before the connection event has
been delivered to the user. Allow the passive side to send messages
in response to received data before the event is delivered. To handle
the case where the connection messages are lost, a new rdma_notify()
function is added that users may invoke to force a connection into the
established state.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Connection information was never given to the recipient of a
connection request or reply message. Only the event was delivered.
Report the connection data with the event to allow users to
reject the connection based on the requested parameters, or adjust
their resources to match the request.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The qp_type parameter into the rdma_cm is unneeded, and can be
misleading. The QP type should be determined from the port space.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ib_cm_establish() function is replaced with a more generic
ib_cm_notify(). This routine is used to notify the CM that failover
has occurred, so that future CM messages (LAP, DREQ) reach the remote
CM. (Currently, we continue to use the original path.) This bumps the
userspace CM ABI.
New alternate path information is captured when a LAP message is sent
or received. This allows QP attributes to be initialized for the user
when a new path is loaded after failover occurs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Require registration with ib_addr module to prevent caller from
unloading while a callback is in progress.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Return the sq_draining value back to user space for query_qp instead
of the en_sqd_async notify value, which is valid only for
modify_qp. For query_qp, the draining status should be returned.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Relevant SA queries are actually "greater than" / "less than", not
"greater than or equal" / "less than or equal" as the names imply.
(See IB spec 1.2 Vol 1, 15.2.5.16 PATHRECORD/Table 205 PathRecord)
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Document the sending of a reject and the transition of the QP to the
error state that are done in rdma_accept().
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Clarify that rdma_destroy_id cancels outstanding asynchronous operations on the
associated id.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Require users to register with SA module, to prevent the sa_query
module text from going away while an SA query callback is still
running. Update all in-tree users for the new interface.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Modifications to the existing rdma header files, core files, drivers,
and ulp files to support iWARP, including:
- Hook iWARP CM into the build system and use it in rdma_cm.
- Convert enum ib_node_type to enum rdma_node_type, which includes
the possibility of RDMA_NODE_RNIC, and update everything for this.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an iWARP Connection Manager (CM), which abstracts connection
management for iWARP devices (RNICs). It is a logical instance of the
xx_cm where xx is the transport type (ib or iw). The symbols exported
are used by the transport independent rdma_cm module, and are
available also for transport dependent ULPs.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass a struct ib_udata to the low-level driver's ->modify_srq() and
->modify_qp() methods, so that it can get to the device-specific data
passed in by the userspace driver.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a ib_uverbs_resize_cq_resp.driver_data field so that low-level
drivers can return data from a resize CQ operation to userspace. Have
ib_uverbs_resize_cq() only copy the cqe field, to avoid having to bump
the userspace ABI.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Validate MADs sent by userspace clients for spec compliance with
C13-18.1.1 (prevent duplicate requests and responses sent on the
same port). Without this, RMPP transactions get aborted because
of duplicate packets.
This patch is similar to that provided by Jack Morgenstein.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_fmr_pool_map_phys gets the virtual address by pointer but never writes
there, and users (e.g. srp) seem to assume this and ignore the value
returned. This patch cleans up the API to get the VA by value, and updates
all users.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Acked-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The device address contains unsigned character arrays, which contain raw GID
addresses. The GIDs may not be naturally aligned, so do not cast them to
structures or unions.
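
One way to read a GID out of a raw byte array without casting is to copy
it. The sketch below is illustrative only: union mock_gid is a simplified
stand-in, and the offset parameter is an assumption of the example rather
than anything taken from this patch.

```c
#include <stdint.h>
#include <string.h>

/* Simplified 16-byte GID with a 64-bit view of its two halves. */
union mock_gid {
	uint8_t raw[16];
	struct {
		uint64_t subnet_prefix;
		uint64_t interface_id;
	} global;
};

/*
 * Copy a GID out of a byte array at an arbitrary offset.  memcpy() makes
 * no alignment assumptions about the source, unlike a pointer cast to
 * union mock_gid *.
 */
static void mock_get_gid(const unsigned char *addr, size_t offset,
			 union mock_gid *gid)
{
	memcpy(gid, addr + offset, sizeof(*gid));
}
```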
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds the Kbuild files listing the files which are to be installed by
the 'headers_install' make target, in generic directories.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Currently, all userspace verbs operations that call into the kernel
are serialized by ib_uverbs_idr_mutex. This can be a scalability
issue for some workloads, especially for devices driven by the ipath
driver, which needs to call into the kernel even for datapath
operations.
Fix this by adding reference counts to the userspace objects, and then
converting ib_uverbs_idr_mutex into a spinlock that only protects the
idrs long enough to take a reference on the object being looked up.
Because remove operations may fail, we have to do a slightly funky
two-step deletion, which is described in the comments at the top of
uverbs_cmd.c.
This also still leaves ib_uverbs_idr_lock as a single lock that is
possibly subject to contention. However, the lock hold time will only
be a single idr operation, so multiple threads should still be able to
make progress, even if ib_uverbs_idr_lock is being ping-ponged.
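
The lookup pattern can be sketched in userspace with C11 atomics. This is
a rough model under stated assumptions: a fixed array stands in for the
idr, an atomic_flag stands in for the spinlock, all mock_ names are
hypothetical, and the two-step deletion mentioned above is not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MOCK_TABLE_SIZE 8

/* Hypothetical userspace object with a reference count. */
struct mock_uobject {
	int id;
	atomic_int refs;
};

static atomic_flag mock_idr_lock = ATOMIC_FLAG_INIT;	/* stand-in spinlock */
static struct mock_uobject *mock_idr[MOCK_TABLE_SIZE];	/* stand-in idr */

static void mock_lock(void)
{
	while (atomic_flag_test_and_set(&mock_idr_lock))
		;	/* spin */
}

static void mock_unlock(void)
{
	atomic_flag_clear(&mock_idr_lock);
}

/* Hold the lock only long enough to find the object and take a reference. */
static struct mock_uobject *mock_get(int id)
{
	struct mock_uobject *obj = NULL;

	mock_lock();
	if (id >= 0 && id < MOCK_TABLE_SIZE && mock_idr[id]) {
		obj = mock_idr[id];
		atomic_fetch_add(&obj->refs, 1);
	}
	mock_unlock();
	return obj;
}

/* Drop a reference; returns 1 when the last reference is gone. */
static int mock_put(struct mock_uobject *obj)
{
	return atomic_fetch_sub(&obj->refs, 1) == 1;
}
```

The command itself then runs with only the reference held, so unrelated
commands no longer wait on each other.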
Surprisingly, these changes even shrink the object code:
add/remove: 23/5 grow/shrink: 4/21 up/down: 633/-693 (-60)
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a call to initialize address handle attributes given a path record.
This is used by the CM, and would be useful for users of UD QPs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a function to initialize address handle attributes from a work
completion. This functionality is duplicated by both verbs and the CM.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The P_Key is provided into a SIDR REQ in two places, once as a
parameter, and again in the path record. Remove the P_Key as a
parameter and always use the one given in the path record.
This change has no practical effect on ABI functionality.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move ipath's struct port_info into <rdma/ib_smi.h>, so that it can be
used by mthca to implement client reregister support.
Remove the __attribute__((packed)) because all the members of the struct
are naturally aligned anyway.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add IB_EVENT_CLIENT_REREGISTER to enum so low-level drivers can
generate "client reregister" events.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an LMC cache to struct ib_device, and add a function
ib_get_cached_lmc() to query the cache.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Kernel connection management agent over InfiniBand that connects based
on IP addresses. The agent defines a generic RDMA connection
abstraction to support clients wanting to connect over different RDMA
devices.
The agent also handles RDMA device hotplug events on behalf of clients.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an address translation service that maps IP addresses to
InfiniBand GID addresses using IPoIB.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Extend matching connection requests to listens in the InfiniBand CM to
include private data checks.
This allows applications to listen on the same service identifier,
with private data directing the request to the appropriate application.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Provide common handling for marshalling data between userspace clients
and kernel InfiniBand drivers.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add RMPP support for additional management classes that support it.
Also, validate RMPP is consistent with management class specified.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for sending and receiving large RMPP transfers. The old
code supports transfers only as large as a single contiguous kernel
memory allocation. This patch uses a linked list of memory buffers when
sending and receiving data to avoid needing contiguous pages for
larger transfers.
Receive side: copy the arriving MADs in chunks instead of coalescing
to one large buffer in kernel space.
Send side: split a multipacket MAD buffer to a list of segments,
(multipacket_list) and send these using a gather list of size 2.
Also, save pointer to last sent segment, and retrieve requested
segments by walking list starting at last sent segment. Finally,
save pointer to last-acked segment. When retrying, retrieve
segments for resending relative to this pointer. When updating last
ack, start at this pointer.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass actual capacity of created SRQ back to userspace, so that
userspace can report accurate capacities. This requires an ABI bump,
to change struct ib_uverbs_create_srq_resp.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Have mthca's create_srq method return the actual capacity of the SRQ
that gets created. Also update comments in <rdma/ib_verbs.h> to
clarify that this is what is expected from ib_create_srq().
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The size of struct ib_uverbs_create_qp_resp is not an even multiple of 8
bytes. This causes problems for low-level drivers that add private
data after the structure: 32-bit userspace will look in the wrong
place for a response from a 64-bit kernel. Fix this by adding a
reserved field. Also, bump the ABI version because this changes the
size of a structure.
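
The layout problem can be illustrated with hypothetical structures. The
field names below are made up for the example and are not claimed to match
the real response structure; the point is only what a trailing 64-bit
driver field does when the core part is 28 bytes long.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical core response with seven 32-bit fields: 28 bytes. */
struct mock_resp_old {
	uint32_t handle;
	uint32_t qpn;
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
	uint32_t max_send_sge;
	uint32_t max_recv_sge;
	uint32_t max_inline_data;
};

/* Same fields plus explicit padding: 32 bytes on every ABI. */
struct mock_resp_new {
	uint32_t handle;
	uint32_t qpn;
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
	uint32_t max_send_sge;
	uint32_t max_recv_sge;
	uint32_t max_inline_data;
	uint32_t reserved;
};

/*
 * Hypothetical driver-private data following the core response.  With the
 * old layout, the offset of drv_data depends on how strictly the ABI
 * aligns uint64_t; with the new layout it always follows the core part
 * directly.
 */
struct mock_drv_resp_new {
	struct mock_resp_new base;
	uint64_t drv_data;
};
```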
Pointed out by Hoang-Nam Nguyen <HNGUYEN@de.ibm.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace SRQs (shared
receive queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has the underlying
ib_query_srq() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace QPs (queue pairs),
including adding an ABI for marshalling requests and responses. The
kernel midlayer already has the underlying ib_query_qp() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The in-kernel mthca driver contains a table of which attributes are
valid for each queue pair state transition. It turns out that both
other IB drivers -- ipath and ehca -- which are being prepared for
merging have copied this table, errors and all.
To forestall this code duplication, move this table and the code to
check parameters against it into a midlayer library function,
ib_modify_qp_is_ok().
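
A table-driven check of this kind might look roughly like the sketch
below. The states, attribute bits and table entries are a reduced,
hypothetical subset chosen for illustration; they are not the contents of
the real table, and mock_modify_qp_is_ok() is not the midlayer function.

```c
#include <stdbool.h>

enum mock_qp_state { MOCK_RESET, MOCK_INIT, MOCK_RTR, MOCK_NUM_STATES };

/* Hypothetical attribute bits for the example. */
enum {
	MOCK_ATTR_STATE	= 1 << 0,
	MOCK_ATTR_PKEY	= 1 << 1,
	MOCK_ATTR_PORT	= 1 << 2,
	MOCK_ATTR_PATH	= 1 << 3,
};

struct mock_transition {
	bool valid;	/* is this state change allowed at all? */
	int required;	/* attributes that must be supplied */
	int optional;	/* attributes that may be supplied */
};

/* Entries not listed here are zero-initialized, i.e. invalid. */
static const struct mock_transition mock_qp_table[MOCK_NUM_STATES][MOCK_NUM_STATES] = {
	[MOCK_RESET][MOCK_INIT]	= { true, MOCK_ATTR_PKEY | MOCK_ATTR_PORT, 0 },
	[MOCK_INIT][MOCK_RTR]	= { true, MOCK_ATTR_PATH, MOCK_ATTR_PKEY },
};

static bool mock_modify_qp_is_ok(enum mock_qp_state cur,
				 enum mock_qp_state next, int mask)
{
	const struct mock_transition *t = &mock_qp_table[cur][next];

	if (!t->valid)
		return false;

	mask &= ~MOCK_ATTR_STATE;	/* the state bit itself is always fine */

	if ((mask & t->required) != t->required)
		return false;		/* a required attribute is missing */
	if (mask & ~(t->required | t->optional))
		return false;		/* an attribute that is not allowed */
	return true;
}
```

Keeping one table in shared code means a mistake in it only has to be
fixed once.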
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows the consumer to set the page size of "pages" mapped
by the pool FMRs, which is a feature already existing in the base
verbs API. On the cosmetic side it changes ib_fmr_attr.page_size field
to be named page_shift.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Expose a writable "node_desc" sysfs attribute for InfiniBand devices.
This allows userspace to update the node description with information
such as the node's hostname, so that IB network management software
can tie its view to the real world.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle resizing userspace CQs (completion
queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has ib_resize_cq().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a node_guid field to struct ib_device. It is the responsibility
of the low-level driver to initialize this field before registering a
device with the midlayer. Convert everyone to looking at this field
instead of calling ib_query_device() when all they want is the node
GUID, and remove the node_guid field from struct ib_device_attr.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the computation of QP capabilities (max scatter/gather entries,
max inline data, etc) into the kernel, and have the uverbs module
return the values as part of the create QP response. This keeps
precise knowledge of device limits in the low-level kernel driver.
This requires an ABI bump, so while we're making changes, get rid of
the max_sge parameter for the modify SRQ command -- it's not used and
shouldn't be there.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the struct ib_device.resize_cq() method to take a plain integer
that holds the new CQ size, rather than a pointer to an integer that
it uses to return the new size. This makes the interface match the
exported ib_resize_cq() signature, and allows the low-level driver to
update the CQ size with proper locking if necessary.
No in-tree drivers are exporting this method yet.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix structure layouts to ensure same size on 32-bit and 64-bit architectures.
This permits 32-bit userspace apps on a 64-bit kernel.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The MAD layer was violating the DMA API by touching data buffers used
for sends after the DMA mapping was done. This causes problems on
non-cache-coherent architectures, because the device doing DMA won't
see updates to the payload buffers that exist only in the CPU cache.
Fix this by having all MAD consumers use ib_create_send_mad() to
allocate their send buffers, and moving the DMA mapping into the MAD
layer so it can be done just before calling send (and after any
modifications of the send buffer by the MAD layer).
Tested on a non-cache-coherent PowerPC 440SPe system.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Bind communication identifiers to a device to support device removal.
Export per HCA CM devices to userspace.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Add kernel/user ABI structures for marshalling poll CQ, request CQ
notification, post send, post receive, post SRQ receive, create AH and
destroy AH commands. These commands allow us to support userspace
verbs for devices that can't perform these operations directly from
userspace (eg the PathScale HCA).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Give each device a uverbs_cmd_mask, so that a low-level driver can
control which methods may be called on behalf of userspace.
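
A per-device command mask can be checked with a single bit test, as in
this sketch. The command numbers and the mock_ names are hypothetical;
only the idea of one bit per command comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command numbers for the example. */
enum mock_cmd {
	MOCK_CMD_CREATE_CQ	= 0,
	MOCK_CMD_POLL_CQ	= 1,
	MOCK_CMD_POST_SEND	= 2,
};

struct mock_device {
	uint64_t uverbs_cmd_mask;	/* one bit per permitted command */
};

/* Returns true if the driver allows this command on behalf of userspace. */
static bool mock_cmd_allowed(const struct mock_device *dev, enum mock_cmd cmd)
{
	return (dev->uverbs_cmd_mask & (1ULL << cmd)) != 0;
}
```

A driver that handles the data path entirely in userspace would simply
leave the poll and post bits clear.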
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add abi_version attribute to uverbs class devices to allow for
ABI versioning of device-specific interfaces.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Introduce new userspace verbs ABI version 3. This eliminates some
unneeded commands, and adds support for user-created completion
channels. This cleans up problems with file leaks on error paths, and
also makes sure that file descriptors are always installed into the
correct process.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Al Viro pointed out that the current IB userspace verbs interface
allows userspace to cause mischief by closing file descriptors before
we're ready, or issuing the same command twice at the same time. This
patch closes those races, and fixes other obvious problems such as a
module reference leak.
Some other interface bogosities will require an ABI change to fix
properly, so I'm deferring those fixes until 2.6.15.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Clean up code by using enums instead of hard-coded magic numbers.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Changes to CM to support CM and port redirection (REJ reason 24).
Signed-off-by: John Kingman <kingman <at> storagegear.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
completion events after destroying a CQ, QP or SRQ. We do this by
sweeping the event lists before returning from a destroy call, and
then returning the number of events already reported before the destroy
call. This allows userspace to wait until it has processed all events
for an object returned from the kernel before it frees its context for
the object.
The ABI of the destroy CQ, destroy QP and destroy SRQ commands has to
change to return the event count, so bump the ABI version from 1 to 2.
The userspace libibverbs library has already been updated to handle
both the old and new ABI versions.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add structure definition for ClassPortInfo format. This is
needed for (at least) handling CM redirects.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
SA: Move SA attributes to ib_sa.h so they are accessible to more than
sa_query.c. Also, remove deprecated attributes and add one missing one.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_sa.h: Define more SA methods (initially for madeye decode)
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Add user specified context to all uCM events. Users will not retrieve
any events associated with the context after destroying the corresponding
cm_id.
- Provide the ib_cm_init_qp_attr() call to userspace clients of the CM.
This call may be used to set QP attributes properly before modifying the QP.
- Fixes some error handling synchronization and cleanup issues.
- Performs some minor code cleanup.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the InfiniBand headers from drivers/infiniband/include to include/rdma.
This allows InfiniBand-using code to live elsewhere, and lets us remove the
ugly EXTRA_CFLAGS include path from the InfiniBand Makefiles.
Signed-off-by: Roland Dreier <rolandd@cisco.com>