Commit Graph

232 Commits

Author SHA1 Message Date
Or Gerlitz
c938a616aa IB/core: Add raw packet QP type
IB_QPT_RAW_PACKET allows applications to build a complete packet,
including L2 headers, when sending; on the receive side, the HW will
not strip any headers.

This QP type is designed for userspace direct access to Ethernet; for
example by applications that do TCP/IP themselves.  Only processes
with the NET_RAW capability are allowed to create raw packet QPs (the
name "raw packet QP" is supposed to suggest an analogy to AF_PACKET /
SOL_RAW sockets).

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08 11:18:09 -07:00
Or Gerlitz
c3bccbfbb7 IB/core: Use qp->usecnt to track multicast attach/detach
Just as we don't allow PDs, CQs, etc. to be destroyed if there are QPs
that are attached to them, don't let a QP be destroyed if there are
multicast group(s) attached to it.  Use the existing usecnt field of
struct ib_qp which was added by commit 0e0ec7e ("RDMA/core: Export
ib_open_qp() to share XRC TGT QPs") to track this.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08 11:16:54 -07:00
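For c3bccbfbb7 above, a minimal sketch of the idea: count multicast attachments in a use counter and refuse to destroy the QP while the counter is non-zero. The struct and function names (demo_qp, demo_attach_mcast, and so on) are hypothetical stand-ins, and the -EBUSY return value is an assumption, not the kernel code.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical, trimmed-down QP: only the use counter matters here. */
struct demo_qp {
    atomic_int usecnt;
};

/* Each multicast attach takes one reference on the QP. */
static void demo_attach_mcast(struct demo_qp *qp)
{
    atomic_fetch_add(&qp->usecnt, 1);
}

/* Each multicast detach drops the reference taken at attach time. */
static void demo_detach_mcast(struct demo_qp *qp)
{
    atomic_fetch_sub(&qp->usecnt, 1);
}

/* Destruction is refused while any multicast group is still attached. */
static int demo_destroy_qp(struct demo_qp *qp)
{
    if (atomic_load(&qp->usecnt) != 0)
        return -EBUSY; /* assumed error code for "still in use" */
    return 0;
}
```

This mirrors the rule the commit message states for PDs and CQs: an object that is still referenced cannot be destroyed.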
Roland Dreier
f0e88aeb19 Merge branches 'cma', 'cxgb3', 'cxgb4', 'ehca', 'iser', 'mad', 'nes', 'qib', 'srp' and 'srpt' into for-next 2012-03-19 09:50:33 -07:00
Or Gerlitz
d927d505c5 IB: Change CQE "csum_ok" field to a bit flag
Use a bit in wc_flags rather than a whole integer to hold the
"checksum OK" flag.  By itself, this change doesn't reduce the size of
struct ib_wc on 64bit machines -- it stays at 56 bytes because of
padding.  However, it will allow more fields to be added in the future
without enlarging the struct.  Also, it will let us have a unified
approach with future libibverbs checksum offload reporting, because a
bit flag doesn't break the library ABI.

This patch was suggested during conversation with Liran Liss
<liranl@mellanox.com>.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-03-08 12:34:27 -08:00
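As a rough illustration of d927d505c5 above, the sketch below keeps the "checksum OK" indication as one bit in a flags word instead of a dedicated integer. The struct and helper are hypothetical stand-ins, not the kernel's struct ib_wc, and the flag values are assumed for illustration only.

```c
#include <stdint.h>

/* Completion flag bits; the values here are assumptions for the sketch. */
enum demo_wc_flags {
    DEMO_WC_GRH             = 1 << 0,
    DEMO_WC_WITH_IMM        = 1 << 1,
    DEMO_WC_WITH_INVALIDATE = 1 << 2,
    DEMO_WC_IP_CSUM_OK      = 1 << 3, /* replaces a separate "int csum_ok" */
};

/* Hypothetical, trimmed-down work completion. */
struct demo_wc {
    uint32_t byte_len;
    int      wc_flags;
};

/* Returns 1 when the checksum bit is set in wc_flags, 0 otherwise. */
static int demo_wc_csum_ok(const struct demo_wc *wc)
{
    return (wc->wc_flags & DEMO_WC_IP_CSUM_OK) != 0;
}
```

Because further indications can take the remaining bits, new flags can be added later without growing the struct.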
Or Gerlitz
2e96691c31 IB: Use central enum for speed instead of hard-coded values
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file.  Add that enum, and use
it all over the code.

The IB speed/width notation is also used by iWARP and IBoE HW drivers,
which use the convention of rate = speed * width to advertise their
port link rate.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-03-05 09:25:16 -08:00
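The "rate = speed * width" convention from 2e96691c31 above can be sketched as follows. The enum and helper names are hypothetical; the multipliers assume the usual encoding in which SDR is 1 and corresponds to 2.5 Gb/s of signalling per lane, and only SDR, DDR and QDR are covered.

```c
/* Speed multipliers relative to SDR (assumed: SDR = 2.5 Gb/s per lane). */
enum demo_port_speed {
    DEMO_SPEED_SDR = 1,
    DEMO_SPEED_DDR = 2,
    DEMO_SPEED_QDR = 4,
};

/* Signalling rate of the whole link in Mb/s: speed * width * 2500. */
static unsigned int demo_link_rate_mbps(enum demo_port_speed speed,
                                        unsigned int width)
{
    return (unsigned int)speed * width * 2500u;
}
```

For example, a 4x QDR port comes out at 40000 Mb/s, which is how an Ethernet-based driver could advertise a 40 Gb/s port through the IB speed/width notation.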
Swapna Thete
d144b650c6 IB/mad: Add MAD error codes from IBA spec
Add defines for MAD error codes so that they can be used when
returning error responses.

Signed-off-by: Swapna Thete <swapna.thete@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-02-25 17:47:31 -08:00
Sean Hefty
c89d1bedf8 rdma/core: Fix sparse warnings
Clean up sparse warnings in the rdma core layer.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-01-04 09:17:45 -08:00
Linus Torvalds
f470f8d4e7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
  mlx4_core: Deprecate log_num_vlan module param
  IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
  IB/mlx4: Enable 4K mtu for IBoE
  RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
  RDMA/cxgb4: Serialize calls to CQ's comp_handler
  RDMA/cxgb3: Serialize calls to CQ's comp_handler
  IB/qib: Fix issue with link states and QSFP cables
  IB/mlx4: Configure extended active speeds
  mlx4_core: Add extended port capabilities support
  IB/qib: Hold links until tuning data is available
  IB/qib: Clean up checkpatch issue
  IB/qib: Remove s_lock around header validation
  IB/qib: Precompute timeout jiffies to optimize latency
  IB/qib: Use RCU for qpn lookup
  IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
  IB/qib: Decode path MTU optimization
  IB/qib: Optimize RC/UC code by IB operation
  IPoIB: Use the right function to do DMA unmap pages
  RDMA/cxgb4: Use correct QID in insert_recv_cqe()
  RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
  ...
2011-11-01 10:51:38 -07:00
Roland Dreier
504255f8d0 Merge branches 'amso1100', 'cma', 'cxgb3', 'cxgb4', 'fdr', 'ipath', 'ipoib', 'misc', 'mlx4', 'misc', 'nes', 'qib' and 'xrc' into for-next 2011-11-01 09:37:08 -07:00
Sean Hefty
42849b2697 RDMA/uverbs: Export ib_open_qp() capability to user space
Allow processes that share the same XRC domain to open an existing
shareable QP.  This permits those processes to receive events on the
shared QP and transfer ownership, so that any process may modify the
QP.  The latter allows the creating process to exit, while a remaining
process can still transition it for path migration purposes.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:50:56 -07:00
Sean Hefty
0e0ec7e063 RDMA/core: Export ib_open_qp() to share XRC TGT QPs
XRC TGT QPs are shared resources among multiple processes.  Since the
creating process may exit, allow other processes which share the same
XRC domain to open an existing QP.  This allows us to transfer
ownership of an XRC TGT QP to another process.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:49:51 -07:00
Sean Hefty
638ef7a6c6 RDMA/ucm: Allow user to specify QP type when creating id
Allow the user to indicate the QP type separately from the port space
when allocating an rdma_cm_id.  With RDMA_PS_IB, there is no longer a
1:1 relationship between the QP type and port space, so we need to
switch on the QP type to select between UD and connected QPs.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:40:36 -07:00
Sean Hefty
2d2e941529 RDMA/cm: Define new RDMA port space specific to IB
Add RDMA_PS_IB.  XRC QP types will use the IB port space when operating
over the RDMA CM.  For the 'IP protocol' field value, we select 0x3F,
which is listed as being for 'any local network'.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:39:52 -07:00
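For 2d2e941529 above, a small sketch of how a port space value can carry an IP protocol number in its low byte. The commit message only states that 0x3F was chosen for the 'IP protocol' field; the 0x01 prefix in the high byte and all DEMO_ names are assumptions made for this illustration.

```c
#include <stdint.h>

/* Assumed layout: high byte 0x01, low byte = IP protocol number. */
#define DEMO_PS(ip_proto) ((uint16_t)(0x0100u | (ip_proto)))

enum {
    DEMO_PS_TCP = 0x0100 | 0x06, /* IP protocol 6  */
    DEMO_PS_UDP = 0x0100 | 0x11, /* IP protocol 17 */
    DEMO_PS_IB  = 0x0100 | 0x3F, /* 'any local network' */
};

/* Extracts the IP protocol field from a port space value. */
static uint8_t demo_ps_ip_proto(uint16_t ps)
{
    return (uint8_t)(ps & 0xFFu);
}
```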
Sean Hefty
8541f8de05 RDMA/uverbs: Export XRC SRQs to user space
We require more information to create XRC SRQs than we can
exchange using the existing create SRQ ABI.  Provide an enhanced create
ABI for extended SRQ types.

Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>
and Roland Dreier <roland@purestorage.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:29:18 -07:00
Sean Hefty
53d0bd1e7f RDMA/uverbs: Export XRC domains to user space
Allow user space to create XRC domains.  Because XRCDs are expected to
be shared among multiple processes, we use inodes to identify an XRCD.

Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:21:24 -07:00
Sean Hefty
d3d72d909e RDMA/verbs: Cleanup XRC TGT QPs when destroying XRCD
XRC TGT QPs are intended to be shared among multiple users and
processes.  Allow the destruction of an XRC TGT QP to be done explicitly
through ib_destroy_qp() or when the XRCD is destroyed.

To support destroying an XRC TGT QP, we need to track TGT QPs with the
XRCD.  When the XRCD is destroyed, all tracked XRC TGT QPs are also
cleaned up.

To avoid stale reference issues, if a user is holding a reference on a
TGT QP, we increment a reference count on the QP.  The user releases the
reference by calling ib_release_qp.  This releases any access to the QP
from a user above verbs, but allows the QP to continue to exist until
destroyed by the XRCD.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:20:27 -07:00
Sean Hefty
b42b63cf0d RDMA/core: Add XRC QPs
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).

XRC communication is between an initiator (INI) QP and a target (TGT)
QP.  Target QPs are associated with SRQs through an XRCD.  An XRC TGT QP
behaves like a receive-only RD QP.  XRC INI QPs behave similarly to RC
QPs, except that work requests posted to an XRC INI QP must specify the
remote SRQ that is the target of the work request.

We define two new QP types for XRC, to distinguish between INI and TGT
QPs, and update the core layer to support XRC QPs.

This patch is derived from work by Jack Morgenstein
<jackm@dev.mellanox.co.il>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:16:19 -07:00
Sean Hefty
418d51307d RDMA/core: Add XRC SRQ type
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).

XRC defines SRQs that are specifically used by XRC connections.  Expand
the SRQ code to support XRC SRQs.  An XRC SRQ is currently restricted to
only XRC use according to the IB XRC Annex.

Portions of this patch were derived from work by
Jack Morgenstein <jackm@dev.mellanox.co.il>.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:14:31 -07:00
Sean Hefty
96104eda01 RDMA/core: Add SRQ type field
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second.  Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13 09:13:26 -07:00
Sean Hefty
59991f94eb RDMA/core: Add XRC domain support
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).

A few new concepts are introduced to support this.  This patch adds:

 - A new device capability flag, IB_DEVICE_XRC, which low-level
   drivers set to indicate that a device supports XRC.
 - A new object type, XRC domains (struct ib_xrcd), and new verbs
   ib_alloc_xrcd()/ib_dealloc_xrcd().  XRCDs are used to limit which
   XRC SRQs an incoming message can target.

This patch is derived from work by Jack Morgenstein <jackm@dev.mellanox.co.il>.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-12 10:32:26 -07:00
Marcel Apfelbaum
71eeba161d IB: Add new InfiniBand link speeds
Introduce support for the following extended speeds:

FDR-10: a Mellanox proprietary link speed which is 10.3125 Gbps with
        64b/66b encoding rather than 8b/10b encoding.
FDR:    IBA extended speed 14.0625 Gbps.
EDR:    IBA extended speed 25.78125 Gbps.

Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-11 11:53:47 -07:00
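To make the encoding remark in 71eeba161d above concrete, the sketch below computes the payload rate left after line encoding. The helper name is hypothetical. The commit message states 64b/66b only for FDR-10; applying it to EDR as well is an assumption of this example.

```c
#include <stdint.h>

/*
 * Payload rate in kb/s after line encoding, where data_bits payload bits
 * are carried in every coded_bits bits on the wire (integer division).
 */
static uint64_t demo_data_rate_kbps(uint64_t signal_kbps,
                                    unsigned int data_bits,
                                    unsigned int coded_bits)
{
    return signal_kbps * data_bits / coded_bits;
}
```

With 64b/66b, the 10.3125 Gbps FDR-10 lane carries 10 Gbps of data, whereas a 10 Gbps lane with 8b/10b encoding carries only 8 Gbps.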
Kumar Sanghvi
3ebeebc38b RDMA/iwcm: Propagate ird/ord values upwards
Update struct iw_cm_event to support propagating the ird/ord values
upwards to the application.

Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-06 09:37:52 -07:00
Jiri Pirko
4bc71cb983 net: consolidate and fix ethtool_ops->get_settings calling
This patch does several things:
- introduces __ethtool_get_settings which is called from ethtool code and
  from drivers as well. Put ASSERT_RTNL there.
- dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
- changes calling in drivers so rtnl locking is respected. In
  iboe_get_rate(), ->get_settings() was previously called unlocked. This
  fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
  problem. Also fixed by calling __dev_get_by_index() instead of
  dev_get_by_index() and holding rtnl_lock for both calls.
- introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
  so bnx2fc_if_create() and fcoe_if_create() are called locked as they
  are from other places.
- use __ethtool_get_settings() in bonding code

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v2->v3:
	-removed dev_ethtool_get_settings()
	-added ASSERT_RTNL into __ethtool_get_settings()
	-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
	 around it and __ethtool_get_settings() call
v1->v2:
        add missing export_symbol
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 17:32:26 -04:00
Arun Sharma
60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Or Gerlitz
6aea213a62 IB/pma: Add include file for IBA performance counters definitions
Move the various definitions and mad structures needed for software
implementation of IBA PM agent from the ipath and qib drivers into a
single include file, which in turn could be used by more consumers.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18 21:04:35 -07:00
Or Gerlitz
761d90ed4c IB/core: Add GID change event
Add IB GID change event type.  This is needed for IBoE when the HW
driver updates the GID table (e.g. when new VLANs are added/deleted)
and the change should be reflected in the IB core cache.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18 21:04:30 -07:00
Linus Torvalds
4c171acc20 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  RDMA/cma: Save PID of ID's owner
  RDMA/cma: Add support for netlink statistics export
  RDMA/cma: Pass QP type into rdma_create_id()
  RDMA: Update exported headers list
  RDMA/cma: Export enum cma_state in <rdma/rdma_cm.h>
  RDMA/nes: Add a check for strict_strtoul()
  RDMA/cxgb3: Don't post zero-byte read if endpoint is going away
  RDMA/cxgb4: Use completion objects for event blocking
  IB/srp: Fix integer -> pointer cast warnings
  IB: Add devnode methods to cm_class and umad_class
  IB/mad: Return EPROTONOSUPPORT when an RDMA device lacks the QP required
  IB/uverbs: Add devnode method to set path/mode
  RDMA/ucma: Add .nodename/.mode to tell userspace where to create device node
  RDMA: Add netlink infrastructure
  RDMA: Add error handling to ib_core_init()
2011-05-26 12:13:57 -07:00
Nir Muchtar
753f618ae0 RDMA/cma: Add support for netlink statistics export
Add callbacks and data types for statistics export of all current
devices/ids.  The schema for RDMA CM is a series of netlink messages.
Each one contains an rdma_cm_stat struct.  Additionally, two netlink
attributes are created for the addresses for each message (if
applicable).

The attribute types used are:
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR (The source address for this ID)
RDMA_NL_RDMA_CM_ATTR_DST_ADDR (The destination address for this ID)
sockaddr_* structs are encapsulated within these attributes.

In other words, every transaction contains a series of messages like:

-------message 1-------
struct rdma_cm_id_stats {
       __u32 qp_num;
       __u32 bound_dev_if;
       __u32 port_space;
       __s32 pid;
       __u8 cm_state;
       __u8 node_type;
       __u8 port_num;
       __u8 reserved;
}
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute - contains the source address
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute - contains the destination address
-------end 1-------
-------message 2-------
struct rdma_cm_id_stats
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute
-------end 2-------

Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
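The fixed part of each message described in 753f618ae0 above can be mirrored in user space as shown below. This is a sketch: the struct name is hypothetical and the standard fixed-width types stand in for the kernel's __u32/__s32/__u8, but the field order follows the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the per-ID statistics record (hypothetical name). */
struct demo_rdma_cm_id_stats {
    uint32_t qp_num;
    uint32_t bound_dev_if;
    uint32_t port_space;
    int32_t  pid;
    uint8_t  cm_state;
    uint8_t  node_type;
    uint8_t  port_num;
    uint8_t  reserved; /* pads the record to a multiple of 4 bytes */
};

/* Four 32-bit fields plus four 8-bit fields: 20 bytes with no padding. */
_Static_assert(sizeof(struct demo_rdma_cm_id_stats) == 20,
               "unexpected padding in demo_rdma_cm_id_stats");
```

The explicit reserved byte keeps the layout free of compiler-inserted padding, which matters for a record that crosses the kernel/user boundary.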
Sean Hefty
b26f9b9949 RDMA/cma: Pass QP type into rdma_create_id()
The RDMA CM currently infers the QP type from the port space selected
by the user.  In the future (eg with RDMA_PS_IB or XRC), there may not
be a 1-1 correspondence between port space and QP type.  For netlink
export of RDMA CM state, we want to export the QP type to userspace,
so it is cleaner to explicitly associate a QP type to an ID.

Modify rdma_create_id() to allow the user to specify the QP type, and
use it to make our selections of datagram versus connected mode.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
Roland Dreier
9a7147b506 RDMA: Update exported headers list
Various RDMA headers are intended to be exported to userspace, so add
them to the headers-y list.  Add a (strictly speaking, superfluous)
include of <linux/types.h> to avoid a headers_check warning.

Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
Nir Muchtar
550e5ca77e RDMA/cma: Export enum cma_state in <rdma/rdma_cm.h>
Move cma.c's internal definition of enum cma_state to enum rdma_cm_state
in an exported header so that it can be exported via RDMA netlink.

Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:22 -07:00
Linus Torvalds
06f4e926d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
  macvlan: fix panic if lowerdev in a bond
  tg3: Add braces around 5906 workaround.
  tg3: Fix NETIF_F_LOOPBACK error
  macvlan: remove one synchronize_rcu() call
  networking: NET_CLS_ROUTE4 depends on INET
  irda: Fix error propagation in ircomm_lmp_connect_response()
  irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
  irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
  be2net: Kill set but unused variable 'req' in lancer_fw_download()
  irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
  atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
  rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
  pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
  isdn: capi: Use pr_debug() instead of ifdefs.
  tg3: Update version to 3.119
  tg3: Apply rx_discards fix to 5719/5720
  ...

Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
2011-05-20 13:43:21 -07:00
Roland Dreier
b2cbae2c24 RDMA: Add netlink infrastructure
Add basic RDMA netlink infrastructure that allows for registration of
RDMA clients for which data is to be exported and supplies message
construction callbacks.

Signed-off-by: Nir Muchtar <nirm@voltaire.com>

[ Reorganize a few things, add CONFIG_NET dependency.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-20 11:46:11 -07:00
Roland Dreier
d0c49bf391 RDMA/iwcm: Get rid of enum iw_cm_event_status
The IW_CM_EVENT_STATUS_xxx values were used in only a couple of places;
cma.c uses -Exxx values instead, and so do the amso1100, cxgb3 and cxgb4
drivers -- only nes was using the enum values (with the mild consequence
that all nes connection failures were treated as generic errors rather
than reported as timeouts or rejections).

We can fix this confusion by getting rid of enum iw_cm_event_status and
using a plain int for struct iw_cm_event.status, and converting nes to
use -Exxx as the other iWARP drivers do.

This also gets rid of the warning

    drivers/infiniband/core/cma.c: In function 'cma_iw_handler':
    drivers/infiniband/core/cma.c:1333:3: warning: case value '4294967185' not in enumerated type 'enum iw_cm_event_status'
    drivers/infiniband/core/cma.c:1336:3: warning: case value '4294967186' not in enumerated type 'enum iw_cm_event_status'
    drivers/infiniband/core/cma.c:1332:3: warning: case value '4294967192' not in enumerated type 'enum iw_cm_event_status'

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Faisal Latif <faisal.latif@intel.com>
2011-05-09 22:23:57 -07:00
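With a plain int status as in d0c49bf391 above, a consumer can switch on negative errno values directly. The sketch below is illustrative only: the enum and function names are hypothetical, and the particular mapping of errno values to outcomes is an assumption.

```c
#include <errno.h>

/* Hypothetical outcome classes for a connection attempt. */
enum demo_conn_result {
    DEMO_CONN_OK,
    DEMO_CONN_REJECTED,
    DEMO_CONN_TIMEOUT,
    DEMO_CONN_ERROR,
};

/* Classifies an event status carried as 0 or a negative errno value. */
static enum demo_conn_result demo_classify_status(int status)
{
    switch (status) {
    case 0:
        return DEMO_CONN_OK;
    case -ECONNRESET:
    case -ECONNREFUSED:
        return DEMO_CONN_REJECTED;
    case -ETIMEDOUT:
        return DEMO_CONN_TIMEOUT;
    default:
        return DEMO_CONN_ERROR; /* any other failure is generic */
    }
}
```

The large case values in the compiler warning quoted above are negative values reinterpreted as unsigned 32-bit integers: 4294967185, 4294967186 and 4294967192 are 2^32 minus 111, 110 and 104 respectively.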
Hefty, Sean
a9bb79128a RDMA/cma: Add an ID_REUSEADDR option
Lustre requires that clients bind to a privileged port number before
connecting to a remote server.  On larger clusters (typically more
than about 1000 nodes), the number of privileged ports is exhausted,
resulting in lustre being unusable.

To handle this, we add support for reusable addresses to the rdma_cm.
This mimics the behavior of the socket option SO_REUSEADDR.  A user
may set an rdma_cm_id to reuse an address before calling
rdma_bind_addr() (explicitly or implicitly).  If set, other
rdma_cm_id's may be bound to the same address, provided that they all
have reuse enabled, and there are no active listens.

If rdma_listen() is called on an rdma_cm_id that has reuse enabled, it
will only succeed if there are no other id's bound to that same
address.  The reuse option is exported to user space.  The behavior of
the kernel reuse implementation was verified against that given by
sockets.

This patch is derived from a patch by Ira Weiny <weiny2@llnl.gov>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:10 -07:00
David Decotigny
8ae6daca85 ethtool: Call ethtool's get/set_settings callbacks with cleaned data
This makes sure that when a driver calls the ethtool's
get/set_settings() callback of another driver, the data passed to it
is clean. This guarantees that speed_hi will be zeroed correctly if
the called callback doesn't explicitly set it: we are sure we don't
get a corrupted speed from the underlying driver. We also take care of
setting the cmd field appropriately (ETHTOOL_GSET/SSET).

This applies to dev_ethtool_get_settings(), which now makes sure it
sets up that ethtool command parameter correctly before passing it to
drivers. This also means that whoever calls dev_ethtool_get_settings()
does not have to clean the ethtool command parameter. This function
also becomes an exported symbol instead of an inline.

All drivers visible to make allyesconfig under x86_64 have been
updated.

Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-29 14:01:30 -07:00
Tejun Heo
f06267104d RDMA: Update workqueue usage
* ib_wq is added, which is used as the common workqueue for infiniband
  instead of the system workqueue.  All system workqueue usages
  including flush_scheduled_work() callers are converted to use and
  flush ib_wq.

* cancel_delayed_work() + flush_scheduled_work() converted to
  cancel_delayed_work_sync().

* qib_wq is removed and ib_wq is used instead.

This is to prepare for deprecation of flush_scheduled_work().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2011-01-16 21:16:31 -08:00
Eli Cohen
af7bd46376 IB/core: Add VLAN support for IBoE
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:

    GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.

The 3 bits user priority field of the packets are identical to the 3
bits of the SL.

In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits, effectively taking the 3 MS bits
of the TOS field.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-25 10:20:39 -07:00
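The two mappings described in af7bd46376 above can be sketched as small helpers. The function names are hypothetical, and the byte order (GID[11] as the high byte, GID[12] as the low byte of the VLAN ID) is an assumption; the commit message only says that the two bytes contain the VLAN ID.

```c
#include <stdint.h>

/* SL is taken from the 3 most significant bits of the TOS field. */
static uint8_t demo_tos_to_sl(uint8_t tos)
{
    return (uint8_t)(tos >> 5);
}

/* VLAN ID from a 16-byte GID; assumes GID[11] is the high byte. */
static uint16_t demo_gid_vlan_id(const uint8_t gid[16])
{
    return (uint16_t)((gid[11] << 8) | gid[12]);
}
```

For example, a TOS of 0xE0 yields SL 7, which also becomes the 3-bit user priority of the 802.1q tag.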
Eli Cohen
2420b60b1d IB/uverbs: Return link layer type to userspace for query port operation
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-25 10:20:39 -07:00
Eli Cohen
ff7f5aab35 IB/pack: IBoE UD packet packing support
Add support for packing IBoE packet headers.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>

[ Clean up and fix ib_ud_header_init() a bit.  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-14 12:41:29 -07:00
Eli Cohen
3c86aa70bf RDMA/cm: Add RDMA CM support for IBoE devices
Add support for IBoE device binding and IP --> GID resolution.  Path
resolving and multicast joining are implemented within cma.c by
filling in the responses and running callbacks in the CMA work queue.

IP --> GID resolution always yields IPv6 link local addresses; remote
GIDs are derived from the destination MAC address of the remote port.
Multicast GIDs are always mapped to multicast MACs as is done in IPv6.
(IPv4 multicast is enabled by translating IPv4 multicast addresses to
IPv6 multicast as described in
<http://www.mail-archive.com/ipng@sunroof.eng.sun.com/msg02134.html>.)

Some helper functions are added to ib_addr.h.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-13 15:46:43 -07:00
Eli Cohen
a3f5adaf49 IB/core: Add link layer property to ports
This patch allows ports to have different link layers:
IB_LINK_LAYER_INFINIBAND or IB_LINK_LAYER_ETHERNET.  This is required
for adding IBoE (InfiniBand-over-Ethernet, aka RoCE) support.  For
devices that do not provide an implementation for querying the link
layer property of a port, we return a default value based on the
transport: RDMA_TRANSPORT_IB nodes will return IB_LINK_LAYER_INFINIBAND
and RDMA_TRANSPORT_IWARP nodes will return IB_LINK_LAYER_ETHERNET.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-09-27 17:51:10 -07:00
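The default described in a3f5adaf49 above amounts to a small mapping from transport to link layer. The sketch below uses hypothetical DEMO_ enums rather than the kernel's definitions, and the "unspecified" fallback value is an assumption.

```c
/* Hypothetical stand-ins for the transport and link layer enums. */
enum demo_transport {
    DEMO_TRANSPORT_IB,
    DEMO_TRANSPORT_IWARP,
};

enum demo_link_layer {
    DEMO_LINK_LAYER_UNSPECIFIED,
    DEMO_LINK_LAYER_INFINIBAND,
    DEMO_LINK_LAYER_ETHERNET,
};

/* Default used when a driver provides no link layer query of its own. */
static enum demo_link_layer demo_default_link_layer(enum demo_transport t)
{
    switch (t) {
    case DEMO_TRANSPORT_IB:
        return DEMO_LINK_LAYER_INFINIBAND;
    case DEMO_TRANSPORT_IWARP:
        return DEMO_LINK_LAYER_ETHERNET;
    }
    return DEMO_LINK_LAYER_UNSPECIFIED; /* assumed fallback */
}
```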
Aleksey Senin
a2ebf07ae5 IB: Rename RAW_ETY to RAW_ETHERTYPE
Change abbreviated IB_QPT_RAW_ETY to IB_QPT_RAW_ETHERTYPE to make
the special QP type easier to understand.

cf http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04530.html

Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-08-04 10:44:19 -07:00
Ralph Campbell
9a6edb60ec IB/core: Allow device-specific per-port sysfs files
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs.  This allows
low-level device drivers to create files in

    /sys/class/infiniband/<hca>/ports/<N>/

without having to poke through the internals of the RDMA sysfs handling.

There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.

Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-21 10:34:44 -07:00
Vladimir Sokolovsky
5e80ba8ff0 IB/core: Add support for masked atomic operations
 - Add new IB_WR_MASKED_ATOMIC_CMP_AND_SWP and IB_WR_MASKED_ATOMIC_FETCH_AND_ADD
   send opcodes that can be used to post "masked atomic compare and
   swap" and "masked atomic fetch and add" work requests, respectively.
 - Add masked_atomic_cap capability.
 - Add mask fields to atomic struct of ib_send_wr.
 - Add new opcodes to ib_wc_opcode.

The new operations are described more precisely below:

* Masked Compare and Swap (MskCmpSwap)

The MskCmpSwap atomic operation is an extension to the CmpSwap
operation defined in the IB spec.  MskCmpSwap allows the user to
select a portion of the 64 bit target data for the “compare” check as
well as to restrict the swap to a (possibly different) portion.  The
pseudo code below describes the operation:

| atomic_response = *va
| if (!((compare_add ^ *va) & compare_add_mask)) then
|     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
|
| return atomic_response

The additional operands are carried in the Extended Transport Header.
Atomic response generation and packet format for MskCmpSwap is as for
standard IB Atomic operations.

* Masked Fetch and Add (MFetchAdd)

The MFetchAdd Atomic operation extends the functionality of the
standard IB FetchAdd by allowing the user to split the target into
multiple fields of selectable length. The atomic add is done
independently on each one of these fields. A bit set in the
field_boundary parameter specifies the field boundaries. The pseudo
code below describes the operation:

| bit_adder(ci, b1, b2, *co)
| {
|	value = ci + b1 + b2
|	*co = !!(value & 2)
|
|	return value & 1
| }
|
| #define MASK_IS_SET(mask, attr)      (!!((mask)&(attr)))
| bit_position = 1
| carry = 0
| atomic_response = 0
|
| for i = 0 to 63
| {
|         if ( i != 0 )
|                 bit_position =  bit_position << 1
|
|         bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
|                                 MASK_IS_SET(compare_add, bit_position), &new_carry)
|         if (bit_add_res)
|                 atomic_response |= bit_position
|
|         carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
| }
|
| return atomic_response

Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-04-21 16:37:48 -07:00
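The pseudocode in 5e80ba8ff0 above translates fairly directly into C. The sketch below is a software model only, not driver code: the function names are hypothetical, and demo_masked_add() returns the field-wise sum that the bit-adder loop computes, leaving the question of what is written back to memory aside.

```c
#include <stdint.h>

/*
 * Masked compare and swap: compare only the bits selected by compare_mask,
 * and on a match replace only the bits selected by swap_mask.
 * Returns the value found at *va before the operation.
 */
static uint64_t demo_masked_cmp_swap(uint64_t *va, uint64_t compare,
                                     uint64_t compare_mask,
                                     uint64_t swap, uint64_t swap_mask)
{
    uint64_t old = *va;

    if (!((compare ^ old) & compare_mask))
        *va = (old & ~swap_mask) | (swap & swap_mask);
    return old;
}

/*
 * Masked add: a ripple-carry addition in which a bit set in boundary
 * marks the top bit of a field, so no carry leaves that bit.
 */
static uint64_t demo_masked_add(uint64_t val, uint64_t add, uint64_t boundary)
{
    uint64_t result = 0;
    unsigned int carry = 0;

    for (int i = 0; i < 64; i++) {
        uint64_t bit = 1ULL << i;
        unsigned int sum = carry + !!(val & bit) + !!(add & bit);

        if (sum & 1)
            result |= bit;
        /* Propagate the carry unless this bit closes a field. */
        carry = (sum >> 1) && !(boundary & bit);
    }
    return result;
}
```

With boundary set to 0x80, adding 1 to 0xFF wraps the low byte to 0 without touching bit 8, whereas with no boundary the same addition gives 0x100.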
Roland Dreier
fe8875e5a4 Merge branch 'misc' into for-next
Conflicts:
	drivers/infiniband/core/uverbs_main.c
2010-03-01 23:52:31 -08:00
Roland Dreier
e8094e667a Merge branch 'cma' into for-next 2010-03-01 23:51:54 -08:00
Eli Cohen
920d706c89 IB/core: Fix and clean up ib_ud_header_init()
ib_ud_header_init() first clears header and then fills up the various
fields.  Later on, it tests header->immediate_present, which it has
already cleared, so the condition is always false.  Fix this by adding
an immediate_present parameter and setting header->immediate_present
as is done with grh_present.  Also remove unused calculation of
header_len.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-02-24 14:54:10 -08:00
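The bug pattern fixed by 920d706c89 above is easy to reproduce in miniature: a field is tested after the whole struct has been zeroed, so the test can never be true. The sketch below shows the corrected shape, with the value passed in as a parameter. All names are hypothetical and the struct is reduced to the fields needed for the example.

```c
#include <string.h>

/* Illustrative opcodes; not the IB opcode values. */
enum demo_opcode {
    DEMO_OP_SEND_ONLY,
    DEMO_OP_SEND_ONLY_WITH_IMM,
};

/* Hypothetical, trimmed-down UD header. */
struct demo_ud_header {
    int grh_present;
    int immediate_present;
    enum demo_opcode opcode;
};

/*
 * The caller passes immediate_present explicitly. Reading it from the
 * header after the memset() would always yield 0.
 */
static void demo_ud_header_init(int grh_present, int immediate_present,
                                struct demo_ud_header *header)
{
    memset(header, 0, sizeof(*header));
    header->grh_present = grh_present;
    header->immediate_present = immediate_present;
    header->opcode = header->immediate_present ?
                     DEMO_OP_SEND_ONLY_WITH_IMM : DEMO_OP_SEND_ONLY;
}
```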
Alexander Chiang
17a55f79fd IB/core: Pack struct ib_device a little tighter
A small change to reduce the size of ib_device to 1112 bytes
(from 1128).

Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-02-24 10:23:49 -08:00
Sean Hefty
cf4f7e8c47 RDMA/cm: Remove unused definition of RDMA_PS_SCTP
The defined SCTP number is incorrect (0x83, rather than 0x84), and
since it is not used anywhere, simply remove the definition.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-02-11 15:40:25 -08:00
Roland Dreier
14f369d1d6 Merge branches 'amso1100', 'cma', 'cxgb3', 'ehca', 'ipath', 'ipoib', 'iser', 'misc', 'mlx4' and 'nes' into for-next 2009-12-15 23:39:25 -08:00
Bart Van Assche
55464d461b IB: Clarify the documentation of ib_post_send()
Clarify the behavior of ib_post_send() when a list of work requests is
passed in and an immediate error is returned.

Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-12-09 14:20:04 -08:00
Sean Hefty
6f8372b69c RDMA/cm: fix loopback address support
The RDMA CM is intended to support the use of a loopback address
when establishing a connection; however, the behavior of the CM
when loopback addresses are used is confusing and does not always
work, depending on whether loopback was specified by the server,
the client, or both.

The defined behavior of rdma_bind_addr is to associate an RDMA
device with an rdma_cm_id, as long as the user specified a non-
zero address (i.e. they weren't just trying to reserve a port).
Currently, if the loopback address is passed to rdma_bind_addr,
no device is associated with the rdma_cm_id.  Fix this.

If a loopback address is specified by the client as the destination
address for a connection, it will fail to establish a connection.
This is true even if the server is listening across all addresses or
on the loopback address itself.  The issue is that the server tries
to translate the IP address carried in the REQ message to a local
net_device address, which fails.  The translation is not needed in
this case, since the REQ carries the actual HW address that should
be used.

Finally, clean up loopback support to be more transport neutral.
Replace separate calls to get/set the sgid and dgid from the
device address to a single call that behaves correctly depending
on the format of the device address.  And support both IPv4 and
IPv6 address formats.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ Fixed RDS build by s/ib_addr_get/rdma_addr_get/  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 13:26:06 -08:00
Sean Hefty
c4315d85f9 IB/addr: Store net_device type instead of translating to RDMA transport
The struct rdma_dev_addr stores net_device address information:
the source device address, destination hardware address, and
broadcast address.  For consistency, store the net_device type
rather than converting it to the rdma_node_type.

The type indicates the format of the various hardware addresses,
which is what we're concerned with, and not the RDMA node type
that the address may map to.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 12:57:18 -08:00
Sean Hefty
6266ed6e41 RDMA/cma: Replace net_device pointer with index
Provide the device interface when resolving route information to
ensure that the correct outbound device is used.  This will also
simplify processing of sin6_scope_id for IPv6 support.

Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthrope@obsidianresearch.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 12:55:22 -08:00
Sean Hefty
a7ca1f00ed RDMA/ucma: Add option to manually set IB path
Export rdma_set_ib_paths to user space to allow applications to
manually set the IB path used for connections.  This allows
alternative ways for a user space application or library to obtain
path record information, including retrieving path information
from cached data, avoiding direct interaction with the IB SA.
The IB SA is a single, centralized entity that can limit scaling
on large clusters running MPI applications.

Future changes to the rdma cm can expand on this framework to
support the full range of features allowed by the IB CM, such as
separate forward and reverse paths and APM.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-16 09:30:33 -08:00
Anand Gadiyar
fd589a8f0a trivial: fix typo "to to" in multiple files
Signed-off-by: Anand Gadiyar <gadiyar@ti.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-21 15:14:55 +02:00
Linus Torvalds
13220a94d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits)
  ixgbe: Allow Priority Flow Control settings to survive a device reset
  net: core: remove unneeded include in net/core/utils.c.
  e1000e: update version number
  e1000e: fix close interrupt race
  e1000e: fix loss of multicast packets
  e1000e: commonize tx cleanup routine to match e1000 & igb
  netfilter: fix nf_logger name in ebt_ulog.
  netfilter: fix warning in ebt_ulog init function.
  netfilter: fix warning about invalid const usage
  e1000: fix close race with interrupt
  e1000: cleanup clean_tx_irq routine so that it completely cleans ring
  e1000: fix tx hang detect logic and address dma mapping issues
  bridge: bad error handling when adding invalid ether address
  bonding: select current active slave when enslaving device for mode tlb and alb
  gianfar: reallocate skb when headroom is not enough for fcb
  Bump release date to 25Mar2009 and version to 0.22
  r6040: Fix second PHY address
  qeth: fix wait_event_timeout handling
  qeth: check for completion of a running recovery
  qeth: unregister MAC addresses during recovery.
  ...

Manually fixed up conflicts in:
	drivers/infiniband/hw/cxgb3/cxio_hal.h
	drivers/infiniband/hw/nes/nes_nic.c
2009-03-26 15:54:36 -07:00
Roland Dreier
09f98bafea Merge branches 'cxgb3', 'endian', 'ipath', 'ipoib', 'iser', 'mad', 'misc', 'mlx4', 'mthca', 'nes' and 'sysfs' into for-next 2009-03-24 20:44:41 -07:00
Ramachandra K
7020cb0fe2 IB/mad: Fix RMPP header RRespTime manipulation
Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.  In
the 8-bit field of the RMPP header, the first 5 bits are RRespTime and
next 3 bits are RMPPFlags. Hence to retain the first 5 bits, the mask
should be 0xF8 instead of 0xF1.

ack_recv()-->format_ack() calls ib_set_rmpp_flags() and due to the
incorrect ANDing with 0xF1, RRespTime got changed incorrectly and RMPP
Acks sent back always had a RRespTime of 0x1E (30) which caused the
other end to consider the time outs to be approximately 4297 seconds
(i.e. in the order of 4*2^30) instead of the usual ~4 seconds (order
of 4*2^20).
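
A small sketch of the two masks makes the effect visible. The helper names are hypothetical, the OR with the low three flag bits is an assumption about the surrounding expression, and the starting RRespTime of 0x1F is chosen only so that the wrong mask produces the 0x1E value described above.

```c
#include <stdint.h>

/* RRespTime occupies the upper 5 bits, RMPPFlags the lower 3 bits. */

/* Wrong mask: 0xF1 clears bit 3, the lowest bit of RRespTime. */
static uint8_t set_flags_buggy(uint8_t rtime_flags, uint8_t flags)
{
	return (rtime_flags & 0xF1) | (flags & 0x7);
}

/* Correct mask: 0xF8 keeps all 5 RRespTime bits and clears the flags. */
static uint8_t set_flags_fixed(uint8_t rtime_flags, uint8_t flags)
{
	return (rtime_flags & 0xF8) | (flags & 0x7);
}

/* Extract the 5-bit RRespTime field. */
static uint8_t resp_time(uint8_t rtime_flags)
{
	return rtime_flags >> 3;
}
```

Starting from RRespTime 0x1F, the 0xF1 mask yields 0x1E while the 0xF8 mask leaves 0x1F untouched.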

Signed-off-by: Ramachandra K <ramachandra.kuchimanchi@qlogic.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-02-27 10:33:12 -08:00
Harvey Harrison
f3a7c66b5c net: replace __constant_{endian} uses in net headers
Base versions handle constant folding now.  For headers exposed to
userspace, we must only expose the __ prefixed versions.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-14 22:58:35 -08:00
Harvey Harrison
9c3da09917 IB: Remove __constant_{endian} uses
The base versions handle constant folding just fine, use them
directly.  The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.

This patch does not affect code generation at all.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-01-17 17:11:57 -08:00
Roland Dreier
3f44675439 RDMA/cma: Remove padding arrays by using struct sockaddr_storage
There are a few places where the RDMA CM code handles IPv6 by doing

	struct sockaddr		addr;
	u8			pad[sizeof(struct sockaddr_in6) -
				    sizeof(struct sockaddr)];

This is fragile and ugly; handle this in a better way with just

	struct sockaddr_storage	addr;

[ Also roll in patch from Aleksey Senin <alekseys@voltaire.com> to
  switch to struct sockaddr_storage and get rid of padding arrays in
  struct rdma_addr. ]
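
As a userspace sketch of why the second form is more robust: struct sockaddr_storage is defined to be large enough for any supported address family, so no hand-computed padding is needed. The two wrapper structs and store_in6() are hypothetical names used only for this illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Old pattern: a sockaddr followed by manually sized padding. */
struct addr_padded_sketch {
	struct sockaddr addr;
	unsigned char pad[sizeof(struct sockaddr_in6) -
			  sizeof(struct sockaddr)];
};

/* New pattern: one member that can hold any address family. */
struct addr_storage_sketch {
	struct sockaddr_storage addr;
};

/* Checked at compile time instead of relying on the pad arithmetic. */
_Static_assert(sizeof(struct sockaddr_storage) >= sizeof(struct sockaddr_in6),
	       "sockaddr_storage must be able to hold an IPv6 address");

/* Copy an IPv6 address into the storage member. */
static void store_in6(struct addr_storage_sketch *dst,
		      const struct sockaddr_in6 *src)
{
	memset(&dst->addr, 0, sizeof(dst->addr));
	memcpy(&dst->addr, src, sizeof(*src));
}
```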

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-08-04 11:02:14 -07:00
FUJITA Tomonori
8d8bb39b9e dma-mapping: add the device argument to dma_mapping_error()
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:

This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).

I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread).  So I
CC'ed this to KVM camp.  Comments are appreciated.

A pointer to dma_mapping_ops to struct dev_archdata is added.  If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it.  If it's
NULL, the system-wide dma_ops pointer is used as before.

If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging).  It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.

The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations.  So x86 can't have dma_mapping_ops per
device.  Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.

The first patch adds the device argument to dma_mapping_error.  The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.

This patch:

dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations.  So we can't have dma_mapping_ops per device.

Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function.  x86 IOMMUs use device
argument.

[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:03 -07:00
Amir Vadai
38ca83a588 RDMA/cma: Add RDMA_CM_EVENT_TIMEWAIT_EXIT event
Consumers that want to re-use their QPs in new connections need to
know when the QP has exited the timewait state.  Report the timewait
event through the rdma_cm.

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-22 14:14:23 -07:00
Or Gerlitz
dd5bdff83b RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event
Add an RDMA_CM_EVENT_ADDR_CHANGE event that can be used by rdma-cm
consumers that wish to have their RDMA sessions always use the same
links (eg <hca/port>) as the IP stack does.  In the current code, this
does not happen when bonding is used and fail-over happened but the IB
link used by an already existing session is operating fine.

Use the netevent notification for sensing that a change has happened
in the IP stack, then scan the rdma-cm ID list to see if there is an
ID that is "misaligned" with respect to the IP stack, and deliver
RDMA_CM_EVENT_ADDR_CHANGE for this ID.  The consumer can act on the
event or just ignore it.

Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-22 14:14:22 -07:00
Or Gerlitz
64c5e613b9 RDMA/addr: Keep pointer to netdevice in struct rdma_dev_addr
Keep a pointer to the local (src) netdevice in struct rdma_dev_addr,
and copy it in as part of rdma_copy_addr().  Use rdma_translate_ip()
in cma_new_conn_id() to reduce some code duplication and also make
sure the src_dev member gets set.

In a high-availability configuration the netdevice pointer can be used
by the RDMA CM to align RDMA sessions to use the same links as the IP
stack does under fail-over and route change cases.

Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:53 -07:00
Steve Wise
96f15c0353 RDMA/core: Add local DMA L_Key support
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
  IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
  STag support and IB HCAs to indicate reserved L_Key support.

- Add a u32 local_dma_lkey member to struct ib_device.  Drivers fill
  this in with the appropriate local DMA L_Key (if they support it).

- Fix up the drivers using this flag.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:53 -07:00
Ron Livne
47ee1b9f2e IB/core: Add support for multicast loopback blocking
This patch also adds a creation flag for QPs,
IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK, which when set means that
multicast sends from the QP to a group that the QP is attached to will
not be looped back to the QP's receive queue.  This can be used to
save receive resources when a consumer does not want a local copy of
multicast traffic; for example IPoIB must waste CPU time throwing away
such local copies of multicast traffic.

This patch also adds a device capability flag that shows whether a
device supports this feature or not.

Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:48 -07:00
Steve Wise
7f624d023b RDMA/core: Add iWARP protocol statistics attributes in sysfs
This patch adds a sysfs attribute group called "proto_stats" under
/sys/class/infiniband/$device/ and populates this group with protocol
statistics if they exist for a given device.  Currently, only iWARP
stats are defined, but the code is designed to allow InfiniBand
protocol stats if they become available.  These stats are per-device
and more importantly -not- per port.

Details:

- Add union rdma_protocol_stats in ib_verbs.h.  This union allows
  defining transport-specific stats.  Currently only iwarp stats are
  defined.

- Add struct iw_protocol_stats to define the current set of iwarp
  protocol stats.

- Add new ib_device method called get_proto_stats() to return protocol
  statistics.

- Add logic in core/sysfs.c to create iwarp protocol stats attributes
  if the device is an RNIC and has a get_proto_stats() method.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:48 -07:00
Steve Wise
00f7ec36c9 RDMA/core: Add memory management extensions support
This patch adds support for the IB "base memory management extension"
(BMME) and the equivalent iWARP operations (which the iWARP verbs
mandate that all devices implement).  The new operations are:

 - Allocate an ib_mr for use in fast register work requests.

 - Allocate/free physical buffer lists for use in fast register work
   requests.  This allows device drivers to allocate this memory as
   needed for use in posting send requests (eg via dma_alloc_coherent).

 - New send queue work requests:
   * send with remote invalidate
   * fast register memory region
   * local invalidate memory region
   * RDMA read with invalidate local memory region (iWARP only)

Consumer interface details:

 - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
   to indicate device support for these features.

 - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
   IB_WR_RDMA_READ_WITH_INV are added.

 - A new consumer API function, ib_alloc_mr() is added to allocate
   fast register memory regions.

 - New consumer API functions, ib_alloc_fast_reg_page_list() and
   ib_free_fast_reg_page_list() are added to allocate and free
   device-specific memory for fast registration page lists.

 - A new consumer API function, ib_update_fast_reg_key(), is added to
   allow the key portion of the R_Key and L_Key of a fast registration
   MR to be updated.  Consumers call this if desired before posting
   an IB_WR_FAST_REG_MR work request.

Consumers can use this as follows:

 - MR is allocated with ib_alloc_mr().

 - Page list memory is allocated with ib_alloc_fast_reg_page_list().

 - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().

 - MR made VALID and bound to a specific page list via
   ib_post_send(IB_WR_FAST_REG_MR)

 - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
   ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
   invalidate operation.

 - MR is deallocated with ib_dereg_mr()

 - Page lists are deallocated via ib_free_fast_reg_page_list().

Applications can allocate a fast register MR once, and then can
repeatedly bind the MR to different physical block lists (PBLs) via
posting work requests to a send queue (SQ).  For each outstanding
MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
allocated (the fast_reg_page_list is owned by the low-level driver
from the consumer posting a work request until the request completes).
Thus pipelining can be achieved while still allowing device-specific
page_list processing.

The 32-bit fast register memory key/STag is composed of a 24-bit index
and an 8-bit key.  The application can change the key each time it
fast registers thus allowing more control over the peer's use of the
key/STag (ie it can effectively be changed each time the rkey is
rebound to a page list).
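
The key layout can be sketched with plain bit operations. This assumes the 8-bit key sits in the low byte of the 32-bit value, which the text above does not state; the helper names are hypothetical and are not the kernel API.

```c
#include <stdint.h>

/* Replace the 8-bit key, keeping the 24-bit index (assumed upper bits). */
static uint32_t update_key_sketch(uint32_t stag, uint8_t newkey)
{
	return (stag & 0xffffff00u) | newkey;
}

/* 24-bit index portion of the key/STag. */
static uint32_t stag_index(uint32_t stag)
{
	return stag >> 8;
}

/* 8-bit consumer-controlled key portion. */
static uint8_t stag_key(uint32_t stag)
{
	return stag & 0xff;
}
```

Changing only the low byte leaves the index stable, so the same MR is addressed while a peer holding the previous key value no longer matches.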

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:45 -07:00
Dotan Barak
4deccd6d95 RDMA: Improve include file coding style
Remove subversion $Id lines and improve readability by fixing other
coding style problems pointed out by checkpatch.pl.

Signed-off-by: Dotan Barak <dotanba@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:44 -07:00
Sean Hefty
a947491709 RDMA: Fix license text
The license text for several files references a third software license
that was inadvertently copied in.  Update the license to what was
intended.  This update was based on a request from HP.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-07-14 23:48:43 -07:00
Roland Dreier
4c0283fc56 IB/core: Remove IB_DEVICE_SEND_W_INV capability flag
In 2.6.26, we added some support for send with invalidate work
requests, including a device capability flag to indicate whether a
device supports such requests.  However, the support was incomplete:
the completion structure was not extended with a field for the key
contained in incoming send with invalidate requests.

Full support for memory management extensions (send with invalidate,
local invalidate, fast register through a send queue, etc) is planned
for 2.6.27.  Since send with invalidate is not very useful by itself,
just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
which makes things simpler for applications, since they will not have
quite as confusing an array of fine-grained bits to check.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-06-09 09:58:42 -07:00
Arthur Kepner
cb9fbc5c37 IB: expand ib_umem_get() prototype
Add a new parameter, dmasync, to the ib_umem_get() prototype.  Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().

Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:12 -07:00
Tony Jones
f4e91eb4a8 IB: convert struct class_device to struct device
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.

Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19 19:10:30 -07:00
Eli Cohen
2dd5716227 IB/core: Add support for modify CQ
Add support for modifying CQ parameters for controlling event
generation moderation.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-16 21:09:33 -07:00
Roland Dreier
0f39cf3d54 IB/core: Add support for "send with invalidate" work requests
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions.  Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated.  Add this new union to struct
ib_uverbs_send_wr.  Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().

Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.

Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit.  The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing).  Remove the flag from
all existing drivers that set it until we know which ones are OK.

The value chosen for the new flag is not consecutive, to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.

This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-16 21:09:32 -07:00
Eli Cohen
c93570f23a IB/core: Add IPoIB UD LSO support
LSO (large send offload) allows the networking stack to pass SKBs with
data size larger than the MTU to the IPoIB driver and have the HCA HW
fragment the data to multiple MSS-sized packets.  Add a device
capability flag IB_DEVICE_UD_TSO for devices that can perform TCP
segmentation offload, a new send work request opcode IB_WR_LSO,
header, hlen and mss fields for the work request structure, and a new
IB_WC_LSO completion type.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-16 21:09:27 -07:00
Eli Cohen
b846f25aa2 IB/core: Add creation flags to struct ib_qp_init_attr
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO.  The create_flags member will also be useful for XRC
and ehca low-latency QP support.

Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-16 21:09:27 -07:00
Roland Dreier
b3d636b0d1 IB: Make struct ib_uobject.id a signed int
IDR IDs are signed, so struct ib_uobject.id should be signed.  This
avoids some sparse pointer signedness warnings.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-16 21:01:06 -07:00
Roland Dreier
5128bdc97a IB/core: Remove unused struct ib_device.flags member
Avoid confusion about what it might mean, since it's never initialized.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-02-08 14:47:26 -08:00
Eli Cohen
e0605d9199 IB/core: Add IP checksum offload support
Add a device capability to show when it can handle checksum offload.
Also add a send flag for inserting checksums and a csum_ok field to
the completion record.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-02-08 14:37:56 -08:00
Sean Hefty
88314e4dda RDMA/cma: add support for rdma_migrate_id()
This is based on user feedback from Doug Ledford at RedHat:

Events that occur on an rdma_cm_id are reported to userspace through an
event channel.  Connection request events are reported on the event
channel associated with the listen.  When the connection is accepted, a
new rdma_cm_id is created and automatically uses the listen event
channel.  This is suboptimal where the user only wants listen events on
that channel.

Additionally, it may be desirable to have events related to connection
establishment use a different event channel than those related to
already established connections.

Allow the user to migrate an rdma_cm_id between event channels. All
pending events associated with the rdma_cm_id are moved to the new event
channel.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:32 -08:00
Sean Hefty
4fc8cd4919 IB/mad: Report number of times a mad was retried
To allow ULPs to tune timeout values and capture retry statistics,
report the number of times that a mad send operation was retried.

For RMPP mads, report the total number of times that any portion
(send window) of the send operation was retried.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:30 -08:00
Greg Kroah-Hartman
35be068198 Kobject: change drivers/infiniband to use kobject_init_and_add
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.

Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <mshefty@ichips.intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:26 -08:00
Adrian Bunk
87ae9afdca cleanup asm/scatterlist.h includes
Code that is not architecture-specific should not #include <asm/scatterlist.h>.

This patch therefore either replaces them with
#include <linux/scatterlist.h> or simply removes them if they were
unused.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:06 +01:00
Sean Hefty
de98b693e9 IB/cm: Modify interface to send MRAs in response to duplicate messages
The IB CM provides a message received acknowledged (MRA) message that
can be sent to indicate that a REQ or REP message has been received, but
will require more time to process than the timeout specified by those
messages.  In many cases, the application may not know how long it will
take to respond to a CM message, but the majority of the time, it will
usually respond before a retry has been sent.  Rather than sending an
MRA in response to all messages just to handle the case where a longer
timeout is needed, it is more efficient to queue the MRA for sending in
case a duplicate message is received.

This avoids sending an MRA when it is not needed, but limits the number
of times that a REQ or REP will be resent.  It also provides for a
simpler implementation than generating the MRA based on a timer event.
(That is, trying to send the MRA after receiving the first REQ or REP if
a response has not been generated, so that it is received at the remote
side before a duplicate REQ or REP has been received)

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:17 -07:00
Roland Dreier
a394f83bdf IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems
The declaration of struct ib_user_mad_reg_req.method_mask[] exported
to userspace was an array of __u32, but the kernel internally treated
it as a bitmap made up of longs.  This makes a difference for 64-bit
big-endian kernels, where numbering the bits in an array of __u32 gives:

    |31.....0|63....32|95....64|127...96|

while numbering the bits in an array of longs gives:

    |63..............0|127............64|

64-bit userspace can handle this by just treating method_mask[] as an
array of longs, but 32-bit userspace is really stuck: the meaning of
the bits in method_mask[] depends on whether the kernel is 32-bit or
64-bit, and there's no sane way for userspace to know that.
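
The two numberings can be compared with a short sketch that computes, for a given bit number, which 32-bit word in memory order holds it. Both function names are hypothetical; the second models a 64-bit big-endian kernel, where the high half of each long comes first in memory.

```c
/* Word index for bit n when the bitmap is numbered as an array of __u32. */
static unsigned int u32_word_for_bit(unsigned int n)
{
	return n / 32;
}

/*
 * Word index (in memory order) for bit n when the same memory is treated
 * as a bitmap of 64-bit longs on a big-endian machine.
 */
static unsigned int be64_long_word_for_bit(unsigned int n)
{
	unsigned int long_index = n / 64;
	unsigned int in_high_half = (n % 64) >= 32;

	/* The high 32 bits of each long occupy the first word in memory. */
	return 2 * long_index + (in_high_half ? 0 : 1);
}
```

Bit 0 lands in word 0 under the __u32 numbering but in word 1 under the 64-bit big-endian long numbering, which is the mismatch shown in the diagrams.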

Fix this by updating <rdma/ib_user_mad.h> to make it clear that
method_mask[] is an array of longs, and using a compat_ioctl method to
convert to an array of 64-bit longs to handle the 32-on-64 problem.
This fixes the interface description to match existing behavior (so
working binaries continue to work) in almost all situations, and gives
consistent semantics in the case of 32-bit userspace that can run on
either a 32-bit or 64-bit kernel, so that the same binary can work for
both 32-on-32 and 32-on-64 systems.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:15 -07:00
Roland Dreier
2be8e3ee8e IB/umad: Add P_Key index support
Add support for setting the P_Key index of sent MADs and getting the
P_Key index of received MADs.  This requires a change to the layout of
the ABI structure struct ib_user_mad_hdr, so to avoid breaking
compatibility, we default to the old (unchanged) ABI and add a new
ioctl IB_USER_MAD_ENABLE_PKEY that allows applications that are aware
of the new ABI to opt into using it.

We plan on switching to the new ABI by default in a year or so, and
this patch adds a warning that is printed when an application uses the
old ABI, to push people towards converting to the new ABI.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Hal Rosenstock <hal@xsigo.com>
2007-10-09 19:59:15 -07:00
Joachim Fenkes
c8d8beea03 IB/umem: Add hugetlb flag to struct ib_umem
During ib_umem_get(), determine whether all pages from the memory
region are hugetlb pages and report this in the "hugetlb" member.
Low-level drivers can use this information if they need it.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:13 -07:00
Sean Hefty
7ce86409ad RDMA/ucma: Allow user space to set service type
Export the ability to set the type of service to user space.  Model
the interface after setsockopt.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:12 -07:00
Sean Hefty
a81c994d5e RDMA/cma: Add ability to specify type of service
Provide support to specify a type of service for a communication
identifier.  A new function call is used when dealing with IPv4
addresses.  For IPv6 addresses, the ToS is specified through the
traffic class field in the sockaddr_in6 structure.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ The comments Eitan Zahavi and myself have made over the v1 post at 
  <http://lists.openfabrics.org/pipermail/general/2007-August/039247.html>
  were fully addressed. ]
 
Reviewed-by: Or Gerlitz <ogerlitz@voltaire.com> 
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:12 -07:00
Sean Hefty
733d65fe33 IB/sa: Add new QoS fields to path record
The QoS annex defines new fields for path records.  Add them to the
ib_sa for consumers that want to use them.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:12 -07:00
Dotan Barak
92ddc447ce IB: Move the macro IB_UMEM_MAX_PAGE_CHUNK() to umem.c
After moving the definition of struct ib_umem_chunk from ib_verbs.h to
ib_umem.h there isn't any reason for the macro IB_UMEM_MAX_PAGE_CHUNK
to stay in ib_verbs.h.  Move the macro to umem.c, the only place where
it is used.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-08-03 10:45:18 -07:00
Dotan Barak
bfb3ea1251 IB: Include <linux/list.h> and <linux/rwsem.h> from <rdma/ib_verbs.h>
ib_verbs.h uses struct list_head and rw_semaphore, so while the files
<linux/list.h> and <linux/rwsem.h> seem to be pulled in indirectly by
the other header files it includes, the right thing is to include
those files directly.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-08-03 10:45:18 -07:00
Dotan Barak
e8b398deb7 IB: Include <linux/list.h> from <rdma/ib_mad.h>
ib_mad.h uses struct list_head, so while linux/list.h seems to be
pulled in indirectly by one of the headers it includes, the right
thing is to include linux/list.h directly.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-08-03 10:45:17 -07:00
Sean Hefty
1d84612649 IB/cm: Include HCA ACK delay in local ACK timeout
The IB CM should include the HCA ACK delay when calculating the local
ACK timeout value to use for RC QPs.  If the HCA ACK delay is large
relative to the packet life time and is not taken into account, the
calculated timeout value ends up being too small, which can result in
"retry exceeded" errors.
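
As a rough illustration of why the delay matters, the sketch below assumes the usual InfiniBand encoding in which a timeout field n stands for 4.096 us * 2^n, and assumes the local ACK timeout should cover twice the packet life time plus the HCA ACK delay. The function name ack_timeout_exp() is hypothetical, and rounding the sum up to the next power of two is a conservative choice made for this example, not necessarily what the CM code does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's implementation).
 *
 * Both inputs are 5-bit exponents: a value n means 4.096 us * 2^n.
 * Assumed rule: timeout >= 2 * packet_life_time + ca_ack_delay.
 * The sum is rounded up to the next power of two and capped at 31,
 * the largest value a 5-bit field can hold.
 */
uint8_t ack_timeout_exp(uint8_t ca_ack_delay, uint8_t packet_life_time)
{
    assert(ca_ack_delay <= 31 && packet_life_time <= 31);

    /* Work in units of 4.096 us so that only the exponents matter. */
    uint64_t total = (1ULL << (packet_life_time + 1)) +
                     (1ULL << ca_ack_delay);

    /* Find the smallest exponent whose power of two covers the sum. */
    uint8_t n = 0;
    while (n < 31 && (1ULL << n) < total)
        n++;

    return n;
}
```

With a packet life time exponent of 14, the round trip alone would give 15; an ACK delay exponent of 15 doubles the total and pushes the result to 16, which is the kind of gap that would otherwise surface as "retry exceeded" errors.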

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-07-10 21:50:05 -07:00
Sean Hefty
2aec5c602c IB/sa: Make sure SA queries use default P_Key
MADs sent to the SA should use the default P_Key (0x7fff/0xffff).
There's no requirement that the default P_Key is stored at index 0 in
the local P_Key table, so add code to the sa_query module to look up
the index of the default P_Key when creating an address handle for the
SA (which is done any time the P_Key table might change), and use this
index for all SA queries.
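
The lookup described above can be sketched roughly as follows. This is a standalone illustration, not the sa_query code: the function find_default_pkey_index() and the flat array standing in for the port's P_Key table are hypothetical. It relies on the fact that the top bit of a P_Key is the membership bit, so 0x7fff (limited) and 0xffff (full) share the same 15-bit base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_BASE_MASK    0x7fffu  /* low 15 bits: the key itself */
#define DEFAULT_PKEY_BASE 0x7fffu  /* base of 0x7fff and 0xffff */

/*
 * Hypothetical helper: scan a P_Key table and return the index of the
 * first entry holding the default P_Key with either membership type,
 * or -1 if the table has no such entry.
 */
int find_default_pkey_index(const uint16_t *table, size_t len)
{
    assert(table != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        /* Ignore the membership bit when comparing. */
        if ((table[i] & PKEY_BASE_MASK) == DEFAULT_PKEY_BASE)
            return (int)i;
    }
    return -1;
}
```

Because the default P_Key need not sit at index 0, a caller would repeat this search whenever the table may have changed and cache the returned index for subsequent queries.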

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-07-10 21:45:31 -07:00
Linus Torvalds
8aee74c8ee Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
  IB/cm: Improve local id allocation
  IPoIB/cm: Fix SRQ WR leak
  IB/ipoib: Fix typos in error messages
  IB/mlx4: Check if SRQ is full when posting receive
  IB/mlx4: Pass send queue sizes from userspace to kernel
  IB/mlx4: Fix check of opcode in mlx4_ib_post_send()
  mlx4_core: Fix array overrun in dump_dev_cap_flags()
  IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions
  IB/mthca: Fix RESET to ERROR transition
  IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
  IB/mthca: Set GRH:HopLimit when building MLX headers
  IB/mlx4: Fix check of max_qp_dest_rdma in modify QP
  IB/mthca: Fix use-after-free on device restart
  IB/ehca: Return proper error code if register_mr fails
  IPoIB: Handle P_Key table reordering
  IB/core: Use start_port() and end_port()
  IB/core: Add helpers for uncached GID and P_Key searches
  IB/ipath: Fix potential deadlock with multicast spinlocks
  IB/core: Free umem when mm is already gone
2007-05-21 16:19:32 -07:00