Commit Graph

714 Commits

Author SHA1 Message Date
Linus Torvalds
4c929feed7 Main batch of InfiniBand/RDMA changes for 3.19:
- On-demand paging support in core midlayer and mlx5 driver.  This lets
    userspace create non-pinned memory regions and have the adapter HW
    trigger page faults.
  - iSER and IPoIB updates and fixes.
  - Low-level HW driver updates for cxgb4, mlx4 and ocrdma.
  - Other miscellaneous fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJUk2pBAAoJEENa44ZhAt0hL18P/jdCGbWVXOJh25KvjzmKIfUV
 T3Bdixz5h/Xj2iU6ShsXLSZa8vkPXsiO5v3MIQcR5MuPn88vrxemTy/OmBjefJeL
 qKGnWfy9O8KxMqhYZAokTTIyl5ygtSITbJsCE5W0KHgRBgBtexbrHeFBcWsT3AZ5
 piGyRP4XWc2LtfjrFWdUUjRELz9m74L93uILy0P8lS58k3M8YIOvkjqVmGj5Ya3U
 /hadgk1HYWfxjw+z3v0keaP1IoqHpJferH+UyjCj8UsIB9swXabE8ap/SFrQPIpe
 p+Zwyi25292mavqEfm/neUmvn34xLF8c00XB6UKxr42Q9yd1mDxnO+ZxWpxW5klQ
 tKEZeySDbB/WplGrumCeNXPonFvdBpGOTguP3z5o0bcgj1UJ+yVk8KjNBlwSWhQw
 Mkh/Rb6gSJzeidB3pnQV3TKVkvcFr+Li6DgbG6a77f0W7ggQC2UaeTwEPY5FlMtK
 n2jQddhnXYsQXeOEpDcISbpAnCIx+qjDIRv7jYTajw0hg8A669ytcI/gi4b9qJeU
 l3epZDszbCkRwPACzOXCRfeZRiz1H6/USI+Vn/yIQZBlHEd7TcK6ph+KDO/btX+D
 PWKrirIgzorJsIsDyD4WBXHfJnNS1Imfoxl5s7/8kkwrIkwY+lGpU0zM1bu7cS8W
 c32iGI9+dgHSPTZt3RdL
 =MCO9
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband updates from Roland Dreier:
 "Main batch of InfiniBand/RDMA changes for 3.19:

   - On-demand paging support in core midlayer and mlx5 driver.  This
     lets userspace create non-pinned memory regions and have the
     adapter HW trigger page faults.
   - iSER and IPoIB updates and fixes.
   - Low-level HW driver updates for cxgb4, mlx4 and ocrdma.
   - Other miscellaneous fixes"

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (56 commits)
  IB/mlx5: Implement on demand paging by adding support for MMU notifiers
  IB/mlx5: Add support for RDMA read/write responder page faults
  IB/mlx5: Handle page faults
  IB/mlx5: Page faults handling infrastructure
  IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
  IB/mlx5: Changes in memory region creation to support on-demand paging
  IB/mlx5: Implement the ODP capability query verb
  mlx5_core: Add support for page faults events and low level handling
  mlx5_core: Re-add MLX5_DEV_CAP_FLAG_ON_DMND_PG flag
  IB/srp: Allow newline separator for connection string
  IB/core: Implement support for MMU notifiers regarding on demand paging regions
  IB/core: Add support for on demand paging regions
  IB/core: Add flags for on demand paging support
  IB/core: Add support for extended query device caps
  IB/mlx5: Add function to read WQE from user-space
  IB/core: Add umem function to read data from user-space
  IB/core: Replace ib_umem's offset field with a full address
  IB/mlx5: Enhance UMR support to allow partial page table update
  IB/mlx5: Remove per-MR pas and dma pointers
  RDMA/ocrdma: Always resolve destination mac from GRH for UD QPs
  ...
2014-12-18 20:10:44 -08:00
Ido Shamay
c3f2511fea net/mlx4: Cache line CQE/EQE stride fixes
This commit contains 2 fixes for the 128B CQE/EQE stride feaure.
Wei found that mlx4_QUERY_HCA function marked the wrong capability
in flags (64B CQE/EQE), when CQE/EQE stride feature was enabled.
Also added small fix in initial CQE ownership bit assignment, when CQE
is size is not default 32B.

Fixes: 77507aa24 (net/mlx4: Enable CQE/EQE stride support)
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-16 15:23:53 -05:00
Yuval Shaia
0b9976577c mlx4_core: Check for DPDP violation only when DPDP is not supported
Move check for DPDP out of the loop to make the code more readable.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15 18:12:29 -08:00
Or Gerlitz
c78e25edbf net/mlx4_core: Avoid double dumping of the PF device capabilities
To support asymmetric EQ allocations, we should query the device
capabilities prior to enabling SRIOV. As a side effect of adding that,
we are dumping the PF device capabilities twice. Avoid that by moving
the printing into a helper function which is called once.

Fixes: 7ae0e400cd ('net/mlx4_core: Flexible (asymmetric) allocation of
		     EQs and MSI-X vectors for PF/VFs')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-15 11:34:54 -05:00
Matan Barak
da315679e8 net/mlx4_core: Fixed memory leak and incorrect refcount in mlx4_load_one
The current mlx4_load_one has a memory leak as it always allocates
dev_cap, but frees it only on error.

In addition, even if VFs exist when mlx4_load_one is called,
we still need to notify probed VFs that we're loading (by
incrementing pf_loading).

Fixes: a0eacca948 ('net/mlx4_core: Refactor mlx4_load_one')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-15 11:34:53 -05:00
Matan Barak
7d077cd34e net/mlx4: Add support for A0 steering
Add the required firmware commands for A0 steering and a way to enable
that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT,
QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used
to configure and query the device.

The different A0 DMFS (steering) modes are:

Static - optimized performance, but flow steering rules are
limited. This mode should be choosed explicitly by the user
in order to be used.

Dynamic - this mode should be explicitly choosed by the user.
In this mode, the FW works in optimized steering mode as long as
it can and afterwards automatically drops to classic (full) DMFS.

Disable - this mode should be explicitly choosed by the user.
The user instructs the system not to use optimized steering, even if
the FW supports Dynamic A0 DMFS (and thus will be able to use optimized
steering in Default A0 DMFS mode).

Default - this mode is implicitly choosed. In this mode, if the FW
supports Dynamic A0 DMFS, it'll work in this mode. Otherwise, it'll
work at Disable A0 DMFS mode.

Under SRIOV configuration, when the A0 steering mode is enabled,
older guest VF drivers who aren't using the RX QP allocation flag
(MLX4_RESERVE_A0_QP) will get a QP from the general range and
fail when attempting to register a steering rule. To avoid that,
the PF context behaviour is changed once on A0 static mode, to
require support for the allocation flag in VF drivers too.

In order to enable A0 steering, we use log_num_mgm_entry_size param.
If the value of the parameter is not positive, we treat the absolute
value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this
bit field enables static A0 steering.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:36 -05:00
Matan Barak
431df8c7e9 net/mlx4: Refactor QUERY_PORT
Currently QUERY_PORT is done as a part of QUERY_DEV_CAP firmware command.

Since we would like to use it without querying all device capabilities,
extract this part to be a function of its own.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Matan Barak
579d059bd2 net/mlx4_core: Add explicit error message when rule doesn't meet configuration
When a given flow steering rule is invalid in respect to the current
steering configuration, print the correct error message to the system log.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Matan Barak
d57febe1a4 net/mlx4: Add A0 hybrid steering
A0 hybrid steering is a form of high performance flow steering.
By using this mode, mlx4 cards use a fast limited table based steering,
in order to enable fast steering of unicast packets to a QP.

In order to implement A0 hybrid steering we allocate resources
from different zones:
(1) General range
(2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region.

When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP,
we try hard to allocate the QP from range (2). Otherwise, we try hard not
to allocate from this  range. However, when the system is pushed to its
limits and one needs every resource, the allocator uses every region it can.

Meaning, when we run out of raw-eth qps, the allocator allocates from the
general range (and the special-A0 area is no longer active). If we run out
of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that
is also exhausted, the allocator will allocate from the general range
(and the A0 region is no longer active).

Note that if a raw-eth qp is allocated from the general range, it attempts
to allocate the range such that bits 6 and 7 (blueflame bits) in the
QP number are not set.

When the feature is used in SRIOV, the VF has to notify the PF what
kind of QP attributes it needs. In order to do that, along with the
"Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According
to the combination of these bits, the PF tries to allocate a suitable QP.

In order to maintain backward compatibility (with older PFs), the PF
notifies which QP attributes it supports via QUERY_FUNC_CAP command.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Matan Barak
7a89399ffa net/mlx4: Add mlx4_bitmap zone allocator
The zone allocator is a mechanism which manages a few mlx4_bitmaps.

When allocating a resource, the user indicates the desired zone of
which this resource will be allocated from. If possible, the resource
will be allocated from this zone. Otherwise, the resource will be
allocated from a less-than, equal-to, higher-than priority zone,
according to the desired zone's properties with that respective
allocation order.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Dotan Barak
ab256e5ad0 net/mlx4: Add a check if there are too many reserved QPs
The number of reserved QPs is affected both from the firmware and
from the driver's requirements. This patch adds a check that
validates that this number is indeed feasable.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Eugenia Emantayev
ddae0349fd net/mlx4: Change QP allocation scheme
When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields
in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset.

The current Ethernet driver code reserves a Tx QP range with 256b alignment.

This is wrong because if there are more than 64 Tx QPs in use,
QPNs >= base + 65 will have bits 6/7 set.

This problem is not specific for the Ethernet driver, any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.

The new mechanism introduced here will support reservation for
"Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:

1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
   and request "BF enabled QPs" if BF is supported for the function

2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0]  - number of QPs
b. param1[31-24] - flags controlling QPs reservation

Bit 31 refers to Eth blueflame supported QPs. Those QPs must have
bits 6 and 7 unset in order to be used in Ethernet.

Bits 24-30 of the flags are currently reserved.

When a function tries to allocate a QP, it states the required attributes
for this QP. Those attributes are considered "best-effort". If an attribute,
such as Ethernet BF enabled QP, is a must-have attribute, the function has
to check that attribute is supported before trying to do the allocation.

In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those attributes
and masks out unsupported attributes as well. In order to notify VFs which
attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's
mailbox is filled by the PF, which notifies which QP allocation attributes
it supports.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:35 -05:00
Matan Barak
3dca0f42c7 net/mlx4_core: Use tasklet for user-space CQ completion events
Previously, we've fired all our completion callbacks straight from our ISR.

Some of those callbacks were lightweight (for example, mlx4_en's and
IPoIB napi callbacks), but some of them did more work (for example,
the user-space RDMA stack uverbs' completion handler). Besides that,
doing more than the minimal work in ISR is generally considered wrong,
it could even lead to a hard lockup of the system. Since when a lot
of completion events are generated by the hardware, the loop over those
events could be so long, that we'll get into a hard lockup by the system
watchdog.

In order to avoid that, add a new way of invoking completion events
callbacks. In the interrupt itself, we add the CQs which receive completion
event to a per-EQ list and schedule a tasklet. In the tasklet context
we loop over all the CQs in the list and invoke the user callback.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:34 -05:00
Or Gerlitz
383677da43 net/mlx4_core: Mask out host side virtualization features for guests
When VFs (guests in this context) issue the QUERY_DEV_CAP command, they
need not be told that host side virtualization features such as VST, FSM
(MAC anti-spoofing) and running > 80 VFs are supported by the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:34 -05:00
Or Gerlitz
c58942f252 net/mlx4_en: Set csum level for encapsulated packets
This was dropped by mistake for the napi_gro_frags flow, fix that.

Fixes: dd65beac48 ('net/mlx4_en: Extend usage of napi_gro_frags')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:47:34 -05:00
Eyal Perry
947cbb0ac2 net/mlx4_en: Support for configurable RSS hash function
The ConnectX HW is capable of using one of the following hash functions:
Toeplitz and an XOR hash function. This patch extends the implementation
of the mlx4_en driver set/get_rxfh callbacks to support getting and
setting the RSS hash function used by the device.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-08 21:07:10 -05:00
Eyal Perry
892311f66f ethtool: Support for configurable RSS hash function
This patch extends the set/get_rxfh ethtool-options for getting or
setting the RSS hash function.

It modifies drivers implementation of set/get_rxfh accordingly.

This change also delegates the responsibility of checking whether a
modification to a certain RX flow hash parameter is supported to the
driver implementation of set_rxfh.

User-kernel API is done through the new hfunc bitmask field in the
ethtool_rxfh struct. A bit set in the hfunc field is corresponding to an
index in the new string-set ETH_SS_RSS_HASH_FUNCS.

Got approval from most of the relevant driver maintainers that their
driver is using Toeplitz, and for the few that didn't answered, also
assumed it is Toeplitz.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: Shradha Shah <sshah@solarflare.com>
Cc: Shreyas Bhatewara <sbhatewara@vmware.com>
Cc: "VMware, Inc." <pv-drivers@vmware.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-08 21:07:10 -05:00
Jiri Pirko
02637fce3e net: rename netdev_phys_port_id to more generic name
So this can be reused for identification of other "items" as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:19 -08:00
David S. Miller
60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Jack Morgenstein
2d5c57d7fb net/mlx4_core: Limit count field to 24 bits in qp_alloc_res
Some VF drivers use the upper byte of "param1" (the qp count field)
in mlx4_qp_reserve_range() to pass flags which are used to optimize
the range allocation.

Under the current code, if any of these flags are set, the 32-bit
count field yields a count greater than 2^24, which is out of range,
and this VF fails.

As these flags represent a "best-effort" allocation hint anyway, they may
safely be ignored. Therefore, the PF driver may simply mask out the bits.

Fixes: c82e9aa0a8 "mlx4_core: resource tracking for HCA resources used by guests"
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-26 12:04:49 -05:00
Eric Dumazet
bd635c354d mlx4: fix mlx4_en_set_rxfh()
mlx4_en_set_rxfh() can crash if no RSS indir table is provided.

While we are at it, allow RSS key to be changed with ethtool -X

Tested:

myhost:~# cat /proc/sys/net/core/netdev_rss_key
b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e:c5:5d:4d:8c:22:67:30🆎8a:6e:c3:6a

myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0:      0     1     2     3     4     5     6     7
RSS hash key:
b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e

myhost:~# ethtool -X eth0 hkey \
03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f

myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0:      0     1     2     3     4     5     6     7
RSS hash key:
03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: b9d1ab7eb4 ("mlx4: use netdev_rss_key_fill() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-23 13:49:12 -05:00
David S. Miller
1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Saeed Mahameed
312df74c71 net/mlx4_en: mlx4_en_set_settings() always fails when autoneg is set
Fix ethtool set settings to not check AUTONEG_ENABLE

mlx4_en_set_settings should not check if cmd->autoneg == AUTONEG_ENABLE,
cmd->autoneg can be enabled by default and this check will fail other settings requests.
mlx4_en driver doesn't support changing autoneg value, but shouldn't fail the request
in case cmd->autoneg was set.

Fixes: d48b3ab ("net/mlx4_en: Use PTYS register to set ethtool settings (Speed)")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 15:06:30 -05:00
Al Viro
914efb02be mlx4: don't duplicate kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:58:18 -05:00
Or Gerlitz
9737c6ab7a net/mlx4_en: Add VXLAN ndo calls to the PF net device ops too
This is currently missing, which results in a crash when one attempts
to set VXLAN tunnel over the mlx4_en when acting as PF.

	[ 2408.785472] BUG: unable to handle kernel NULL pointer dereference at (null)
	[...]
	[ 2408.994104] Call Trace:
	[ 2408.996584]  [<ffffffffa021f7f5>] ? vxlan_get_rx_port+0xd6/0x103 [vxlan]
	[ 2409.003316]  [<ffffffffa021f71f>] ? vxlan_lowerdev_event+0xf2/0xf2 [vxlan]
	[ 2409.010225]  [<ffffffffa0630358>] mlx4_en_start_port+0x862/0x96a [mlx4_en]
	[ 2409.017132]  [<ffffffffa063070f>] mlx4_en_open+0x17f/0x1b8 [mlx4_en]

While here, make sure to invoke vxlan_get_rx_port() only when VXLAN
offloads are actually enabled and not when they are only supported.

Reported-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:11:09 -05:00
Eric Dumazet
b9d1ab7eb4 mlx4: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.

Also provide ethtool -x support to fetch RSS key

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-16 15:59:13 -05:00
Joe Stringer
956bdab2e4 net/mlx4_en: Implement ndo_gso_check()
Use vxlan_gso_check() to advertise offload support for this NIC.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 17:12:48 -05:00
David S. Miller
076ce44825 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/chelsio/cxgb4vf/sge.c
	drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c

sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.

ixgbe_phy.c was a set of overlapping whitespace changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 01:01:12 -05:00
Matan Barak
de966c5928 net/mlx4_core: Support more than 64 VFs
We now allow up to 126 VFs. Note though that certain firmware
versions only allow up to 80 VFs. Moreover, old HCAs only support 64 VFs.
In these cases, we limit the maximum number of VFs to 64.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:22 -05:00
Matan Barak
7ae0e400cd net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs
Previously, the driver queried the firmware in order to get the number
of supported EQs. Under SRIOV, since this was done before the driver
notified the firmware how many VFs it actually needs, the firmware had
to take into account a worst case scenario and always allocated four EQs
per VF, where one was used for events while the others were used for completions.

Now, when the firmware supports the asymmetric allocation scheme, denoted
by exposing num_sys_eqs > 0 (--> MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the
QUERY_FUNC command to query the firmware before enabling SRIOV. Thus we
can get more EQs and MSI-X vectors per function.

Moreover, when running in the new firmware/driver mode, the limitation
that the number of EQs should be a power of two is lifted.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:21 -05:00
Matan Barak
e8c4265bea net/mlx4_core: Add QUERY_FUNC firmware command
QUERY_FUNC firmware command could be used in order to query the
number of EQs, reserved EQs, etc for a specific function.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:19 -05:00
Matan Barak
a0eacca948 net/mlx4_core: Refactor mlx4_load_one
Refactor mlx4_load_one, as a preparation step for a new and
more complicated load function. The goal is to support both
newer firmware that required init_hca to be done before
enable_sriov and legacy firmwares that requires things to
be done the other way around.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:18 -05:00
Matan Barak
ffc39f6d6f net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup
Refactoring mlx4_cmd_init and mlx4_cmd_cleanup such that partial init
and cleanup are possible. After this refactoring, calling mlx4_cmd_init
several times is safe.

This is necessary in the VF init flow when mlx4_init_hca returns -EACCESS,
we need to issue cleanup and re-attempt to call it with the slave flag.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:17 -05:00
Matan Barak
225c6c8c6b net/mlx4_core: Use correct variable type for mlx4_slave_cap
We've used an incorrect type for the loop counter and the
mlx4_QUERY_FUNC_CAP function. The current input modifier
is either a port or a boolean.
Since the number of ports is always a positive value < 255,
we should use u8 instead of an integer with casting.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:16 -05:00
Matan Barak
7c68dd435b net/mlx4_core: Fix wrong reading of reserved_eqs
We mistakenly read the reserved_eqs field as a standard
numeric value rather than a log2 value.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:16:15 -05:00
Or Gerlitz
f4a1edd561 net/mlx4_en: Advertize encapsulation offloads features only when VXLAN tunnel is set
Currenly we only support Large-Send and TX checksum offloads for
encapsulated traffic of type VXLAN. We must make sure to advertize
these offloads up to the stack only when VXLAN tunnel is set.

Failing to do so, would mislead the the networking stack to assume
that the driver can offload the internal TX checksum for GRE packets
and other buggy schemes.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:24:45 -05:00
Shani Michaeli
f8c6455bb0 net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
When processing received traffic, pass CHECKSUM_COMPLETE status to the
stack, with calculated checksum for non TCP/UDP packets (such
as GRE or ICMP).

Although the stack expects checksum which doesn't include the pseudo
header, the HW adds it. To address that, we are subtracting the pseudo
header checksum from the checksum value provided by the HW.

In the IPv6 case, we also compute/add the IP header checksum which
is not added by the HW for such packets.

Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:20:02 -05:00
Shani Michaeli
dd65beac48 net/mlx4_en: Extend usage of napi_gro_frags
We can call napi_gro_frags for all the received traffic regardless
of the checksum status. Specifically, received packets whose status
is CHECKSUM_NONE (and soon to be added CHECKSUM_COMPLETE)
are eligible for napi_gro_frags as well.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:20:02 -05:00
Eric Dumazet
2e1af7d74f mlx4: restore conditional call to napi_complete_done()
After commit 1a28817282 ("mlx4: use napi_complete_done()") we ended up
calling napi_complete_done() in the case NAPI poll consumed all its
budget.

This added extra interrupt pressure, this patch restores proper
behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 1a28817282 ("mlx4: use napi_complete_done()")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 21:09:03 -05:00
Eric Dumazet
1a28817282 mlx4: use napi_complete_done()
To enable gro_flush_timeout, a driver has to use napi_complete_done()
instead of napi_complete().

Tested:
 Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)

Without this feature, we send back about 305,000 ACK per second.

GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)

Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)

Receiver performs less calls to upper stacks, less wakes up.
This also reduces cpu usage on the sender, as it receives less ACK
packets.

Note that reducing number of wakes up increases cpu efficiency, but can
decrease QPS, as applications wont have the chance to warmup cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
0.00      0.50

B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
0.00      0.50

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 12:05:59 -05:00
Matan Barak
d475c95b4b net/mlx4_core: Add retrieval of CONFIG_DEV parameters
Add code to issue CONFIG_DEV "get" firmware command.

This command is used in order to obtain certain parameters used for
supporting various RX checksumming options and vxlan UDP port.

The GET operation is allowed for VFs too.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:28:14 -05:00
Ido Shamay
1ab25f86c4 net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages
Needed in order to get cache cold pages (L3 flushed) for HW scatter.

Otherwise memory may flush those entries when the packet comes from
PCI, causing back pressure resulting in BW decrease.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:28:13 -05:00
Ido Shamay
5f6e980080 net/mlx4_en: Remove RX buffers alignment to IP_ALIGN
When IP_ALIGN has a non zero value, hardware will write to a non aligned
address. The only reader from this address is when copying the header
from the first frag into the linear buffer (further access to the IP
address will be from the linear buffer, in which the headers are
aligned). Since the penalty of non align access by the hardware is
greater than the software memcpy, changing the frag_align to always be 0.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:28:13 -05:00
Amir Vadai
0a98455666 net/mlx4_core: Protect port type setting by mutex
We need to protect set_port_type() for concurrency, as the sysfs code could
call it from mutliple contexts in parallel.

The port_mutex is not enough because we need to protect from concurrent
modification of 'info' and stopping of the port sensing work.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:28:13 -05:00
Saeed Mahameed
6e80669998 net/mlx4_core: Prevent VF from changing port configuration
Added wrapper to the ACCESS_REG command for handling guest HW
registers access, preventing write operations, but do allow reads.

This will prevent SRIOV guests to change port PTYS configuration,
such as speed/advertised link modes.

Fixes: adbc7ac5c1 ('net/mlx4_core: Introduce ACCESS_REG CMD [...]')
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:28:13 -05:00
David S. Miller
55b42b5ca2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/marvell.c

Simple overlapping changes in drivers/net/phy/marvell.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-01 14:53:27 -04:00
Or Gerlitz
571e1b2c7a mlx4: Avoid leaking steering rules on flow creation error flow
If mlx4_ib_create_flow() attempts to create > 1 rules with the
firmware, and one of these registrations fail, we leaked the
already created flow rules.

One example of the leak is when the registration of the VXLAN ghost
steering rule fails, we didn't unregister the original rule requested
by the user, introduced in commit d2fce8a906 "mlx4: Set
user-space raw Ethernet QPs to properly handle VXLAN traffic".

While here, add dump of the VXLAN portion of steering rules
so it can actually be seen when flow creation fails.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30 19:48:58 -04:00
Or Gerlitz
a4f2dacbf2 net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
For VXLAN/NVGRE encapsulation, the current HW doesn't support offloading
both the outer UDP TX checksum and the inner TCP/UDP TX checksum.

The driver doesn't advertize SKB_GSO_UDP_TUNNEL_CSUM, however we are wrongly
telling the HW to offload the outer UDP checksum for encapsulated packets,
fix that.

Fixes: 837052d0cc ('net/mlx4_en: Add netdev support for TCP/IP
		     offloads of vxlan tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30 19:48:58 -04:00
Eric Dumazet
477b35b44f mlx4: use napi_schedule_irqoff()
mlx4_en_rx_irq() and mlx4_en_tx_irq() run from hard interrupt context.

They can use napi_schedule_irqoff() instead of napi_schedule()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30 16:50:47 -04:00
Amir Vadai
d5ec899adb net/mlx4_en: Report actual number of rings in indirection table
Hardware requires the number of rings in indirection table to be a power
of 2. When setting number of channels to a non power of 2 number,
indirection table is using only the closest power of 2 rings.
Report this number in 'ethtool -x' and not the total number of rx rings.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-28 17:18:01 -04:00