Commit 5b4c4d3686 "mlx4_en: Allow communication between functions on
same host" introduced a regression under which a bridge acting as vSwitch
whose uplink is an mlx4 Ethernet device becomes non-operative in native
(non-SRIOV) mode. This happens because broadcast ARP requests sent by VMs
were looped back by the HW, and hence the bridge learned VM source MACs
on both the VM and the uplink ports.
The fix is to place the DMAC in the send WQE only under SRIOV/eSwitch
configuration or when the device is in selftest.
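For illustration, a minimal sketch of that gating in mlx4_en_xmit(),
assuming the mlx4_is_mfunc() helper and the priv->validate_loopback flag
from the mlx4 tree (the descriptor fields shown follow the driver's
conventions):

	/* Place the DMAC in the WQE only when the eSwitch (multi-function/
	 * SRIOV) may loop the packet back, or during selftest; in plain
	 * native mode leave it out so broadcasts aren't looped back.
	 */
	if (mlx4_is_mfunc(priv->mdev->dev) || priv->validate_loopback) {
		struct ethhdr *ethh = (struct ethhdr *)skb->data;

		tx_desc->ctrl.srcrb_flags16[0] =
			get_unaligned((__be16 *)ethh->h_dest);
		tx_desc->ctrl.imm =
			get_unaligned((__be32 *)(ethh->h_dest + 2));
	}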
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband update from Roland Dreier:
"First batch of InfiniBand/RDMA changes for the 3.8 merge window:
- A good chunk of Bart Van Assche's SRP fixes
- UAPI disintegration from David Howells
- mlx4 support for "64-byte CQE" hardware feature from Or Gerlitz
- Other miscellaneous fixes"
Fix up trivial conflict in mellanox/mlx4 driver.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (33 commits)
RDMA/nes: Fix for crash when registering zero length MR for CQ
RDMA/nes: Fix for terminate timer crash
RDMA/nes: Fix for BUG_ON due to adding already-pending timer
IB/srp: Allow SRP disconnect through sysfs
srp_transport: Document sysfs attributes
srp_transport: Simplify attribute initialization code
srp_transport: Fix attribute registration
IB/srp: Document sysfs attributes
IB/srp: send disconnect request without waiting for CM timewait exit
IB/srp: destroy and recreate QP and CQs when reconnecting
IB/srp: Eliminate state SRP_TARGET_DEAD
IB/srp: Introduce the helper function srp_remove_target()
IB/srp: Suppress superfluous error messages
IB/srp: Process all error completions
IB/srp: Introduce srp_handle_qp_err()
IB/srp: Simplify SCSI error handling
IB/srp: Keep processing commands during host removal
IB/srp: Eliminate state SRP_TARGET_CONNECTING
IB/srp: Increase block layer timeout
RDMA/cm: Change return value from find_gid_port()
...
Add support for changing the number of rx/tx channels using
ethtool ('ethtool -[lL]'). Note that the number of tx channels specified
in ethtool is the number of rings per user priority - not the total number
of tx rings.
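As a sketch, the reporting side of the new hooks could look like this
(ring-count field names such as tx_ring_num and MLX4_EN_NUM_UP follow the
driver's style but are illustrative here):

	static void mlx4_en_get_channels(struct net_device *dev,
					 struct ethtool_channels *channel)
	{
		struct mlx4_en_priv *priv = netdev_priv(dev);

		memset(channel, 0, sizeof(*channel));
		channel->max_rx = MAX_RX_RINGS;
		/* tx is reported/set per user priority, not in total */
		channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
		channel->rx_count = priv->rx_ring_num;
		channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
	}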
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ConnectX-3 devices can use either 64- or 32-byte completion queue
entries (CQEs) and event queue entries (EQEs). Using 64-byte
EQEs/CQEs performs better because each entry is aligned to a complete
cacheline. This patch queries the HCA's capabilities, and if it
supports 64-byte CQEs and EQEs, the driver will configure the HW to
work in 64-byte mode.
The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.
Since this mode is global, userspace (libmlx4) must be updated to work
with the configured CQE size, and guests using SR-IOV virtual
functions need to know both EQE and CQE size.
In case one of the 64-byte CQE/EQE capabilities is activated, the
patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
command (e.g. as done in mlx4_core of Linux 3.3..3.6) will notice that
they need an update to be able to work with the PPF. This is done by
changing the returned pf_context_behaviour so that it is no longer
zero. In case none of these capabilities is activated, that value
remains zero and older guest drivers can run OK.
The SRIOV-related flow is as follows:
1. The PPF does the detection of the new capabilities using the
QUERY_DEV_CAP command.
2. The PPF activates the new capabilities using INIT_HCA.
3. The VF detects if the PPF activated the capabilities using
QUERY_HCA, and if this is the case, activates them for itself too.
Note that the VF detects that it must be aware of the new PF behaviour
using QUERY_FUNC_CAP. Steps 1 and 2 apply also for native mode.
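A rough sketch of that flow in mlx4 capability-flag style (the exact flag
and field spellings below are illustrative, not the precise upstream
names):

	/* Steps 1+2, PPF: detect via QUERY_DEV_CAP, activate via INIT_HCA */
	if (enable_64b_cqe_eqe && (dev_cap->flags & MLX4_DEV_CAP_FLAG_64B_CQE))
		init_hca->flags |= INIT_HCA_64B_CQE;	/* illustrative */

	/* Step 3, VF: detect what the PPF activated via QUERY_HCA */
	if (hca_param.dev_cap_enabled & MLX4_DEV_CAP_64B_CQE_ENABLED)
		dev->caps.cqe_size = 64;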
User space notification is done through a new field introduced in
struct mlx4_ib_ucontext which holds device capabilities for which user
space must take action. This changes the binary interface, so the ABI
towards libmlx4 exposed through uverbs is bumped from 3 to 4, but only
when **needed**, i.e. only when the driver does use 64-byte CQEs or
future device capabilities with which user space must be in sync. This
practice allows working with unmodified libmlx4 on older devices (e.g.
A0, B0) which don't support 64-byte CQEs.
In order to keep existing systems functional when they update to a
newer kernel that contains these changes in VF and userspace ABI, a
module parameter enable_64b_cqe_eqe must be set to enable 64-byte
mode; the default is currently false.
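The parameter itself is the standard moduleparam pattern; a sketch (the
description string is illustrative):

	static bool enable_64b_cqe_eqe;
	module_param(enable_64b_cqe_eqe, bool, 0444);
	MODULE_PARM_DESC(enable_64b_cqe_eqe,
			 "Enable 64 byte CQEs/EQEs when FW supports this (default: off)");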
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The vlan tag can be zero. This is why it can't serve as an indication
that a packet requires a VLAN header in the TX flow.
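A sketch of the corrected test, keying on vlan_tx_tag_present() (the
accessor of that era) instead of on a nonzero tag:

	/* wrong: a valid tag of 0 would skip VLAN header insertion
	 *	if (vlan_tag) ...
	 * right: */
	if (vlan_tx_tag_present(skb)) {
		vlan_tag = vlan_tx_tag_get(skb);
		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN;
	}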
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The QP range is reserved as a single block. However, when freeing the
mlx4_en resources, the tx-ring QPs are released both in mlx4_en_destroy_tx_ring
(one at a time) and in mlx4_en_free_resources (as a block release), i.e.
the same QPs are released twice.
Fix by eliminating the one-at-a-time release in mlx4_en_destroy_tx_ring.
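That leaves a single block release, e.g. in mlx4_en_free_resources(); a
sketch (the base-QPN field name is illustrative):

	/* release the whole reserved TX QP range in one call */
	mlx4_qp_release_range(priv->mdev->dev, priv->base_tx_qpn,
			      priv->tx_ring_num);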
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit e22979d96a (mlx4_en: Moving to Interrupts for TX
completions) we no longer need to orphan skbs in mlx4_en_xmit(),
since skbs won't stay long in the TX ring before their release.
Orphaning skbs in ndo_start_xmit() should be avoided as much as
possible, since it breaks TCP Small Queues and other flow control
mechanisms (per-socket limits).
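The change itself reduces to deleting the early orphan in mlx4_en_xmit();
roughly:

	/* removed: detaching the skb from its socket here released the
	 * socket's wmem accounting before the TX completion, defeating
	 * TSQ's per-socket in-flight limit:
	 *
	 *	skb_orphan(skb);
	 */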
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing the ring->blocked flag; it is redundant and leads to a race:
we close the TX queue and then set the "blocked" flag.
Between those 2 operations the completion function can check the "blocked"
flag, see that it is 0, and not reopen the TX queue.
Use netif_tx_queue_stopped to check the state of the queue and avoid this race.
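A sketch of the race-free wake-up in the TX completion handler (the ring
fields and the ring-full predicate are illustrative):

	struct netdev_queue *txq = netdev_get_tx_queue(dev, ring->queue_index);

	/* query the real queue state instead of a shadow flag */
	if (netif_tx_queue_stopped(txq) && !mlx4_en_tx_ring_full(ring))
		netif_tx_wake_queue(txq);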
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the TX ring scheme such that the number of rings for untagged packets
and for tagged packets (per each of the vlan priorities) is the same, unlike
the current situation where for tagged traffic there's one ring per priority
and for untagged traffic there are as many rings as cores.
Queue selection is done as follows:
If the mqprio qdisc is operating on the interface, such that the core
networking code invoked the device setup_tc ndo callback, a mapping of
skb->priority => queue set is forced - for both tagged and untagged traffic.
Else, the egress map skb->priority => user priority is used for tagged
traffic, and all untagged traffic is sent through the tx rings of UP 0.
The patch follows the conclusions of discussing that issue with John Fastabend
over this thread http://comments.gmane.org/gmane.linux.network/229877
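A sketch of the queue selection described above (the per-UP ring count
field is illustrative):

	u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
	{
		struct mlx4_en_priv *priv = netdev_priv(dev);
		u16 rings_p_up = priv->num_tx_rings_p_up;
		u8 up = 0;

		/* mqprio active: the stack's skb->priority => TC map rules */
		if (dev->num_tc)
			return skb_tx_hash(dev, skb);

		/* tagged: UP from the tag's priority bits (egress map result) */
		if (vlan_tx_tag_present(skb))
			up = vlan_tx_tag_get(skb) >> VLAN_PRIO_SHIFT;

		/* untagged: up stays 0 - hash within that UP's ring set */
		return __skb_tx_hash(dev, skb, rings_p_up) + up * rings_p_up;
	}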
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving to interrupts instead of polling for TX completions,
avoiding situations where an skb can be held by the driver for
a long time (till a timer expires).
The change is also necessary for supporting BQL.
Removing comp_lock, which was required because we could handle TX
completions from several contexts: interrupts, timer, polling.
Now there are only interrupts.
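The interrupt-driven path then reduces to re-arming the CQ after each
round of completion processing (helper name as used in mlx4_en):

	/* in the TX CQ handler, after reaping completions: ask the HW to
	 * interrupt again on the next completion - no timer, no polling */
	mlx4_en_arm_cq(priv, cq);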
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the vlan egress map is only good for tagged traffic, another
mapping is needed for untagged traffic.
For that, the driver uses the sch_mqprio mapping. This mapping can be set
using the tc tool from the iproute2 package.
The mapped UP will be used by the HW for QoS purposes, but won't go out on
the wire.
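The mapping can be installed with, e.g., 'tc qdisc add dev eth0 root
mqprio', and consumed in the driver roughly like this:

	/* UP for untagged traffic from the mqprio priority => TC map */
	up = netdev_get_prio_tc_map(dev, skb->priority);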
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of relying on the HW to change the schedule queue by UP, the
schedule queue is fixed for a tx_ring, and the UP in the WQE is ignored in
this aspect. This resolves two issues with untagged traffic:
1. Untagged traffic has no UP in the packet, which is needed for QoS. The
change above allows setting the schedule queue (and by that the UP) of such
a stream.
2. BlueFlame uses the same field used by the vlan tag, so forcing the UP
from the QPC allows using BF for untagged but prioritized traffic.
With old firmware where forcing the UP is not supported, untagged traffic
will not be subject to QoS.
Because the UP is set by the QP, we need to always have a tx ring per UP,
even if the pfcrx module parameter is false.
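A rough sketch of pinning the schedule queue in the QP context (the bit
layout shown is illustrative, not the exact firmware format):

	/* one TX ring (QP) per UP: fix its schedule queue to that UP so
	 * the WQE's UP field no longer matters for scheduling */
	context->pri_path.sched_queue = 0x83 | ((up & 0x7) << 3);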
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The blue flame buffer is defined to be of type void __iomem *,
but was passed to mlx4_bf_copy(), which takes unsigned long *.
This triggered a sparse warning about different address spaces;
fix that by changing the first param of mlx4_bf_copy() to be of type
void __iomem *.
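After the fix the prototype becomes (per the text above; the other
parameters are unchanged):

	void mlx4_bf_copy(void __iomem *dst, unsigned long *src,
			  unsigned int bytecnt);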
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Localize the pdev->dev pointer, and use dma_map instead of pci_map.
There are multiple map/unmap operations on the data path; optimize
those by saving redundant pointer accesses.
Those places were identified as hot-spots when running kernel profiling
during some benchmarks.
The fixes had the most impact when testing packet rate with small packets,
reducing several % from the CPU load, and in some cases being the difference
between reaching wire speed or being CPU bound.
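The pattern, as a sketch (the cached ddev pointer is illustrative):

	struct device *ddev = priv->ddev;	/* cached &mdev->pdev->dev */
	dma_addr_t dma;

	/* dma_* API directly - no per-packet pci_dev dereference chain */
	dma = dma_map_single(ddev, data, size, DMA_TO_DEVICE);
	/* ... post the buffer; then, on completion: */
	dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE);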
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix new sparse errors introduced in commit 6221217199 (mlx4_en: dont
change mac_header on xmit)
Reported-by: Or Gerlitz <or.gerlitz@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
A driver xmit function is not allowed to change the skb without special
care.
mlx4_en_xmit() should not call skb_reset_mac_header() and instead should
use skb->data to access the ethernet header.
This removes a dumb test: if (ethh && ethh->h_dest)
Also remove the slow mlx4_en_mac_to_u64() call; we can use
get_unaligned() to get faster code.
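A sketch of the faster access, reading the DMAC straight from skb->data
with unaligned loads (descriptor fields per the driver's conventions):

	struct ethhdr *ethh = (struct ethhdr *)skb->data;

	/* 2 + 4 bytes of the destination MAC; no skb_reset_mac_header(),
	 * no byte-by-byte mlx4_en_mac_to_u64() conversion */
	tx_desc->ctrl.srcrb_flags16[0] = get_unaligned((__be16 *)ethh->h_dest);
	tx_desc->ctrl.imm = get_unaligned((__be32 *)(ethh->h_dest + 2));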
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alloc failures use dump_stack(), so emitting an additional
out-of-memory message is an unnecessary duplication.
Remove the allocation failure messages.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To enable internal loopback, always fill the DMAC in the control segment
when transmitting the packet; once this is done, the packet is subject
to loopback if the DMAC matches one of the multicast/unicast addresses
registered on the physical port.
In the receive path, if the source MAC is our own MAC and we are neither
in selftest nor in force-LB mode, drop the packet.
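A sketch of the receive-path check (the flag names are illustrative; the
driver compares the MAC bytes as a u64 in practice):

	/* drop our own looped-back frames unless in selftest or forced-LB */
	if (ether_addr_equal_64bits(ethh->h_source, dev->dev_addr) &&
	    !priv->validate_loopback && !priv->force_loopback)
		goto drop;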
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using vlan 0 and UP 0, the vlan header wasn't placed.
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The device must be in promiscuous mode, or the DMAC must be the same as
the host MAC, or else the packet will be dropped by the HW rx filtering.
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving to a regular Completion Queue implementation (not collapsed).
The completion for each transmitted packet is written to a new entry.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
These files were using the moduleparam infrastructure but were not
including anything for it -- which is fine while module.h is being
implicitly included in all files, but that is going away.
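The fix is just the explicit include, e.g.:

	#include <linux/moduleparam.h>	/* for module_param() and friends */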
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Not updating common counters from the data path.
The checksum counters are per ring; they are summed when collecting
statistics.
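A sketch of the collection-time summation (the per-ring field name is
illustrative):

	/* fold per-ring checksum counters only when stats are read */
	for (i = 0; i < priv->rx_ring_num; i++)
		stats->csum_ok += priv->rx_ring[i].csum_ok;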
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
To ease skb->truesize sanitization, it's better to be able to localize
all references to skb frags size.
Define accessors: skb_frag_size() to fetch the frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.
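Usage of the new accessors:

	skb_frag_t *frag = &skb_shinfo(skb)->frags[0];

	unsigned int sz = skb_frag_size(frag);	/* read */
	skb_frag_size_set(frag, len);		/* write */
	skb_frag_size_add(frag, delta);		/* += */
	skb_frag_size_sub(frag, delta);		/* -= */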
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the Mellanox driver into drivers/net/ethernet/mellanox/ and
make the necessary Kconfig and Makefile changes.
CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>