Allow user to enable loopback feature for individual ports using ethtool.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PPLR register allows configuration of the port's loopback mode.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updates the Mellanox spectrum driver to use the newer intermediate
representation for flow actions in matchall offloads.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When splitting a port, different local ports need to be mapped on different
systems. For example:
SN3700 (local_ports_in_2x=2):
* Without split:
front panel 1 --> local port 1
front panel 2 --> local port 5
* Split to 2:
front panel 1s0 --> local port 1
front panel 1s1 --> local port 3
front panel 2 --> local port 5
SN3800 (local_ports_in_2x=1):
* Without split:
front panel 1 --> local port 1
front panel 2 --> local port 3
* Split to 2:
front panel 1s0 --> local port 1
front panel 1s1 --> local port 2
front panel 2 --> local port 3
The local_ports_in_{1x, 2x} resources provide the offsets from the base
local ports according to which the new local ports can be calculated.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the number of local ports in 4x changed between SPC and SPC-2,
firmware expose new resources that the driver can query.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new version supports two features that are required by upcoming
changes in the driver:
* Querying of new resources allowing port split into two ports on
Spectrum-2 systems
* Querying of number of gearboxes on supported systems such as SN3800
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the state of rep was introduced, it was also designed to prevent
duplicate unloading of the same rep. Considering the following two
flows when an eswitch manager is at switchdev mode with n VF reps loaded.
+--------------------------------------+--------------------------------+
| cpu-0 | cpu-1 |
| -------- | -------- |
| mlx5_ib_remove | mlx5_eswitch_disable_sriov |
| mlx5_ib_unregister_vport_reps | esw_offloads_cleanup |
| mlx5_eswitch_unregister_vport_reps | esw_offloads_unload_all_reps |
| __unload_reps_all_vport | __unload_reps_all_vport |
+--------------------------------------+--------------------------------+
These two flows will try to unload the same rep. Per original design,
once one flow unloads the rep, the state moves to REGISTERED. The 2nd
flow will no longer needs to do the unload and bails out. However, as
read and write of the state is not atomic, when 1st flow is doing the
unload, the state is still LOADED, 2nd flow is able to do the same
unload action. Kernel crash will happen.
To solve this, driver should do atomic test-and-set for the state. So
that only one flow can change the rep state from LOADED to REGISTERED,
and proceed to do the actual unloading.
Since the state is changing to atomic type, all other read/write should
be atomic action as well.
Fixes: f121e0ea95 (net/mlx5: E-Switch, Add state to eswitch vport representors)
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The check of legal vport is to ensure the vport number falls between
0 and total number of vports. Along with the introduction of uplink
rep, enabled vports are not consecutive any more.
Therefore, rely on the eswitch vport getter function to check if it's
a valid vport.
As the getter function relies on eswitch, add the check of vport
group manager and validation the presence of eswitch structure.
Remove the redundant check in the function calls.
Since the vport array will be allocated once eswitch is initialized
and will be kept alive if eswitch presents, no need to protect it with
the state lock.
Fixes: 5ae5162066 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Some functions issue vport commands and access vport array using
vport_index/vport_num interchangeably which is OK for VFs vports.
However, this creates potential bug if those vports are not VFs
(E.g, uplink, sf) where their vport_index don't equal to vport_num.
Prepare code to access mlx5_vport structure using a getter function.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Several functions need to access mlx5_vport and vport_num.
When these functions are called, caller already has mlx5_vport*
available.
Hence pass such mlx5_vport pointer.
This is preparation patch to add error checks to
mlx5_eswitch_get_vport() and to return error status.
By doing so, reduce places where error check of mlx5_eswitch_get_vport()
can be avoided.
While doing such change, mlx5_eswitch_query_vport_drop_stats() gets
corrected to work on vport, instead of vport_idx.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently mlx5_esw_for_each_vf_vport iterates over mlx5_vport entries in
eswitch.c
Same macro in eswitch_offloads.c iterates over vport number in
eswitch_offloads.c
Instead of duplicate macro names, to avoid confusion and to reuse the
same macro in both files, move it to eswitch.h.
To iterate over vport numbers where there is no need to iterate over
mlx5_vport, but only a vport number is needed, rename those macros in
eswitch_offloads.c to mlx5_esw_for_each_vf_num_vport*.
While at it, keep all vport and vport rep iterators together.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5_query_nic_vport_vlans() is not used anymore. Hence remove it.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
CFLAGS_tracepoint.o specifies CFLAGS for compiling tracepoint.c but
it does not exist under drivers/net/ethernet/mellanox/mlx5/core/.
CFLAGS_tracepoint.o is unused.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The same code that returns XDP frames and releases pages is used both in
mlx5e_poll_xdpsq_cq and mlx5e_free_xdpsq_descs. Create a function that
cleans up an MPWQE.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add the support to read additional EEPROM information from high pages.
Information for modules such as SFF-8436 and SFF-8636:
1) Application select table
2) User writable EEPROM
3) Thresholds and alarms
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With unlocked TC it is possible to have spurious deletes and inserts of
same filter. TC layer needs drivers to always return error when flow
insertion failed in order to correctly calculate "in_hw_count" for each
filter. Fix mlx5e_configure_flower() to return -EEXIST when TC tries to
insert a filter that is already provisioned to the driver.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
push on RX path. To workaround that limitation untagged packets are
tagged with VLAN ID 0x000 (priority tag) and pop/push actions are
replaced by VLAN re-write actions (which are supported by the HW).
Replace TC VLAN pop action with a VLAN priority tag header rewrite.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
push on RX path. As a workaround, untagged packets are tagged with
VID 0x000 allowing pop/push actions to be exchanged with VLAN rewrite
actions.
Use the ingress ACL table, preceding the FDB, to push VLAN 0x000 ID tag
for untagged packets and the egress ACL table, succeeding the FDB, to
pop VLAN 0x000 ID tag.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hardware requires that all TIRs that steer traffic to the same RQ
should share identical tunneled_offload_en value.
For that, the tunneled_offload_en bit should be set/unset (according to
the HW capability) for all TIRs', not only the ones dedicated for
tunneled (inner) traffic.
Fixes: 1b223dd391 ("net/mlx5e: Fix checksum handling for non-stripped vlan packets")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Many TIR context settings are common to different TIR types,
take them into a common function.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.
1) From Aya: Enable general events on all physical link types and
restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
driver to ethernet links only as it was intended.
2) From Eli: Introduce low level bits for prio tag mode
3) From Maor: Low level steering updates to support RDMA RX flow
steering and enables RoCE loopback traffic when switchdev is enabled.
4) From Vu and Parav: Two small mlx5 core cleanups
5) From Yevgeny add HW definitions of geneve offloads
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Introduce specification for Geneve decap flow with encapsulation options
and allow creation of rules that are matching on Geneve TLV options.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When in switchdev mode, we would like to treat loopback RoCE
traffic (on eswitch manager) as RDMA and not as regular
Ethernet traffic
In order to enable it we add flow steering rule that forward RoCE
loopback traffic to the HW RoCE filter (by adding allow rule).
In addition we add RoCE address in GID index 0, which will be
set in the RoCE loopback packet.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Flow table supports three types of miss action:
1. Default miss action - go to default miss table according to table.
2. Go to specific table.
3. Switch domain - go to the root table of an alternative steering
table domain.
New table miss action was added - switch_domain.
The next domain for RDMA_RX namespace is the NIC RX domain.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add new flow steering namespace - MLX5_FLOW_NAMESPACE_RDMA_RX.
Flow steering rules in this namespace are used to filter
RDMA traffic.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Pass the flow steering objects instead of their attributes
to fs_cmd in order to decrease number of arguments and in
addition it will be used to update object fields.
Pass the flow steering root namespace instead of the device
so will have context to the namespace in the fs_cmd layer.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Open events of type 'GENERAL' to all types of interfaces. Prior to this
patch, 'GENERAL' events were captured only by Ethernet interfaces. Other
interface types (non-Ethernet) were excluded and couldn't receive
'GENERAL' events.
Fixes: 5d3c537f90 ("net/mlx5: Handle event of power detection in the PCIE slot")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The mlx5 Sub-Function (SF) sub device will be introduced in
subsequent patches. It will be created as mediated device and
belong to mdev bus. It is necessary to treat dma operations on
PF, VF and SF in uniform way, hence reduce the dependency on
pdev pci dev struct and work directly out of newly introduced
'struct device' from previous patch.
This patch does not change any functionality.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently mlx5 core stores copy of the PCI device name in a
mlx5_priv structure and uses pr_warn, pr_err helpers.
Get rid of the copy of this name; instead store the parent device
pointer that contains name as well as dma specific parameters.
This also allows to use kernel's well defined dev_warn, dev_err, dev_dbg
device specific print routines.
This is also a preparation patch to access non PCI parent device in
future.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-04-28
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Introduce BPF socket local storage map so that BPF programs can store
private data they associate with a socket (instead of e.g. separate hash
table), from Martin.
2) Add support for bpftool to dump BTF types. This is done through a new
`bpftool btf dump` sub-command, from Andrii.
3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
was currently not supported since skb was used to lookup netns, from Stanislav.
4) Add an opt-in interface for tracepoints to expose a writable context
for attached BPF programs, used here for NBD sockets, from Matt.
5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.
6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
support tunnels such as sit. Add selftests as well, from Willem.
7) Various smaller misc fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing new TIR creation core API which allows caller
to receive back from the call the full command outbox.
This comes as a preparation for the next patch that will
retrieve the TIR ICM address from the command outbox.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlxsw currently does not support v6 gateways with v4 routes. Commit
19a9d136f1 ("ipv4: Flag fib_info with a fib_nh using IPv6 gateway")
prevents a route from being added, but nothing stops the replace or
append. Add a catch for them too.
$ ip ro add 172.16.2.0/24 via 10.99.1.2
$ ip ro replace 172.16.2.0/24 via inet6 fe80::202:ff:fe00:b dev swp1s0
Error: mlxsw_spectrum: IPv6 gateway with IPv4 route is not supported.
$ ip ro append 172.16.2.0/24 via inet6 fe80::202:ff:fe00:b dev swp1s0
Error: mlxsw_spectrum: IPv6 gateway with IPv4 route is not supported.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series includes updates to mlx5e driver RX data path and some
significant XDP RX/TX improvements to overcome/mitigate HW and PCIE
bottlenecks.
From Tariq:
1) Some Enhancements in rq->flags
2) Stabilize RX packet rate (on Striding RQ) with
multiple outstanding UMR posts
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
From Shay:
3) XDP, Inline small packets into the TX MPWQE in XDP xmit flow
Upon high packet rate with multiple CPUs TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch comes to mitigate this problem by moving some workload to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).
When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) would be sent inline within
its WQE TX descrptor (mem-copied), when the hardware tx queue is congested
beyond a pre-defined water-mark.
Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
* Tested with hyper-threading disabled
XDP_TX:
| | before | after | |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring | 12Mpps | 12Mpps | same |
XDP_REDIRECT:
** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.
| | before | after | |
| 32 rings | 64Mpps | 92Mpps | +43% |
| 1 ring | 6.4Mpps | 6.4Mpps | same |
As we can see, feature significantly improves scaling, without
hurting single ring performance.
From Maxim:
4) Some trivial refactoring and code improvements prior to a larger series
to support AF_XDP.
-Saeed.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcv2LjAAoJEEg/ir3gV/o+90gIAI8+4lwkXZAVk4mxf9PMjxuB
bQiKd80e++26sgrNHCyuWZnIzTQqYAnUJ3WRC+Kk1pFTo1O23A+fvweT8m1dqAvP
Z/5ktfbAeF3fwOVu7aGu9vh4zJEWJj8oO+I+G+OaOe2iV7FVTTFnWHxiiCfungAW
oUnXozq4vERSQLechqqgz6nACxOPgEOCJrp4T9lDYSbqZizHgFttmInMQguq/7KS
LvITcNu3EF5l4y2LxwCFiKRgGc2y/belU63AK+2pQUXhH46kQPEHdncdLg5d9QYA
xJwthn697qxS0PIP5oHPHNVN+qJXfuUHVonXqVOAJebGQnV82of6+sPweRxwh1s=
=MfAR
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-04-22
This series includes updates to mlx5e driver RX data path and some
significant XDP RX/TX improvements to overcome/mitigate HW and PCIE
bottlenecks.
From Tariq:
1) Some Enhancements in rq->flags
2) Stabilize RX packet rate (on Striding RQ) with
multiple outstanding UMR posts
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
From Shay:
3) XDP, Inline small packets into the TX MPWQE in XDP xmit flow
Upon high packet rate with multiple CPUs TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch comes to mitigate this problem by moving some workload to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).
When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) would be sent inline within
its WQE TX descrptor (mem-copied), when the hardware tx queue is congested
beyond a pre-defined water-mark.
Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
* Tested with hyper-threading disabled
XDP_TX:
| | before | after | |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring | 12Mpps | 12Mpps | same |
XDP_REDIRECT:
** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.
| | before | after | |
| 32 rings | 64Mpps | 92Mpps | +43% |
| 1 ring | 6.4Mpps | 6.4Mpps | same |
As we can see, feature significantly improves scaling, without
hurting single ring performance.
From Maxim:
4) Some trivial refactoring and code improvements prior to a larger series
to support AF_XDP.
====================
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a #define for the timeout of mlx5e_wait_for_min_rx_wqes to
clarify the meaning of a magic number.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Remove the no longer used page_reuse stat of RQs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5e_trigger_irq posts a NOP to the ICO SQ just to trigger an IRQ and
enter the NAPI poll on the right CPU according to the affinity. Use it
in mlx5e_activate_rq.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5e_mpwqe_get_log_rq_size calculates the number of WQEs (N) based on
the requested number of frames in the RQ (F) and the number of packets
per WQE (P). It ensures that N is not less than the minimum number of
WQEs in an RQ (N_min). Arithmetically, it means that F / P >= N_min
should be true. This function deals with logarithms, so it should check
that log(F) - log(P) >= log(N_min). However, if F < P, this expression
will cause an unsigned underflow. Check log(F) >= log(P) + log(N_min)
instead.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This commit moves the parameter calculation functions to a separate file
for better modularity and code sharing with future features.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
If the channels fail to reopen after setting an XDP program, return the
error code instead of 0. A proper fix is still needed, as now any error
while reopening the channels brings the interface down. This patch only
adds error reporting.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Upon high packet rate with multiple CPUs TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch comes to mitigate this problem by moving some workload to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).
When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) would be sent inline within
its WQE TX descrptor (mem-copied), when the hardware tx queue is congested
beyond a pre-defined water-mark.
This is added to better utilize the HW resources (which now makes
one less packet data prefetch) and allow better scalability, on the
account of CPU usage (which now 'memcpy's the packet into the WQE).
To load balance between HW and CPU and get max packet rate, we use
watermarks to detect how much the HW is congested and move the work
loads back and forth between HW and CPU.
Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
* Tested with hyper-threading disabled
XDP_TX:
| | before | after | |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring | 12Mpps | 12Mpps | same |
XDP_REDIRECT:
** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.
| | before | after | |
| 32 rings | 64Mpps | 92Mpps | +43% |
| 1 ring | 6.4Mpps | 6.4Mpps | same |
As we can see, feature significantly improves scaling, without
hurting single ring performance.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This counter tracks how many TX MPWQE sessions are started in XDP SQ
in XDP TX/REDIRECT flow. It counts per-channel and global stats.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The XDP redirect flush indication belongs to the receive queue,
not to its XDP send queue.
For this, use a new bit on rq->flags.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Values in enum mlx5e_rq_flag are used as bit indixes.
Intention was to use them with no BIT(i) wrapping.
No functional bug fix here, as the same (shifted)flag bit
is used for all set, test, and clear operations.
Fixes: 121e892754 ("net/mlx5e: Refactor RQ XDP_TX indication")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Update all users of eth_get_headlen to pass network device, fetch
network namespace from it and pass it down to the flow dissector.
This commit is a noop until administrator inserts BPF flow dissector
program.
Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Switch the CPU port to use the new dedicated egress pool instead the
previously used egress pool which was shared with normal front panel
ports.
Add per-port quotas for the amount of traffic that can be buffered for
the CPU port and also adjust the per-{port, TC} quotas.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>