As of current design, in each NAPI, only a single UMR WQE
completion could be available in the completion queue of the
the internal control operations (ICO) send queue, in addition
to nop operations that require no actions upon completion.
This renders the consume index obsolete, as the wqe_counter
field in CQE is sufficient.
This helps removing a memory barrier, and obsoletes the need
for tracking the num_wqebbs to update the consumer counter.
In addition, remove other unused fields in icosq struct:
pdev, dma_fifo_pc, and prev_cc.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Separate the RX post WQEs function of the different RQ types.
This enables RQ type-specific optimizations in data-path.
Poll the ICOSQ completion queue only for Striding RQ,
and only when a UMR post completion could be possibly available.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The indication for a UMR WQE in progress is needed only within
the NAPI context, and hence no races possible and no need for
the use of atomic operations.
The only place the flag is read outside of NAPI context is
in closure flow, after RQ is disabled flag is no more accessed
in NAPI.
Use a boolean instead of a bit in ring state, so that its
non-atomic set operations do not race with the atomic sets of
the other bits.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Rings enabled state change occurs in control path only, and is always
followed by a napi_sychronize(), so that following NAPIs read the
new value. This read does not need to be atomic.
The RQ auto-moderation bit is not set/cleared in data-path.
No need for atomic read, a regular read operation is sufficient.
In RQ creation time as well, there's no multiple threads trying
to access it yet, hence a regular read can be used.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Refactor function mlx5e_lro_update_hdr() to reduce number of
branches.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
NAPI context handles different kinds of completion queues
(RX, TX, and others). Hence, upon a poll trial, some of them
might be empty.
Here we early-return upon empty completion queues, as well as
full rx buffer, and save unnecessary logic and memory barriers.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
If a UMR post is in progress, it means that there's a missing
WQE in RQ, and that a completion will be shortly available in
ICO SQ completion queue. Prefer busy-poll to handle it as soon
as possible.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The dma offset of a MPWQE (Multi-Packet WQE) in memory region
is fixed for all rounds. Calculate it once on creation time,
instead of in runtime. This also obsoletes the wqe argument in
the function.
In addition, optimize dma_info iterator calculation.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In RX data-path, use memset() instead of loop assignment
to init the whole skbs_frags array.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Field is used only locally within the RQ create function.
The use of a local variable is sufficient.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In RX data-path, use shift operations instead of a regular multiplication
by stride size, as it is a power of two.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Bring fast-path fields together, and combine RX WQE mutual
exclusive fields into a union.
Page-reuse and XDP are mutually exclusive and cannot be used at
the same time.
Use a union to combine their footprints.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the abort mechanism is invoked a default route directing packets to
the CPU is programmed in all the virtual routers currently in use. This
can result in packet loss in case a new VRF is configured.
Upon abort, program the default route in all virtual routers, whether
they are in use or not.
The patch is directed at net-next since post-abort fixes aren't critical
and packet loss due to a missing default route will be insignificant
compared to packet loss caused by the CPU port policer.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I relied on the fact that anycast routes use the loopback device as
their nexthop device to trap packets hitting them to the CPU.
After commit 4832c30d54 ("net: ipv6: put host and anycast routes on
device with address") this is no longer the case and such routes are
programmed with a forward action (note the 'offload' flag):
anycast cafe:: dev enp3s0np7 proto kernel metric 0 offload pref medium
This will prevent the router from locally receiving packets destined to
the Subnet-Router anycast address.
Fix this by specifically programming anycast routes with action trap,
which results in the following output:
anycast cafe:: dev enp3s0np7 proto kernel metric 0 pref medium
Fixes: 4832c30d54 ("net: ipv6: put host and anycast routes on device with address")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
device in case a port is enslaved to a master netdev such as bridge or
bond.
Since the driver ignores events unrelated to its ports and their
uppers, it's possible to engineer situations in which the device's data
path differs from the kernel's.
One example to such a situation is when a port is enslaved to a bond
that is already enslaved to a bridge. When the bond was enslaved the
driver ignored the event - as the bond wasn't one of its uppers - and
therefore a bridge port instance isn't created in the device.
Until such configurations are supported forbid them by checking that the
upper device doesn't have uppers of its own.
Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for controlling IPv6 neighbor counters via dpipe.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for setting counters on IPv6 neighbors based on dpipe's host6
table counter status.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for IPv6 host table dump.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the host entry filler helper to be applicable for both IPv4/6
addresses.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helper for accessing destination IP in case of IPv6 neighbor.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IPv6 host table initial support. The action behavior for both IPv4/6
tables is the same, thus the same action dump op is used. Neighbors with
link local address are ignored.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neighbors with link local addresses are not offloaded to the host table,
yet, the are maintained in the driver for adjacency table usage. When
dumping the IPv6 host neighbors this link local neighbors should be
ignored. This patch exports this helper for dpipe usage.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new flow table and indirect TIRs which are used to hash the
inner packet headers of GRE tunneled packets.
When a GRE tunneled packet is received, the TTC flow table will match
the new IPv4/6->GRE rules which will forward it to the inner TTC table.
The inner TTC is similar to its counterpart outer TTC table, but
matching the inner packet headers instead of the outer ones (and does
not include the new IPv4/6->GRE rules).
The new rules will not add steering hops since they are added to an
already existing flow group which will be matched regardless of this
patch. Non GRE traffic will not be affected.
The inner flow table will forward the packet to inner indirect TIRs
which hash the inner packet and thus result in RSS for the tunneled
packets.
Testing 8 TCP streams bandwidth over GRE:
System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Before: 21.3 Gbps (Single RQ)
Now : 90.5 Gbps (RSS spread on 8 RQs)
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add TX offloads support for GRE tunneled packets by reporting the needed
netdev features.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This change adds the ability for flow steering to classify IPv4/6
packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP
packets and hit IPv4/6 classification steering rules.
Since IP packets with MPLS tag header have MPLS ethertype, they
missed the IPv4/6 ethertype rule and ended up hitting the default
filter forwarding all the packets to the same single RQ (No RSS).
Since our device is able to look past the MPLS tag and identify the
next protocol we introduce this solution which replaces ethertype
matching by the device's capability to perform IP version
identification and matching in order to distinguish between IPv4 and
IPv6.
Therefore, when driver is performing flow steering configuration on the
device it will use IP version matching in IP classified rules instead
of ethertype matching which will cause relevant MPLS tagged packets to
hit this rule as well.
If the device doesn't support IP version matching the driver will fall back
to use legacy ethertype matching in the steering as before.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
cq_period_mode assignment was mistakenly removed so it was always set to "0",
which is EQE based moderation, regardless of the device CAPs and
requested value in ethtool.
Fixes: 6a9764efb2 ("net/mlx5e: Isolate open_channels from priv->params")
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fix inline header size, make sure it is not greater than skb len.
This bug effects small packets, for example L2 packets with size < 18.
Fixes: ae76715d15 ("net/mlx5e: Check the minimum inline header mode before xmit")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When changing from switchdev to legacy mode, all the representor port
devices (uplink nic and reps) are cleaned up. Part of this cleaning
process is removing the neigh entries and the hash table containing them.
However, a representor neigh entry might be linked to the uplink port
hash table and if the uplink nic is cleaned first the cleaning of the
representor will end up in null deref.
Fix that by unloading the representors in the opposite order of load.
Fixes: cb67b83292 ("net/mlx5e: Introduce SRIOV VF representors")
Signed-off-by: Shahar Klein <shahark@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently if vxlan tunnel ipv6 src isn't supplied the driver fails to
resolve it as part of the route lookup. The resulting encap header
is left with a zeroed out ipv6 src address so the packets are sent
with this src ip.
Use an appropriate route lookup API that also resolves the source
ipv6 address if it's not supplied.
Fixes: ce99f6b97f ('net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently, increasing the number of combined channels is changing
the RSS spread to use the new created channels.
Prevent the RSS spread change in case the user explicitly declare it,
to avoid overriding user configuration.
Tested:
when RSS default:
# ethtool -L ens8 combined 4
RSS spread will change and point to 4 channels.
# ethtool -X ens8 equal 4
# ethtool -L ens8 combined 6
RSS will not change after increasing the number of the channels.
Fixes: 8bf3686204 ('ethtool: ensure channel counts are within bounds during SCHANNELS')
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Function mlx5e_dealloc_rx_wqe is using page pointer value as an
indication to valid DMA mapping. In case that the mapping failed, we
released the page but kept the dangling pointer. Store the page pointer
only after the DMA mapping passed to avoid invalid page DMA unmap.
Fixes: bc77b240b3 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
MLX5_INTERFACE_STATE_SHUTDOWN is not used in the code.
Fixes: 5fc7197d3a ("net/mlx5: Add pci shutdown callback")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
There is an issue where the firmware fails during mlx5_load_one,
the health_care timer detects the issue and schedules a health_care call.
Then the mlx5_load_one detects the issue, cleans up and quits. Then
the health_care starts and calls mlx5_unload_one to clean up the resources
that no longer exist and causes kernel panic.
The root cause is that the bit MLX5_INTERFACE_STATE_DOWN is not set
after mlx5_load_one fails. The solution is removing the bit
MLX5_INTERFACE_STATE_DOWN and quit mlx5_unload_one if the
bit MLX5_INTERFACE_STATE_UP is not set. The bit MLX5_INTERFACE_STATE_DOWN
is redundant and we can use MLX5_INTERFACE_STATE_UP instead.
Fixes: 5fc7197d3a ("net/mlx5: Add pci shutdown callback")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Support for ISSI version 0 was recently broken as the arm_srq_cmd
command, which is used only for ISSI version 0, was given the opcode
for ISSI version 1 instead of ISSI version 0.
Change arm_srq_cmd to use the correct command opcode for ISSI version
0.
Fixes: af1ba291c5 ('{net, IB}/mlx5: Refactor internal SRQ API')
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current code doesn't report DCB_CAP_DCBX_HOST capability when query
through getcap. User space lldptool expects capability to have HOST mode
set when it wants to configure DCBX CEE mode. In absence of HOST mode
capability, lldptool fails to switch to CEE mode.
This fix returns DCB_CAP_DCBX_HOST capability when port's DCBX
controlled mode is under software control.
Fixes: 3a6a931dfb ("net/mlx5e: Support DCBX CEE API")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
qos capability is the master capability bit that determines
if the DCBX is supported for the PCI function. If this bit is off,
driver cannot run any dcbx code.
Fixes: e207b7e991 ("net/mlx5e: ConnectX-4 firmware support for DCBX")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Adding support for updating the FW on new port mac, when port mac change
is requested by the user. This info is required by the FW as OEM
management tools require this info directly from the NIC FW.
Check device capability bit to verify the FW supports user mac.
If the FW does support it, use set_port command to notify the FW on the
new mac.
The feature is relevant only to PF port mac.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When changing the sizeof style usage in the patch cited below,
one brackets misplacement was introduced. Here we fix it.
Fixes: 31975e27a4 ("mlx4: sizeof style usage")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "lg" variable is declared as int so in all places where this variable
is used as a shift operand, the output will be int too.
This produces the following smatch warning:
drivers/net/ethernet/mellanox/mlx4/fw.c:1532 mlx4_map_cmd() warn:
should '1 << lg' be a 64 bit type?
Simple declaration of "1" to be "1ULL" will fix the issue.
Fixes: 225c7b1fee ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid temporary large structs on the stack,
allocate them dynamically.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tal Alon <talal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to new XRQ(eXtended shared Receive Queue)
hardware object. It supports SRQ semantics with addition
of extended receive buffers topologies and offloads.
Currently supports tag matching topology and rendezvouz offload.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During the neighbor traversal the neighbors from different families
should be ignored.
Fixes: c58035a74aba ("mlxsw: spectrum_dpipe: Add support for IPv4 host table dump")
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Makes no sense to have dpipe compiled in when devlink is not enabled,
because the devlink dpipe registation is noop function. So don't compile
it in. This also fixes missing extern structs errors.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: a86f030915 ("mlxsw: spectrum_dpipe: Add support for IPv4 host table dump")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series includes updates to mlx5 core driver.
From Gal and Saeed, three cleanup patches.
From Matan, Low level flow steering improvements and optimizations,
- Use more efficient data structures for flow steering objects handling.
- Add tracepoints to flow steering operations.
- Overall these patches improve flow steering rule insertion rate by a
factor of seven in large scales (~50K rules or more).
-Saeed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZntEwAAoJEEg/ir3gV/o+ghgIAJ5UBPWvZspnbQJHBopsJh47
d4qt4HrcxxoA07d7QflGSmzqqvoX87eo6mVMQ/WkB+0D8KxggXYr75EOk4lQeYYo
kiZ+4GdR6UaeQMhykcThKUyEpv60/8wLmXaHvhdWOaVsmzAFwQK0u5HGJlW14lzx
LHvJGWG377zu+SdpR6wNDrwaHhk2B4Azqb5bomiGTPCg1RdZv3i37/hbF00X9GHB
ZzPg3Mc5RQvF1fu9H35x4f15pturmMbtuGzmR2oKHMmNS2XQd6lFFlXfQxVUxtdg
hvAj7RYFrmY1fAPp9cMZbB5ibKkFUFE6idebfrTIrVQrbxv9o0nwRvZTB4lbe9U=
=hpBO
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2017-08-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-08-24
This series includes updates to mlx5 core driver.
From Gal and Saeed, three cleanup patches.
From Matan, Low level flow steering improvements and optimizations,
- Use more efficient data structures for flow steering objects handling.
- Add tracepoints to flow steering operations.
- Overall these patches improve flow steering rule insertion rate by a
factor of seven in large scales (~50K rules or more).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Make this const as it is only passed as an argument to the function
mlx5e_create_netdev and the corresponding argument is of type const.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make these const as they are only used in a copy operation.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for controlling neighbor counters via dpipe.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for IPv4 host table dump.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for setting counters on neighbors based on dpipe's host table
counter status. This patch also adds the ability for getting the counter
value, which will be used by the dpipe host table implementation in the
next patches.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done as a preparation before introducing support for neighbor
counters. The flow counter's type enum is used by many registers, yet,
until now it was used only by mgpc and thus it was private. This patch
updates the namespace for more generic usage.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change label name for case of erif table init failure.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done as a preparation before introducing the ability to dump the
host table via dpipe, and to count the table size. The mlxsw's neighbor
representative struct stays private to the router module.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The entry clear routine can be shared between the drivers, thus it is
moved inside devlink.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now the dpipe table's size was static and known at registration
time. The host table does not have constant size and it is resized in
dynamic manner. In order to support this behavior the size is changed
to be obtained dynamically via an op.
This patch also adjust the current dpipe table for the new API.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix ERIF's table operations name space.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding a flow table entry (fte) to a flow table (ft), we first
need to find its flow group (fg). Currently, this is done by
traversing a linear list of all flow groups in the flow table.
Furthermore, since multiple flow groups which correspond to the same
fte mask may exist in the same ft, we can't just stop at the first
match. Converting the linear list to rhltable in order to speed things
up.
The last four patches increases the steering rules update rate by a
factor of more than 7 (for insertion of 50K steering rules).
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When adding a flow table entry (fte) to a flow group (fg), we first
need to check whether this fte exist. In such a case we just merge
the destinations (if possible). Currently, this is done by traversing
the fte list available in a fg. This could take a lot of time when
using large flow groups. Speeding this up by using rhashtable, which
is much faster.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The current code stores fte_match_param in the software representation
of FTEs and FGs. fte_match_param contains a large reserved area at the
bottom of the struct. Since downstream patches are going to hash this
part, we would like to avoid doing so on a reserved part.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When allocating a flow table entry, we need to allocate a free index
in the flow group. Currently, this is done by traversing the existing
flow table entries in the flow group, until a free index is found.
Replacing this by using a ida, which allows us to find a free index
much faster.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fix the following checkpatch warning in en_ethtool.c:
WARNING: suspect code indent for conditional statements (8, 9)
+ for (i = 0; i < NUM_PCIE_PERF_STALL_COUNTERS(priv); i++)
+ strcpy(data + (idx++) * ETH_GSTRING_LEN,
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The blank line should be after u32 val = ...
and not after __be32 __iomem *addr = ...
Fixes: ad5b39a95c ("net/mlx5: Add a blank line after declarations")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Joe Perches <joe@perches.com>
If action is gact goto_chain, offload it to HW by jumping to another
ruleset.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to lookup ruleset in order to offload goto_chain termination
action. This patch adds it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For goto_chain action we need to know group_id of a ruleset to jump to.
Provide infrastructure in order to get it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reflect chain index coming down from TC core and create a ruleset per
chain. Note that only chain 0, being the implicit chain, is bound to the
device for processing. The rest of chains have to be "jumped-to" by
actions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the value of the mrouter flag in struct mlxsw_sp_bridge_port when
it is being changed.
Fixes: c57529e1d5 ("mlxsw: spectrum: Replace vPorts with Port-VLAN")
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PCI pool API is deprecated. This commit replaces the PCI pool old
API by the appropriate function with the DMA pool API.
Signed-off-by: Romain Perier <romain.perier@collabora.com>
Reviewed-by: Peter Senna Tschudin <peter.senna@collabora.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The PCI pool API is deprecated. This commit replaces the PCI pool old
API by the appropriate function with the DMA pool API.
Signed-off-by: Romain Perier <romain.perier@collabora.com>
Acked-by: Peter Senna Tschudin <peter.senna@collabora.com>
Tested-by: Peter Senna Tschudin <peter.senna@collabora.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The byte offset of counter descriptors should be stored in size_t variable
instead of an integer.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To fix these checkpatch complaints:
WARNING: Comparisons should place the constant on the right side of the test
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To fix these checkpatch complaints:
CHECK: Please don't use multiple blank lines
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To fix these checkpatch complaints:
WARNING: Missing a blank line after declarations
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To fix these checkpatch complaints:
CHECK: Blank lines aren't necessary after an open brace '{'
CHECK: Blank lines aren't necessary before a close brace '}'
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add outbound_pci_buffer_overflow to ethtool output for monitoring the
number of packets that were dropped due to lack of PCIe buffers on
receive path from NIC port toward the host(s).
This counter is valid only in case that tx_overflow_buffer_pkt is
supported in MCAM enhanced features.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
rx_buffer_passed_thres_phy - The number of events where the port RX
buffer has passed a fullness threshold.
rx_buffer_full_phy - The number of events where the port RX buffer has
reached 100% fullness.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
outbound_pci_stalled_rd - The percentage of time within the last second
that the NIC had outbound non-posted read requests but could not perform
the operation due to insufficient non-posted credits.
outbound_pci_stalled_wr - The percentage of time within the
last second that the NIC had outbound posted writes requests but could
not perform the operation due to insufficient posted credits.
outbound_pci_stalled_rd_events - The number of events where
outbound_pci_stalled_rd was above the threshold.
outbound_pci_stalled_wr_events - The number of events where
outbound_pci_stalled_wr was above the threshold.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add support for "ethtool DEVNAME" over ipoib ports,
Display standard port information for IPoIB netdevices using ethtool
For example:
$ ethtool ib2
> Settings for ib2:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Speed: 100000Mb/s
Duplex: Full
Port: Other
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Link detected: yes
Signed-off-by: Shalom Lagziel <shaloml@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Printing an enhanced IPoIB device information using
"ethtool -i DEVNAME", prints the low level driver name: mlx5_core.
This commit changes the name to mlx5_core [ib_ipoib], to include the
ipoib device driver infromation.
Fixes: 076b0936e5 ("net/mlx5e: IPoIB, Add ethtool support")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Upon interface up/down, driver will send PAOS (Ports Administrative and
Operational Status Register) in order to inform the Firmware on the
desired status of the port by the driver.
Since now we might change physical link status on mlx5e_open/close,
logical VF representor should not use mlx5e_open/close ndos as is, and
should call the logical version mlx5e_open/closed_locked.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently, if vport is zero then then an uninialized return status
in err is returned. Since the only return status at the end of the
function esw_add_uc_addr is zero for the current set of return paths
we may as well just return 0 rather than err to fix this issue.
Detected by CoverityScan, CID#1452698 ("Uninitialized scalar variable")
Fixes: eeb66cdb68 ("net/mlx5: Separate between E-Switch and MPFS")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
enable_4k_uar module parameter was added in patch cited below to
address the backward compatibility issue in SRIOV when the VM has
system's PAGE_SIZE uar implementation and the Hypervisor has 4k uar
implementation.
The above compatibility issue does not exist in the non SRIOV case.
In this patch, we always enable 4k uar implementation if SRIOV
is not enabled on mlx4's supported cards.
Fixes: 76e39ccf9c ("net/mlx4_core: Fix backward compatibility on VFs")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offending commit used a newly added helper function.
But the logic is wrong. Without this fix, the affected NICs
can't do HW offload. Error -EOPNOTSUPP will be returned directly.
Fixes: a2e8da9378 ("net/sched: use newly added classid identity helpers")
Signed-off-by: Chris Mi <chrism@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial fix to spelling mistakes in the mlx4 driver
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel coding style is to treat sizeof as a function
(ie. with parenthesis) not as an operator.
Also use kcalloc and kmalloc_array
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I made an embarrassing mistake and used 'IPV6' instead of 'CONFIG_IPV6'
around the function that updates the kernel about IPv6 neighbours
activity. This can be a problem if the kernel has more neighbours than a
certain threshold and it starts deleting those that are supposedly
inactive.
Fixes: b5f3e0d430 ("mlxsw: spectrum_router: Fix build when IPv6 isn't enabled")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 routes currently lack nexthop flags as in IPv4. This has several
implications.
In the forwarding path, it requires us to check the carrier state of the
nexthop device and potentially ignore a linkdown route, instead of
checking for RTNH_F_LINKDOWN.
It also requires capable drivers to use the user facing IPv6-specific
route flags to provide offload indication, instead of using the nexthop
flags as in IPv4.
Add nexthop flags to IPv6 routes in the 40 bytes hole and use it to
provide offload indication instead of the RTF_OFFLOAD flag, which is
removed while it's still not part of any official kernel release.
In the near future we would like to use the field for the
RTNH_F_{LINKDOWN,DEAD} flags, but this change is more involved and might
not be ready in time for the current cycle.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not necessary to manually clear the
device driver data to NULL.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not necessary to manually clear the
device driver data to NULL.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to limited ASIC resources the maximum number of routes is limited by
the nexthop resource. In order to improve the routing scale nexthop
consolidation should be performed.
This patch adds support for IPv6 neighbor consolidation. The hash value
is calculated based on the nexthop set, by performing bitwise xor on the
ifindexs of the nexthops, in a similar way to IPv4's kernel implementation.
In case of collision a full match is performed between the sets which
include address and ifindex comparison.
Non gateway nexthop groups are not inserted to the hash table due to
lack of nexthop device (ifindex).
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does preparation before introducing IPv6 nexthop group
consolidation. Currently the nexthop group hash table is used only by
IPv4 and uses fixed key size. In order to support the IPv6's variable
length key the current table is changed.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mellanox switches (mlxsw) supports I2C systems without PCI, in order to
give the ability to the users to use such functionality, there is need
to update Kconfig.
Signed-off-by: Ohad Oz <ohado@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of LPM trees available for lookup is much smaller than the
number of virtual routers, which are used to implement VRFs. In
addition, an LPM tree can only be used by one protocol - either IPv4 or
IPv6.
Therefore, in order to increase the number of supported virtual routers
to the maximum we need to be able to share LPM trees across virtual
routers instead of trying to find an optimized tree for each.
Do that by allocating one LPM tree for each protocol, but make sure it
will only include prefixes that are actually used, so as to not perform
unnecessary lookups.
Since changing the structure of a bound tree isn't recommended, whenever
a new tree it required, it's first created and then bound to each
virtual router, replacing the old one.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of relying on the LPM tree to be assigned to the virtual router
before binding the two, lets pass it explicitly.
This will later allow us to return upon binding error instead of having
to perform a rollback of the assignment.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in returning a value from function whose return value
is never checked.
Even if the return value was checked, there wouldn't be anything to do
about it, as these functions are either called from error or deletion
paths.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make these structures const as they only stored in the profile field of
a mlxsw_driver structure, which is of type const.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of checking handle, which does not have the inner class
information and drivers wrongly assume clsact->egress as ingress, use
the newly introduced classid identification helpers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Both add new code
include/rdma/ib_verbs.h - Both add new code
Signed-off-by: Doug Ledford <dledford@redhat.com>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
checksum is successful, the driver subtracts the pseudo-header checksum
from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.
V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set
Reported-by: Shuang Li <shuali@redhat.com>
Fixes: f8c6455bb0 ("net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
generic api takes care of spreading affinity similar to
what mlx5 open coded (and even handles better asymmetric
configurations). Ask the generic API to spread affinity
for us, and feed him pre_vectors that do not participate
in affinity settings (which is an improvement to what we
had before).
The affinity assignments should match what mlx5 tried to
do earlier but now we do not set affinity to async, cmd
and pages dedicated vectors.
Also, remove mlx5e_get_cpu and introduce mlx5e_get_node
(used for allocation purposes) and mlx5_get_vector_affinity
(for indirection table construction) as they provide the needed
information. Luckily, we have generic helpers to get cpumask
and node given a irq vector. mlx5_get_vector_affinity will
be used by mlx5_ib in a subsequent patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
mlx5e currently assumes that irq affinity is really spread first
irq vectors across device home node cpus, with the new generic affinity
mappings this is no longer the case, hence mlxe should not rely on
this anymore.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that we have a generic code to allocate an array
of irq vectors and even correctly spread their affinity,
correctly handle cpu hotplug events and more, were much
better off using it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This series includes some mlx5 updates for both net-next and rdma trees.
From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
- E-Switch (Ethernet SRIOV support).
- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.
From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.
From Rabie,
Increase the maximum supported flow counters.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZiDoAAAoJEEg/ir3gV/o+594H/RH5kRwC719s/5YQFJXvGsVC
fjtj3UUJPLrWB8XBh7a4PRcxXPIHaFKJuY3MU7KHFIeZQFklJcit3njjpxDlUINo
F5S1LHBSYBkeMD/ksWBA8OLCBprNGN6WQ2tuFfAjZlQQ44zqv8LJmegoDtW9bGRy
aGAkjUmALEblQsq81y0BQwN2/8DA8HAywrs8L2dkH1LHwijoIeYMZFOtKugv1FbB
ABSKxcU7D/NYw6rsVdZG59fHFQ+eKOspDFqBZrUzfQ+zUU2hFFo96ovfXBfIqYCV
7BtJuKXu2LeGPzFLsuw4h1131iqFT1iSMy9fEhf/4OwaL/KPP/+Umy8vP/XfM+U=
=wCpd
-----END PGP SIGNATURE-----
Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-shared-2017-08-07
This series includes some mlx5 updates for both net-next and rdma trees.
From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
- E-Switch (Ethernet SRIOV support).
- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.
From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.
From Rabie,
Increase the maximum supported flow counters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the return value from -EINVAL to -EOPNOTSUPP. The rest of the
drivers have it like that, so be aligned.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prio is not cls_flower specific, but it is meaningful for all
classifiers. Seems that only mlxsw cares about the value. Obviously,
cls offload in other drivers is broken.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To sync-up with the naming in the rest of the driver, rename the cls arg.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let mlxsw_sp_setup_tc be a splitter for specific setup_tc types and push
out cls_flower and cls_matchall specific codes into separate functions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let mlx5e_rep_setup_tc (former mlx5e_rep_ndo_setup_tc) be a splitter for
specific setup_tc types and push out cls_flower specific code into
a separate function.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let mlx5e_setup_tc (former mlx5e_ndo_setup_tc) be a splitter for specific
setup_tc types and push out cls_flower and mqprio specific codes into
separate functions. Also change the return values so they are the same
as in the rest of the drivers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this is specific to flower now, make it part of the flower offload
struct.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be aligned with the rest of the types, rename
TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Read new NIC capability field which represnts 16 MSBs of the max flow
counters number supported (max_flow_counter_31_16).
Backward compatibility with older firmware is preserved, the modified
driver reads max_flow_counter_31_16 as 0 from the older firmware and
uses up to 64K counters.
Changed flow counter id from 16 bits to 32 bits. Backward compatibility
with older firmware is preserved as we kept the 16 LSBs of the counter
id in place and added 16 MSBs from reserved field.
Changed the background bulk reading of flow counters to work in chunks
of at most 32K counters, to make sure we don't attempt to allocate very
large buffers.
Signed-off-by: Rabie Loulou <rabiel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When mlx5_ib registers itself to mlx5_core as an interface, it will
call mlx5_add_device which will call mlx5_ib interface add callback,
in case the latter successfully returns, only then mlx5_core will add
it to the interface list and async events will be forwarded to mlx5_ib.
Between mlx5_ib interface add callback and mlx5_core adding the mlx5_ib
interface to its devices list, arriving mlx5_core events can be missed
by the new mlx5_ib registering interface.
In other words:
thread 1: mlx5_ib: mlx5_register_interface(dev)
thread 1: mlx5_core: mlx5_add_device(dev)
thread 1: mlx5_core: ctx = dev->add => (mlx5_ib)->mlx5_ib_add
thread 2: mlx5_core_event: **new event arrives, forward to dev_list
thread 1: mlx5_core: add_ctx_to_dev_list(ctx)
/* previous event was missed by the new interface.*/
It is ok to miss events before dev->add (mlx5_ib)->mlx5_ib_add_device
but not after.
We fix this race by accumulating the events that come between the
ib_register_device (inside mlx5_add_device->(dev->add)) till the adding
to the list completes and fire them to the new registering interface
after that.
Fixes: f1ee87fe55 ("net/mlx5: Organize device list API in one place")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Allow to selectively build the driver with or without sriov eswitch, VF
representors and TC offloads.
Also remove the need of two ndo ops structures (sriov & basic)
and keep only one unified ndo ops, compile out VF SRIOV ndos when not
needed (MLX5_ESWITCH=n), and for VF netdev calling those ndos will result
in returning -EPERM.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Jes Sorensen <jsorensen@fb.com>
Cc: kernel-team@fb.com
Multi-Physical Function Switch (MPFs) is required for when multi-PF
configuration is enabled to allow passing user configured unicast MAC
addresses to the requesting PF.
Before this patch eswitch.c used to manage the HW MPFS l2 table,
E-Switch always (regardless of sriov) enabled vport(0) (NIC PF) vport's
contexts update on unicast mac address list changes, to populate the PF's
MPFS L2 table accordingly.
In downstream patch we would like to allow compiling the driver without
E-Switch functionalities, for that we move MPFS l2 table logic out
of eswitch.c into its own file, and provide Kconfig flag (MLX5_MPFS) to
allow compiling out MPFS for those who don't want Multi-PF support.
NIC PF netdevice will now directly update MPFS l2 table via the new MPFS
API. VF netdevice has no access to MPFS L2 table, so E-Switch will remain
responsible of updating its MPFS l2 table on behalf of its VFs.
Due to this change we also don't require enabling vport(0) (PF vport)
unicast mac changes events anymore, for when SRIOV is not enabled.
Which means E-Switch is now activated only on SRIOV activation, and not
required otherwise.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jes Sorensen <jsorensen@fb.com>
Cc: kernel-team@fb.com
Expose MLX5_VPORT_MANAGER macro to check for strict vport manager
E-switch and MPFS (Multi Physical Function Switch) abilities.
VPORT manager must be a PF with an ethernet link and with FW advertised
vport group manager capability
Replace older checks with the new macro and use it where needed in
eswitch.c and mlx5e netdev eswitch related flows.
The same macro will be reused in MPFS separation downstream patch.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Remove redundant call to unregister vport representor in mlx5e_add error
flow.
Hide the representor priv and eswitch internal structures from en_main.c
as preparation step for downstream patches which would allow building
the driver without support for representors and eswitch.
Fixes: 6f08a22c5f ("net/mlx5e: Register/unregister vport representors on interface attach/detach")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Since we are going to allow building the driver without eswitch support,
it would be possible to compile out the sriov netdevice ops struct such
that the basic ops instance will be used for non VF devices too.
Add missing udp tunnel ndos into mlx5e_netdev_ops_basic.
While here, rearrange some ndos in the sriov ops struct and put
vf/eswitch related ndos towards the end of it.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
The rest of the helpers are named tcf_exts_*, so change the name of
the action number helpers to be aligned. While at it, change to inline
functions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each multicast group (MID) stores a bitmap of ports to which a packet
should be forwarded to in case an MDB entry associated with the MID is
hit.
Since the initial introduction of IGMP snooping in commit 3a49b4fde2
("mlxsw: Adding layer 2 multicast support") the driver didn't correctly
free these multicast groups upon ungraceful situations such as the
removal of the upper bridge device or module removal.
The correct way to fix this is to associate each MID with the bridge
ports member in it and then drop the reference in case the bridge port
is destroyed, but this will result in a lot more code and will be fixed
in net-next.
For now, upon module removal, traverse the MID list and release each
one.
Fixes: 3a49b4fde2 ("mlxsw: Adding layer 2 multicast support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some operations in the bridge driver such as MDB deletion are preformed
in an atomic context and thus deferred to a process context by the
switchdev infrastructure.
Therefore, by the time the operation is performed by the underlying
device driver it's possible the bridge port context is already gone.
This is especially true for removal flows, but theoretically can also be
invoked during addition.
Remove the warnings in such situations and return normally.
Fixes: c57529e1d5 ("mlxsw: spectrum: Replace vPorts with Port-VLAN")
Fixes: 3922285d96 ("net: bridge: Add support for offloading port attributes")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now have all the necessary IPv6 infrastructure in place, so stop
ignoring these notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without resorting to ACLs, the device performs route lookup solely based
on the destination IP address.
In case source-specific routing is needed, an error is returned and the
abort mechanism is activated, thus allowing the kernel to take over
forwarding decisions.
Instead of aborting, we can trap specific destination prefixes where
source-specific routes are present, but this will result in a lot more
code that is unlikely to ever be used.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we got a replace event, then the replaced route must exist. If
the route isn't capable of multipath, then replace first matching
non-multipath capable route.
If the route is capable of multipath and matching multipath capable
route is found, then replace it. Otherwise, replace first matching
non-multipath capable route.
The new route is inserted before the replaced one. In case the replaced
route is currently offloaded, then it's overwritten in the device's table
by the new route and later deleted, thus not impacting routed traffic.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow directly connected and remote unicast IPv6 routes to be programmed
to the device's tables.
As with IPv4, identical routes - sharing the same destination prefix -
are ordered in a FIB node according to their table ID and then the
metric. While the kernel doesn't share the same trie for the local and
main table, this does happen in the device, so ordering according to
table ID is needed.
Since individual nexthops can be added and deleted in IPv6, each FIB
entry stores a linked list of the rt6_info structs it represents. Upon
the addition or deletion of a nexthop, a new nexthop group is allocated
according to the new configuration and the old one is destroyed.
Identical groups aren't currently consolidated, but will be in a
follow-up patchset.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only allow FIB offload in the presence of default rules or an l3mdev
rule. In a similar fashion to IPv4 FIB rules, sanitize IPv6 rules.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB notification block currently only handles IPv4 events, but we
want to start handling IPv6 events soon, so lay the groundwork now.
Do that by preparing the work item and process it according to the
notified address family.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're about to add IPv6 notifications in the FIB notification chain, but
the driver currently doesn't support these, so ignore them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB notification chain is currently soley used by IPv4 code.
However, we're going to introduce IPv6 FIB offload support, which
requires these notification as well.
As explained in commit c3852ef7f2 ("ipv4: fib: Replay events when
registering FIB notifier"), upon registration to the chain, the callee
receives a full dump of the FIB tables and rules by traversing all the
net namespaces. The integrity of the dump is ensured by a per-namespace
sequence counter that is incremented whenever a change to the tables or
rules occurs.
In order to allow more address families to use the chain, each family is
expected to register its fib_notifier_ops in its pernet init. These
operations allow the common code to read the family's sequence counter
as well as dump its tables and rules in the given net namespace.
Additionally, a 'family' parameter is added to sent notifications, so
that listeners could distinguish between the different families.
Implement the common code that allows listeners to register to the chain
and for address families to register their fib_notifier_ops. Subsequent
patches will implement these operations in IPv6.
In the future, ipmr and ip6mr will be extended to provide these
notifications as well.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we provide offload indication using the nexthop's flags we must
refresh the offload indication whenever the offload state within the
group changes.
This didn't matter until now, as offload indication was provided using
the FIB info flags and multipath routes were marked as offloaded as long
as one of the nexthops was offloaded.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch removed the reliance on the counter in the FIB info to
set the offload indication, so we no longer need to keep an offload
state on each FIB entry and can just set or unset the RTNH_F_OFFLOAD
flag in each nexthop.
This is also necessary because we're going to need to refresh the
offload indication whenever the nexthop group associated with the FIB
entry is refreshed. Current check would prevent us from marking a newly
resolved nexthop as offloaded if the FIB entry is already marked as
offloaded.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a similar fashion to previous patch, use the nexthop flags to provide
offload indication instead of the FIB info's flags.
In case a nexthop in a multipath route can't be offloaded (gateway's MAC
can't be resolved, for example), then its offload flag isn't set.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'trans->tid' is only assigned later in the function, resulting in a zero
transaction ID. Use 'tid' instead.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cited commit introduced the following new enum value in file
include/linux/mlx4/device.h:
QUERY_DEV_CAP_DIAG_RPRT_PER_PORT
However, it failed to introduce a corresponding entry in function
dump_dev_cap_flags2() for outputting a line in the message log
when this capability bit is set.
The change here fixes that omission.
Fixes: c7c122ed67 ("net/mlx4: Add diagnostic counters capability bit")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cited commit introduced the following new enum value in file
include/linux/mlx4/device.h:
MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP
However the value of MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP needs to stay
consistent with the value used in another namespace in
function dump_dev_cap_flags2(), which is manually kept in sync.
The change here restores that consistency.
Fixes: 7c3d21c815 ("net/mlx4_core: Preparation for VF vlan protocol 802.1ad")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The index value in function dump_dev_cap_flags2() for outputting
"sl to vl mapping table change event support" needs to be
consistent with the value of the enumerated constant
MLX4_DEV_CAP_FLAG2_SL_TO_VL_CHANGE_EVENT defined in file
include/linux/mlx4_device.h
The change here restores that consistency.
Fixes: fd10ed8e6f ("IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when WoL is supported but disabled, ethtool reports:
"Supports Wake-on: d".
Fix the indication of Wol support, so that the indication
remains "g" all the time if the NIC supports WoL.
Tested:
As accepted, when NIC supports WoL- ethtool reports:
Supports Wake-on: g
Wake-on: d
when NIC doesn't support WoL- ethtool reports:
Supports Wake-on: d
Wake-on: d
Fixes: 14c07b1358 ("mlx4: Wake on LAN support")
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>
Express the same logic more succinctly.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer logical operator that expresses the intent to bitwise one that
happens to give the same result.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This renames IP2ME-specific registers reg_ralue_v and
reg_ralue_tunnel_ptr to reg_ralue_ip2me_*.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comments really belong to the individual enumerators. The comment
at the register should instead reference the enum.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding ethtool steering rule with action DISCARD we wrongly
pass a NULL dest with dest_num 1 to mlx5_add_flow_rules().
What this error seems to have caused is sending VPORT 0
(MLX5_FLOW_DESTINATION_TYPE_VPORT) as the fte dest instead of no dests.
We have fte action correctly set to DROP so it might been ignored
anyways.
To reproduce use:
# sudo ethtool --config-nfc <dev> flow-type ether \
dst aa:bb:cc:dd:ee:ff action -1
Fixes: 74491de937 ("net/mlx5: Add multi dest support")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This is done in order to ensure that work will not run after the cleanup.
Fixes: ef9814deaf ('net/mlx5e: Add HW timestamping (TS) support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The overflow_period is calculated in seconds. In order to use it
for delayed work scheduling translation to jiffies is needed.
Fixes: ef9814deaf ('net/mlx5e: Add HW timestamping (TS) support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add the missing option to enable the PTP_CLK_PPS function.
In this case pin should be configured as 1PPS IN first and
then it will be connected to PPS mechanism.
Events will be reported as PTP_CLOCK_PPSUSR events to relevant sysfs.
Fixes: ee7f12205a ('net/mlx5e: Implement 1PPS support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In order to fix the drift in 1PPS out need to adjust the next pulse.
On each 1PPS out falling edge driver gets the event, then the event
handler adjusts the next pulse starting time.
Fixes: ee7f12205a ('net/mlx5e: Implement 1PPS support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Need to disable the MTPPS and unsubscribe from the pulse events
when user disables the 1PPS functionality.
Fixes: ee7f12205a ('net/mlx5e: Implement 1PPS support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In order to mark relevant fields while setting the MTPPS register
add field select. Otherwise it can cause a misconfiguration in
firmware.
Fixes: ee7f12205a ('net/mlx5e: Implement 1PPS support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
outer_header_zero() routine checks if the outer_headers match of a
flow-table entry are all zero.
This function uses the size of whole fte_match_param, instead of just
the outer_headers member, causing failure to detect all-zeros if
any other members of the fte_match_param are non-zero.
Use the correct size for zero check.
Fixes: 6dc6071cfc ("net/mlx5e: Add ethtool flow steering support")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
On interface remove, the clean-up was done incorrectly causing
an error in the log:
"SET_FLOW_TABLE_ROOT(0x92f) op_mod(0x0) failed...syndrome (0x7e9f14)"
This was caused by the following flow:
-ndo_uninit:
Move QP state to RST (this disconnects the QP from FT),
the QP cannot be attached to any FT unless it is in RTS.
-mlx5_rdma_netdev_free:
cleanup_rx: Destroy FT
cleanup_tx: Destroy QP and remove QPN from FT
This caused a problem when destroying current FT we tried to
re-attach the QP to the next FT which is not needed.
The correct flow is:
-mlx5_rdma_netdev_free:
cleanup_rx: remove QPN from FT & Destroy FT
cleanup_tx: Destroy QP
Fixes: 508541146a ("net/mlx5: Use underlay QPN from the root name space")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When driver fail to allocate an entry to send command to FW, it must
notify the calling function and release the memory allocated for
this command.
Fixes: e126ba97db ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Completion on timeout should not free the driver command entry structure
as it will need to access it again once real completion event from FW
will occur.
Fixes: 73dd3a4839 ('net/mlx5: Avoid using pending command interface slots')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The tx_enabled lag event field is used to determine whether a slave is
active.
Current logic uses this value only if the mode is active-backup.
However, LACP mode, although considered a load balancing mode, can mark
a slave as inactive in certain situations (e.g., LACP timeout).
This fix takes the tx_enabled value into account when remapping, with
no respect to the LAG mode (this should not affect the behavior in XOR
mode, since in this mode both slaves are marked as active).
Fixes: 7907f23adc (net/mlx5: Implement RoCE LAG feature)
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Upon sriov enable, eswitch is always enabled.
Currently, if enable hca failed over all VFs, we would skip eswitch
disable as part of sriov disable, which will lead to resources leak.
Fix it by disabling eswitch if it was enabled (use indication from
eswitch mode).
Fixes: 6b6adee3da ('net/mlx5: SRIOV core code refactoring')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The function mlx4_en_get_profile always return zero. So it is not
necessary to check its return value.
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPv6 isn't enabled the following error is generated:
ERROR: "nd_tbl" [drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko]
undefined!
Fix it by replacing 'arp_tbl' and 'nd_tbl' with 'tbl->family' wherever
possible and reference 'nd_tbl' only when IPV6 is enabled.
Fixes: d5eb89cf68 ("mlxsw: spectrum_router: Reflect IPv6 neighbours to the device")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function mlx4_en_arm_cq always returns zero. So change the
return type of the function mlx4_en_arm_cq to void.
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current firmware supported by the driver doesn't support batch deletion
of IPv6 neighbours on a given router interface (RIF).
Until a new version that supports this functionality is made available,
delete neighbours one by one.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each FIB node holds a linked list of routes sharing the same prefix and
length. In the case of IPv4 it's ordered according to table ID, metric
and TOS and only the first route in the list is actually programmed to
the device.
In case a gatewayed route is added somewhere in the list, then after its
nexthop group will be refreshed and become valid (due to the resolution
of its gateway), it'll mistakenly overwrite the existing entry.
Example:
192.168.200.0/24 dev enp3s0np3 scope link metric 1000 offload
192.168.200.0/24 via 192.168.100.1 dev enp3s0np3 metric 1000 offload
Both routes are marked as offloaded despite the fact only the first one
should actually be present in the device's table.
When refreshing the nexthop group, don't write the route to the device's
table unless it's the first in its node.
Fixes: 9aecce1c7d ("mlxsw: spectrum_router: Correctly handle identical routes")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding visibility of resource usage of QPs, CQs and counters used by
virtual functions. This feature will be used to give the PF administrator
more data while debugging VF status. Usage info was added to ALLOC_RES
command, to notify the PF if the resource which is being reserved or
allocated for the VF will be used by kernel driver or by user verbs.
Updated reservation and allocation functions of QP, CQ and counter with
additional usage parameter.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Report 'ipoib_enhanced_offloads' capabilities from
the core layer, it will be used in the next patch from this series.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When delay drop timeout is expired, the firmware raises
general notification event of DELAY_DROP_TIMEOUT subtype.
In addition the feature is disable so the driver have to
reactivate the timeout.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to SET_DELAY_DROP command.
This command will be used in downstream patches for delay packet drop.
The timeout value should be indicated by delay_drop_timeout field.
Packet processing will be delayed till timeout value passed or until
more WQEs are posted.
Setting this value to 0 disables the feature.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When a user sets port_guid, node_guid or policy of an IB virtual
function, save this information in "struct mlx5_vf_context".
This information will be restored later when pci_resume is called.
To make sure this works, one can use aer-inject to generate PCI
errors on mlx5 devices and verify if relevant fields are restored
after PCI resume.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Before running the ethtool's loopback selftest, we need
to make sure that the local loopback is enabled.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, unicast/multicast loopback raw ethernet
(non-RDMA) packets are sent back to the vport.
A unicast loopback packet is the packet with destination
MAC address the same as the source MAC address.
For multicast, the destination MAC address is in the
vport's multicast filter list.
Moreover, the local loopback is not needed if
there is one or none user space context.
After this patch, the raw ethernet unicast and multicast
local loopback are disabled by default. When there is more
than one user space context, the local loopback is enabled.
Note that when local loopback is disabled, raw ethernet
packets are not looped back to the vport and are forwarded
to the next routing level (eswitch, or multihost switch,
or out to the wire depending on the configuration).
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for raw ethernet local loopback firmware command.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull networking fixes from David Miller:
1) BPF verifier signed/unsigned value tracking fix, from Daniel
Borkmann, Edward Cree, and Josef Bacik.
2) Fix memory allocation length when setting up calls to
->ndo_set_mac_address, from Cong Wang.
3) Add a new cxgb4 device ID, from Ganesh Goudar.
4) Fix FIB refcount handling, we have to set it's initial value before
the configure callback (which can bump it). From David Ahern.
5) Fix double-free in qcom/emac driver, from Timur Tabi.
6) A bunch of gcc-7 string format overflow warning fixes from Arnd
Bergmann.
7) Fix link level headroom tests in ip_do_fragment(), from Vasily
Averin.
8) Fix chunk walking in SCTP when iterating over error and parameter
headers. From Alexander Potapenko.
9) TCP BBR congestion control fixes from Neal Cardwell.
10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.
11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
Wang.
12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
from Gao Feng.
13) Cannot release skb->dst in UDP if IP options processing needs it.
From Paolo Abeni.
14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
Levin and myself.
15) Revert some rtnetlink notification changes that are causing
regressions, from David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
net: bonding: Fix transmit load balancing in balance-alb mode
rds: Make sure updates to cp_send_gen can be observed
net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
ipv4: initialize fib_trie prior to register_netdev_notifier call.
rtnetlink: allocate more memory for dev_set_mac_address()
net: dsa: b53: Add missing ARL entries for BCM53125
bpf: more tests for mixed signed and unsigned bounds checks
bpf: add test for mixed signed and unsigned bounds checks
bpf: fix up test cases with mixed signed/unsigned bounds
bpf: allow to specify log level and reduce it for test_verifier
bpf: fix mixed signed/unsigned derived min/max value bounds
ipv6: avoid overflow of offset in ip6_find_1stfragopt
net: tehuti: don't process data if it has not been copied from userspace
Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
NET: dwmac: Make dwmac reset unconditional
net: Zero terminate ifr_name in dev_ifname().
wireless: wext: terminate ifr name coming from userspace
netfilter: fix netfilter_net_init() return
...
The number of possible prefix lengths for IPv6 is 129 and not 128.
Fixes following warning from UBSAN when /128 routes are offloaded:
UBSAN: Undefined behaviour in
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:2510:27 index 128 is out
of range for type 'long unsigned int [128]'
Fixes: 5e9c16cc83 ("mlxsw: spectrum_router: Implement private fib")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions aren't specific to IPv4 and can be re-used for IPv6.
Drop the '4' designation from their name.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Functions that take as argument a FIB entry don't need to take FIB node
as well, as it can be extracted from the entry.
Remove unnecessary FIB node parameter.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions to create and destroy a nexthop group are IPv4 specific
and should be renamed accordingly, so that they won't be confused with
the IPv6 specific functions in follow-up patches.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the parameters stored in the FIB entry structure are specific to
IPv4 and therefore better placed in an IPv4 specific structure.
Create an IPv4 specific structure that encapsulates the common FIB entry
structure and contains IPv4 specific parameters.
In a follow-up patchset an IPv6 specific structure will be introduced.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we fail to insert a route we invoke the abort mechanism which
flushes all the tables and inserts a default route in each, so that all
packets incoming to the router will be trapped to the CPU.
Upon abort, add an IPv6 default route to the IPv6 tables.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of previous patch and allow the RALUE register to be
called with IPv6 routes.
In order to re-use as much code as possible between IPv4 and IPv6, only
the lowest-level function that actually does the register packing is
demuxed based on the passed protocol.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the register so that IPv6 LPM entries could be programmed to the
device's table.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A Virtual Router (VR) is an entity which corresponds to a VRF and
performs FIB lookup in an LPM tree according to the {VR, IP Proto} ->
Tree binding.
Extend the virtual router data structure towards IPv6 FIB offload.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A FIB node is an entity which stores routes sharing the same prefix and
length. The data structure itself is already family agnostic, but we
make some of its operations agnostic as well and thus re-use them for
IPv6 offload.
Instead of passing an IPv4-specific structure to fib4_node_get(), pass
general routing parameters and rename the function accordingly.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When looking up a FIB entry we shouldn't create the FIB node where it's
supposed to be linked in case the node doesn't already exist.
Instead, lookup the node and fail if it doesn't exist.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thankfully, the neighbour subsystem is agnostic to the upper protocol
and used by both IPv4 and IPv6. By removing assumptions regarding the
neighbour type we can thus re-use much of the neighbour-related code for
both IPv4 and IPv6.
For each nexthop, store its gateway IP and for nexthop group store the
neighbour table used by its nexthops.
Use this information throughout the code and remove assumption about the
neighbour type.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neighbours' activity is currently dumped according to the ARP
table's DELAY_PROBE time, but with the introduction of IPv6 offload we
should set the interval according to the minimum between the ARP and
ndisc tables.
Signed-off-by: Arkadi Sharshvesky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to IPv4, periodically dump IPv6 neighbours and update the
kernel about them.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the register so that the active IPv6 neighbours could be dumped
from the device's neighbour table.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with IPv4, listen to NEIGH_UPDATE events from the ndisc table and
program relevant neighbours to the device's neighbour table.
Note that neighbours with a link-local IP address aren't programmed, as
packets with a link-local destination IP are trapped after LPM lookup
and never reach the neighbour table.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the register, so the IPv6 neighbours could be programmed to the
device's neighbour table.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev is configured with an IP address a router interface (RIF)
should be configured for it in the device. Allow configuration of RIFs
based on IPv6 address notifications as well as IPv4.
Note that the RIF exists as long as an IP address is configured on the
netdev, regardless of the address family.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now we only flooded broadcast packets to the router when an L3
interface was configured on top of a bridge. However, IPv6 Neighbour
Discovery packets are trapped to the CPU inside the router and these can
be sent with a multicast address.
Flood unregistered multicast packets to the router port, so that
relevant packets could be trapped there.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before we can start using IPv6, we need to trap certain control packets
to the CPU. Among others, these include Neighbour Discovery, DHCP and
neighbour misses.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable IPv6 and IPv6 forwarding on router interfaces (RIFs), so that
they will be able to receive and forward IPv6 traffic.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before we add IPv6 constructs like traps and router interfaces, we first
need to enable IPv6 routing in the device.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller to the driver marks GFP_NOIO allocations with help
of memalloc_noio-* calls now. This makes redundant to pass down
to the driver gfp flags, which can be GFP_KERNEL only.
The patch removes the gfp flags argument and updates all driver paths.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Up until now IPv6 unregistered multicast traffic would be flooded like
broadcast, even when MLD snooping was enabled on the bridge. This was
intentional as MLD packet traps were missing, preventing the bridge
driver from programming MDB entries to the device.
Previous patch added these traps, so we can now finally flood IPv6
unregistered multicast packets to specific ports via the multicast table
instead of flooding them to all ports via the broadcast table.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for IPv6 MLDv1/2 packet trapping.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case local sockets have the IP_ROUTER_ALERT socket option set, then
they expect to get packets with the Router Alert option.
Trap such packets, so that the kernel could inspect them and potentially
send them to interested sockets.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1c6c6d221e ("mlxsw: spectrum: Mirror certain packets to
CPU") we marked packets that were mirrored to the CPU, so that they
won't be flooded again by the bridge driver.
However, certain packets are trapped in the device's router block, after
passing through the bridge block where they were potentially flooded.
Mark all packets coming from L3 traps, so that they won't be potentially
flooded again by the bridge driver.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support offloading rules that match on ip tos.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ecn and dscp fields to the ipv4 acl block.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define new element for ip tos (ecn, dscp) and place it into scratch area.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support offloading rules that match on ip ttl.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ttl field to the ipv4 acl block.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define new element for ip ttl and place it into scratch area.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function __mlx4_zone_remove_one_entry always returns zero. So
it is not necessary to check it.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't rely on kzalloc() always succeeding, so check its return value.
Suppresses the following smatch error:
mlxsw_sp_switchdev_event() error: potential null dereference
'switchdev_work->fdb_info.addr'. (kzalloc returns
null)
Fixes: af06137892 ("mlxsw: spectrum_switchdev: Add support for learning FDB through notification")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 10e23eb299 ("mlxsw: spectrum: Remove support for bypass bridge
port attributes/vlan set") removed statements that used 'bridge_vlan',
but didn't remove the variable itself resulting in the following warning
with W=1:
warning: variable ‘bridge_vlan’ set but not used
[-Wunused-but-set-variable]
Remove the variable and suppress the warning.
Fixes: 10e23eb299 ("mlxsw: spectrum: Remove support for bypass bridge port attributes/vlan set")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on IPv6 route replace I realized we can have a
use-after-free in IPv4 in case the replaced route is offloaded and the
only one using its FIB info.
The problem is that fib_table_insert() drops the reference on the FIB
info of the replaced routes which is eventually freed via call_rcu().
Since the driver doesn't hold a reference on this FIB info it can cause
a use-after-free when it tries to clear the RTNH_F_OFFLOAD flag stored
in fi->fib_flags.
After running the following commands in a loop for enough time with a
KASAN enabled kernel I finally got the below trace.
$ ip route add 192.168.50.0/24 via 192.168.200.1 dev enp3s0np3
$ ip route replace 192.168.50.0/24 dev enp3s0np5
$ ip route del 192.168.50.0/24 dev enp3s0np5
BUG: KASAN: use-after-free in mlxsw_sp_fib_entry_offload_unset+0xa7/0x120 [mlxsw_spectrum]
Read of size 4 at addr ffff8803717d9820 by task kworker/u4:2/55
[...]
? mlxsw_sp_fib_entry_offload_unset+0xa7/0x120 [mlxsw_spectrum]
? mlxsw_sp_fib_entry_offload_unset+0xa7/0x120 [mlxsw_spectrum]
? mlxsw_sp_router_neighs_update_work+0x1cd0/0x1ce0 [mlxsw_spectrum]
? mlxsw_sp_fib_entry_offload_unset+0xa7/0x120 [mlxsw_spectrum]
__asan_load4+0x61/0x80
mlxsw_sp_fib_entry_offload_unset+0xa7/0x120 [mlxsw_spectrum]
mlxsw_sp_fib_entry_offload_refresh+0xb6/0x370 [mlxsw_spectrum]
mlxsw_sp_router_fib_event_work+0xd1c/0x2780 [mlxsw_spectrum]
[...]
Freed by task 5131:
save_stack_trace+0x16/0x20
save_stack+0x46/0xd0
kasan_slab_free+0x70/0xc0
kfree+0x144/0x570
free_fib_info_rcu+0x2e7/0x410
rcu_process_callbacks+0x4f8/0xe30
__do_softirq+0x1d3/0x9e2
Fix this by taking a reference on the FIB info when creating the nexthop
group it represents and drop it when the group is destroyed.
Fixes: 599cf8f95f ("mlxsw: spectrum_router: Add support for route replace")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch the error path of mlxsw_sp_nexthop_init() is symmetric
with mlxsw_sp_nexthop_fini(). Noticed during code review.
Fixes: a8c9701427 ("mlxsw: spectrum_router: Refactor nexthop init routine")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new IPSec offload code introduced a build error:
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.o: In function `mlx5e_ipsec_build_inverse_table':
ipsec_rxtx.c:(.text+0x556): undefined reference
Another patch was added on top to fix the build error, but
that introduced a new bug, as we now use the remainder of
the division rather than the result.
This makes it use the correct helper function instead.
Fixes: 5dfd87b67c ("net/mlx5: IPSec, Fix 64-bit division on 32-bit builds")
Fixes: 2ac9cfe782 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Latest change in open-lldp code uses bytes 6-11 of perm_addr buffer
as the Ethernet source address for the host TLV packet.
Since our driver does not fill these bytes, they stay at zero and
the open-lldp code ends up sending the TLV packet with zero source
address and the switch drops this packet.
The fix is to initialize these bytes to 0xff. The open-lldp code
considers 0xff:ff:ff:ff:ff:ff as the invalid address and falls back to
use the host's mac address as the Ethernet source address.
Fixes: 3a6a931dfb ("net/mlx5e: Support DCBX CEE API")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently it is not possible to build just one .o file inside
a subdirectory, because the subdirectories lack a Makefile.
Add a Makefile to the mlx5 subdirectories.
Fixes: e29341fb3a ("net/mlx5: FPGA, Add basic support for Innova")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Both the ethernet and FPGA portions of MLX5 now require the wq functions,
and we get a link error when CONFIG_MLX5_CORE_EN is disabled:
drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.o: In function `mlx5_fpga_conn_create_cq':
conn.c:(.text+0x10b3): undefined reference to `mlx5_cqwq_create'
conn.c:(.text+0x10c6): undefined reference to `mlx5_cqwq_get_size'
conn.c:(.text+0x12bc): undefined reference to `mlx5_cqwq_destroy'
Build wq.o even if MLX5_CORE_EN is not selected.
Fixes: 537a505741 ("net/mlx5: FPGA, Add high-speed connection routines")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fix warning when building with -Wall:
drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c:105:5: warning: symbol 'mlx5_fpga_device_brb' was not declared. Should it be static?
Fixes: c43051d72a ("net/mlx5: FPGA, Add SBU bypass and reset flows")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fix warnings when building with -Wall:
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:38:6: warning: symbol 'mlx5_init_reserved_gids' was not declared. Should it be static?
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:47:6: warning: symbol 'mlx5_cleanup_reserved_gids' was not declared. Should it be static?
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:55:5: warning: symbol 'mlx5_core_reserve_gids' was not declared. Should it be static?
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:79:6: warning: symbol 'mlx5_core_unreserve_gids' was not declared. Should it be static?
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:92:5: warning: symbol 'mlx5_core_reserved_gid_alloc' was not declared. Should it be static?
drivers/net/ethernet/mellanox/mlx5/core/lib/gid.c:109:6: warning: symbol 'mlx5_core_reserved_gid_free' was not declared. Should it be static?
Fixes: 52ec462eca ("net/mlx5: Add reserved-gids support")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Some overlapping changes in the mlx5 driver.
A merge conflict resolution posted by Stephen Rothwell was used as a
guide.
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable mlx4_log_num_mgm_entry_size is only called in main.c.
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If mlx5 is set to be built-in and mlxfw as a module, we
get a link error:
drivers/built-in.o: In function `mlx5_firmware_flash':
(.text+0x5aed72): undefined reference to `mlxfw_firmware_flash'
Since we don't want to mandate selecting mlxfw for mlx5 users, we
use the IS_REACHABLE macro to make sure that a stub is exposed
to the caller.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial fix to spelling mistake in mlx5_core_dbg debug message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZU3mtAAoJEEg/ir3gV/o+4k0IAKj5XCn3cviZlXMJRMHBvamt
yWrMI90XgjoPhGPx3K9mf+bMhHOGiZR0Q2DFDJZa5U64DDBVPNvag7fy74GYgj1D
Cet1zohkQ2xdb/R3jfML8tG2IVfvETWo3cgJGFtGUBlOULvpwinSK4A+8oUUGszc
K1vAY0j3+Ncfjk+CZJ8hWqaIk1dyYtjtyn0ACOUOftqBa6+UZY7LbLTTOI7hOZoX
3M35W7ntgGoBScONlxpDUXNUewia4ADTiQPWwHdT9+xNlwz1fzmCHlYi5pY+z9TC
PKbbe1O4l1nsMftwqJVQNHrFnq+x/X69J5vlgobWkk0dQCRQWE9qanG8BfXPykY=
=DUG8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2017-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2017-06-28
This series contains some fixes for the mlx5 core and netdev driver.
Please pull and let me know if there's any problem.
For -stable:
("net/mlx5e: Fix TX carrier errors report in get stats ndo") Kernels >= v4.7
("net/mlx5: Cancel delayed recovery work when unloading the driver") Kernels >= v4.10
* When applied to net-next this will introduce a contextual conflict, it
should be easy to resolve, (a spin_lock was changed to spin_lock_irqsave in net-next),
if you need any help with this please let me know.
("net/mlx5: Fix driver load error flow when firmware is stuck") Kernels >= v4.4*
* This patch fixes: 6c780a0267 ("net/mlx5: Wait for FW readiness before initializing command interface")
which was submitted two weeks ago and queued up for v4.4.
Sorry about the mess, but other than the above, this series doesn't introduce
any conflict with the current mlx5 IPSec offload series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the number of TX queues that are allocated doesn't depend
on the number of TCs, the module always loads with max num of UP
per channel.
In order to prevent the allocation of unnecessary memory, the
module will load with minimum number of UPs per channel, and the
user will be able to control the number of TX queues per channel
by changing the number of TC to 8 using the tc command.
The variable num_up will hold the information about the current
number of UPs.
Due to the change, needed to remove the lines that set the value of
UP to be different than zero in the func "mlx4_en_select_queue",
since now the num of TX queues that are allocated is only one per channel
in default.
In order not to force the UP to be zero in case of only one TC, added
a condition before forcing it in the func "mlx4_en_fill_qp_context".
Tested:
After the module is loaded with minimum number of UP per channel, to
increase num of TCs to 8, use:
tc qdisc add dev ens8 root mqprio num_tc 8
In order to decrease the number of TCs to minimum number of UP per channel,
use:
tc qdisc del dev ens8 root
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until this patch, the number of UPs was hard coded for eight.
Replace this with a variable in struct "mlx4_en_port_profile".
Currently, the variable will hold the maximum number of UP,
as before.
The patch creates an infrastructure to add an option for dynamic
change of the actual number of TCs.
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case a VLAN device is enslaved to a bridge we shouldn't create a
router interface (RIF) for it when it's configured with an IP address.
This is already handled by the driver for other types of netdevs, such
as physical ports and LAG devices.
If this IP address is then removed and the interface is subsequently
unlinked from the bridge, a NULL pointer dereference can happen, as the
original 802.1d FID was replaced with an rFID which was then deleted.
To reproduce:
$ ip link set dev enp3s0np9 up
$ ip link add name enp3s0np9.111 link enp3s0np9 type vlan id 111
$ ip link set dev enp3s0np9.111 up
$ ip link add name br0 type bridge
$ ip link set dev br0 up
$ ip link set enp3s0np9.111 master br0
$ ip address add dev enp3s0np9.111 192.168.0.1/24
$ ip address del dev enp3s0np9.111 192.168.0.1/24
$ ip link set dev enp3s0np9.111 nomaster
Fixes: 99724c18fc ("mlxsw: spectrum: Introduce support for router interfaces")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Petr Machata <petrm@mellanox.com>
Tested-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patchset adds support for Innova IPSec network interface card.
About Innova device:
--------------------
Innova is a network card with a ConnectX chip and an FPGA chip as a
bump-on-the-wire.
Internal
+----------+ Link +-----------------+
| +--------------+ FPGA | +------+
| ConnectX | | Shell +--+ QSFP |
| +--------------+ +-------+ | | Port |
+----------+ I2C | | SBU | | +------+
| +-------+ |
+--+----------+---+
| |
+--+--+ +---+---+
| DDR | | Flash |
+-----+ +-------+
The FPGA synthesized logic is loaded from dedicated flash storage and has
access to its own dedicated DDR RAM.
The ConnectX chip firmware programs the FPGA by accessing its configuration
space over either the slow internal I2C link or the high-speed internal link.
The FPGA logic is divided into a "Shell" and a "Sandbox Unit" (SBU).
mlx5_core driver (with CONFIG_MLX5_FPGA) handles all shell functionality,
while other components may handle the various SBU functionalities.
The driver opens high-speed reliable communication channels with the shell and
the SBU over the internal link.
These channels may be used for high-bandwidth configuration or for SBU-specific
out-of-band data paths.
About Innova IPSec device:
--------------------------
Innova IPSec is a network card that allows offloading IPSec cryptography operations
from the host CPU to the NIC. It is an Innova card with an IPSec SBU.
The hardware keeps the database of IPSec Security Associations (SADB) in the FPGA's
DDR memory.
Internal
+----------+ Link +-----------------+
| +--------------+ FPGA | +------+
| ConnectX | | Shell +--+ QSFP |
| +--------------+ +-------+ | | Port |
+----------+ Internal I2C | | IPSec | | +------+
| | SBU | |
| +-------+ |
+--+----------+---+
| |
+--+--+ +---+---+
| DDR | | |
| | | Flash |
|SADB | | |
+-----+ +-------+
Modes and ciphers:
Currently the following modes and ciphers are supported:
IPv4 and IPv6
ESP tunnel and transport modes
AES 128 and 256 bit encryption, with GCM authentication (RFC4106)
IV is generated using seqiv, in sync with Linux's geniv.
More modes and ciphers may be added later.
Notes:
In the future similar functionality will be included in a single-chip NIC.
About the driver:
-----------------
Patches 1-4 prepare some existing driver code for the new feature:
* Add support for reserved GIDs in the hardware GID table
* Allow multiple modules to enable hardware RoCE support independently
Patches 5-6 define structs and helper functions for QP work-queues.
Patches 7-11 add various FPGA-related features required for Innova.
IPSec.
Patch 12 adds abstraction layer for Mellanox IPSec-offload capable devices.
atches 13-16 add IPSec offload support to the mlx5 netdevice.
This driver services the new IPSec offload API introduced in commit
d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Configuration Path:
If Innova IPSec device is detected, the mlx5e netdevice gets the new
NETIF_F_HW_ESP feature and the xdo callbacks, indicating ESP offload
capabilities, and also the matching TX checksum and GSO features.
The driver configures offloaded Security Associations (SAs) by sending
an ADD_SA or DEL_SA message to the IPSec SBU, which updates the SADB in DDR.
These messages and their responses are sent over a high-speed channel.
Counters for ethtool are retrieved by the driver from the SBU.
Data path:
On receive path, the SBU decrypts ESP packets which match the offloaded SADB,
but keeps them encapsulated.
The SBU injects metadata (Mellanox owned ethertype) indicating that crypto-offload
has taken place, the SA with which it was done, and the authentication result.
The ConnectX chip performs RX checksum offload on the packet, and RSS using the
ESP SPI value. The driver detects the special ethertype, and attaches a struct
secpath to the RX SKB, including flags to indicate that crypto offload took place,
the authentication result, and which xfrm_state was used for decryption, in the
olen and ovec members. The RX SKB may have useful CHECKSUM_COMPLETE. A separate
patchset will add support for that in the xfrm stack.
On transmit path, the stack encapsulates the packet but does not encrypt it, and
indicates in the SKB's secpath that crypto offload is to be performed and the SA
to use to do so.
The driver avoids performing crypto-offload for ESP fragments, and packets with
IP options, as the SBU cannot currently do that. For eligible packets, the driver
prepends a special ethertype with metadata instructing the hardware to perform crypto offload.
The stack builds regular (non-GSO) SKBs so that they contain a placeholder for the ESP trailer.
The driver trims it off, because the SBU automatically appends the trailer for offloaded packets.
The ConnectX chip performs TX checksum offload on inner UDP or TCP packets,
and GSO for TCP packets (duplicating the prepended metadata).
The segmented packets then undergo encryption in the SBU before going on the wire.
Performance:
We measure single stream of TCP on Intel(R) Xeon(R) CPU E5-2643 v2 @3.50GHz
Using AES-NI with ESP GSO we get constant 4.1 Gbps.
Using crypto offload we get constant 18 Gbps.
Note that these numbers require CHECKSUM_COMPLETE support in XFRM, which we submit separately.
- Ilan Tayari
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZUmf1AAoJEEg/ir3gV/o+ukIIALp/5+E1W0cC9xvY1X9dTETW
cKsHvDJ7G1CxUy18W8Mf9z+WOqC6hGCqS+yicOb+umfIqkTcLHDb2irlqprYLC+F
oYl1HqgHTaiAYByqL90qiyPcFbfsaNIqA9KOsED2qdZ1yxjoYBiJnSDZDAdO/0lN
Lt1czNswFc5ovnEUGn8bkjLZZH2pJoJWEI4g4hN9cq33BLLq8A795F/ZjwCJTQ1X
qXdKcEmktBrgZiSiTVFxxpQVhO/uB0HmzaZzrY1k1P5e6yhHEr422mcOcF9KcSL4
aeyRYHjoIh51vPMbScPjvfbO/PwooU3LWLlxLVNLG0MmkSaGyJeUXg/wHsGI910=
=JN0A
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2017-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-06-27 (Innova IPsec offload support)
This patchset adds support for Innova IPSec network interface card.
About Innova device:
--------------------
Innova is a network card with a ConnectX chip and an FPGA chip as a
bump-on-the-wire.
Internal
+----------+ Link +-----------------+
| +--------------+ FPGA | +------+
| ConnectX | | Shell +--+ QSFP |
| +--------------+ +-------+ | | Port |
+----------+ I2C | | SBU | | +------+
| +-------+ |
+--+----------+---+
| |
+--+--+ +---+---+
| DDR | | Flash |
+-----+ +-------+
The FPGA synthesized logic is loaded from dedicated flash storage and has
access to its own dedicated DDR RAM.
The ConnectX chip firmware programs the FPGA by accessing its configuration
space over either the slow internal I2C link or the high-speed internal link.
The FPGA logic is divided into a "Shell" and a "Sandbox Unit" (SBU).
mlx5_core driver (with CONFIG_MLX5_FPGA) handles all shell functionality,
while other components may handle the various SBU functionalities.
The driver opens high-speed reliable communication channels with the shell and
the SBU over the internal link.
These channels may be used for high-bandwidth configuration or for SBU-specific
out-of-band data paths.
About Innova IPSec device:
--------------------------
Innova IPSec is a network card that allows offloading IPSec cryptography operations
from the host CPU to the NIC. It is an Innova card with an IPSec SBU.
The hardware keeps the database of IPSec Security Associations (SADB) in the FPGA's
DDR memory.
Internal
+----------+ Link +-----------------+
| +--------------+ FPGA | +------+
| ConnectX | | Shell +--+ QSFP |
| +--------------+ +-------+ | | Port |
+----------+ Internal I2C | | IPSec | | +------+
| | SBU | |
| +-------+ |
+--+----------+---+
| |
+--+--+ +---+---+
| DDR | | |
| | | Flash |
|SADB | | |
+-----+ +-------+
Modes and ciphers:
Currently the following modes and ciphers are supported:
IPv4 and IPv6
ESP tunnel and transport modes
AES 128 and 256 bit encryption, with GCM authentication (RFC4106)
IV is generated using seqiv, in sync with Linux's geniv.
More modes and ciphers may be added later.
Notes:
In the future similar functionality will be included in a single-chip NIC.
About the driver:
-----------------
Patches 1-4 prepare some existing driver code for the new feature:
* Add support for reserved GIDs in the hardware GID table
* Allow multiple modules to enable hardware RoCE support independently
Patches 5-6 define structs and helper functions for QP work-queues.
Patches 7-11 add various FPGA-related features required for Innova.
IPSec.
Patch 12 adds abstraction layer for Mellanox IPSec-offload capable devices.
atches 13-16 add IPSec offload support to the mlx5 netdevice.
This driver services the new IPSec offload API introduced in commit
d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Configuration Path:
If Innova IPSec device is detected, the mlx5e netdevice gets the new
NETIF_F_HW_ESP feature and the xdo callbacks, indicating ESP offload
capabilities, and also the matching TX checksum and GSO features.
The driver configures offloaded Security Associations (SAs) by sending
an ADD_SA or DEL_SA message to the IPSec SBU, which updates the SADB in DDR.
These messages and their responses are sent over a high-speed channel.
Counters for ethtool are retrieved by the driver from the SBU.
Data path:
On receive path, the SBU decrypts ESP packets which match the offloaded SADB,
but keeps them encapsulated.
The SBU injects metadata (Mellanox owned ethertype) indicating that crypto-offload
has taken place, the SA with which it was done, and the authentication result.
The ConnectX chip performs RX checksum offload on the packet, and RSS using the
ESP SPI value. The driver detects the special ethertype, and attaches a struct
secpath to the RX SKB, including flags to indicate that crypto offload took place,
the authentication result, and which xfrm_state was used for decryption, in the
olen and ovec members. The RX SKB may have useful CHECKSUM_COMPLETE. A separate
patchset will add support for that in the xfrm stack.
On transmit path, the stack encapsulates the packet but does not encrypt it, and
indicates in the SKB's secpath that crypto offload is to be performed and the SA
to use to do so.
The driver avoids performing crypto-offload for ESP fragments, and packets with
IP options, as the SBU cannot currently do that. For eligible packets, the driver
prepends a special ethertype with metadata instructing the hardware to perform crypto offload.
The stack builds regular (non-GSO) SKBs so that they contain a placeholder for the ESP trailer.
The driver trims it off, because the SBU automatically appends the trailer for offloaded packets.
The ConnectX chip performs TX checksum offload on inner UDP or TCP packets,
and GSO for TCP packets (duplicating the prepended metadata).
The segmented packets then undergo encryption in the SBU before going on the wire.
Performance:
We measure single stream of TCP on Intel(R) Xeon(R) CPU E5-2643 v2 @3.50GHz
Using AES-NI with ESP GSO we get constant 4.1 Gbps.
Using crypto offload we get constant 18 Gbps.
Note that these numbers require CHECKSUM_COMPLETE support in XFRM, which we submit separately.
- Ilan Tayari
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial fix to spelling mistake in mlx4_dbg debug message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the TX data path, prepend a special metadata ethertype which
instructs the hardware to perform cryptography.
In addition, fill Software-Parser segment in TX descriptor so
that the hardware may parse the ESP protocol, and perform TX
checksum offload on the inner payload.
Support GSO, by providing the inverse of gso_size in the metadata.
This allows the FPGA to update the ESP header (seqno and seqiv) on the
resulting packets, by calculating the packet number within the GSO
back from the TCP sequence number.
Note that for GSO SKBs, the stack does not include an ESP trailer,
unlike the non-GSO case.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In RX data path, the hardware prepends a special metadata ethertype
which indicates that the packet underwent decryption, and the result of
the authentication check.
Communicate this to the stack in skb->sp.
Make wqe_size large enough to account for the injected metadata.
Support only Linked-list RQ type.
IPSec offload RX packets may have useful CHECKSUM_COMPLETE information,
which the stack may not be able to use yet.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add Innova IPSec ESP crypto offload configuration paths.
Detect Innova IPSec device and set the NETIF_F_HW_ESP flag.
Configure Security Associations using the API introduced in a previous
patch.
Add Software-parser hardware descriptor layout
Software-Parser (swp) is a hardware feature in ConnectX which allows the
host software to specify protocol header offsets in the TX path, thus
overriding the hardware parser.
This is useful for protocols that the ASIC may not be able to parse on
its own.
Note that due to inline metadata, XDP is not supported in Innova IPSec.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add routines for manipulating the hardware IPSec SA database (SADB).
In Innova IPSec, a Security Association (SA) is added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.
Add implementation for Innova IPSec (FPGA-based) hardware.
These routines will be used by the IPSec offload support in a later patch
However they may also be used by others such as RDMA and RoCE IPSec.
mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.
In this patchset we add Innova IPSec support and mlx5/accel delegates
IPSec offloads to Innova routines.
In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add interface to initialize and interact with Innova FPGA SBU
connections.
A client driver may use these functions to set up a high-speed DMA
connection with its SBU hardware logic, and send/receive messages
over this connection.
A later patch in this patchset will make use of these functions for
Innova IPSec offload in mlx5 Ethernet driver.
Add commands to retrieve Innova FPGA SBU capabilities, and to
read/write Innova FPGA configuration space registers and memory,
over internal I2C.
At high level, the FPGA configuration space is divided such:
0x00000000 - 0x007fffff is reserved for the SBU
0x00800000 - 0xffffffff is reserved for the Shell
0x400000000 - ... is DDR memory
A later patchset will add support for accessing FPGA CrSpace and memory
over a high-speed connection. This is the reason for the ACCESS_TYPE
enumeration, which currently only supports I2C.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The Innova FPGA includes shell hardware and Sandbox-Unit (SBU) hardware.
The shell hardware is handled by mlx5_core itself, while the SBU is
handled by a client driver.
Reset the SBU to a well-known initial state when initializing a new
device, and set the FPGA to bypass mode when uninitializing a device.
This allows the client driver to assume that its device has been
reset when a new device is detected.
During SBU reset, the FPGA is put into SBU-bypass mode. In this mode
packets do not pass through the SBU, so it cannot affect the network
data stream at all.
A factory-image does not have an SBU, so skip these flows.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
An FPGA high-speed connection has two endpoints, an FPGA QP and a
ConnectX QP.
Add library routines to create and connect the endpoints of an
FPGA high-speed connection.
These routines allow creating and interacting with both types of
connections: Shell and Sandbox Unit (SBU).
Shell connection provides an interface to the FPGA's address space,
which includes the configuration space and the DDR.
Use of the shell connection will be introduced in a later patchset.
SBU connection provides a command and/or data interface to the
application-specific logic within the FPGA.
Use of the SBU connection will be introduced in a later patch in
this patchset.
Some struct definitions are added to a new header file sdk.h, which
will be extended in later patches in the patchset.
This header file will contain the in-kernel FPGA client driver API.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>