Under multipath it's possible for us to offload the flow only through
the e-switch for which proper route through the uplink exists.
When the port is up and the next-hop route is set again we want to
offload through it as well.
We generate SW event from the FIB event handler when multipath port
affinity changes. The tc offloads code gets this event, goes over the
flows which were marked as of having missing route and attempts to
offload them.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath offload scheme, as part of handling fib events, emit
mlx5 port affinity event on the enabled ports which will be handled by
the tc offloads code.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In a similar manner to uplink/VF LAG, under multipath we add encap peer
rule on the second port as well.
However, unlike the LAG case, we do want to allow failure for adding
one of the rules. This happens due to using a routing hint while doing
the route lookup when one path (next hop device) is down.
Introduce a new flag to indicate that route lookup failed for encap
flow. Note that a flow may still not be offloaded to hw due to missing
neighbour, in that case, the neigh update event will take care of it.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the peer flow inherits the flags from the original flow
after we've set it. At this time the flags are set according to
the flow state, e.g marked as going to slow path and such.
Even if not getting us to real bugs now, this opens the door to
get us to troubles later. Future proof the code and avoid the
inheritance, use the peer flags as were set on input when we
started adding the original flow.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To support multipath offload we are going to track SW multipath route
and related nexthops. To do that we register to FIB notifier and handle
the route and next-hops events and reflect that as port affinity to HW.
When there is a new multipath route entry that all next-hops are the
ports of an HCA we will activate LAG in HW.
Egress wise, we use HW LAG as the means to emulate multipath on current
HW which doesn't support port selection based on xmit hash. In the
presence of multiple VFs which use multiple SQs (send queues) this
yields fairly good distribution.
HA wise, HW LAG buys us the ability for a given RQ (receive queue) to
receive traffic from both ports and for SQs to migrate xmitting over
the active port if their base port fails.
When the route entry is being updated to single path we will update
the HW port affinity to use that port only.
If a next-hop becomes dead we update the HW port affinity to the living
port.
When all next-hops are alive again we reset the affinity to default.
Due to FW/HW limitations, when a route is deleted we are not disabling
the HW LAG since doing so will not allow us to enable it again while
VFs are bounded. Typically this is just a temporary state when a
routing daemon removes dead routes and later adds them back as needed.
This patch only handles events for AF_INET.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In order to offload ecmp-on-host scheme where next-hop routes are used,
we will make use of HW LAG. Add accessor function to let upper layers
in the driver to realize if the lag acts in multi-path mode.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of using the system workqueue, allocate our own workqueue.
This workqueue will be used to handle more work in the next patch.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The change is a refactoring step towards a multipath use case.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This fix checkpatch check
CHECK: Avoid using bool structure members because of possible alignment
issues
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
EAGAIN is treated as a specific case when we consider the attachment
successful but wait for neigh event before offloading the flow.
This can result in unwanted behavior when sub calls on the offloading
path will return EAGAIN and we pass this error up.
Instead of attaching to a specific error code return a boolean value
from the attach encap operation saying if the encap is valid or not.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Remove the tunnel info argument which we can get from the other args.
Also reorder the args to have input args first and output args later.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Function mlx5e_tx_reporter_recover_from_ctx is only used within mlx5e tx
reporter, move it to be statically declared in en/reporter_tx.c.
Fixes: de8650a820 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current fib_multipath_hash_policy can make hash based on the L3 or
L4. But it only work on the outer IP. So a specific tunnel always
has the same hash value. But a specific tunnel may contain so many
inner connections.
This patch provide a generic multipath_hash in floi_common. It can
make a user-define hash which can mix with L3 or L4 hash.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have converted all possible callers to using a switchdev
notifier for attributes we do not have a need for implementing
switchdev_ops anymore, and this can be removed from all drivers the
net_device structure.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use a notifier to perform those tasks.
Prepare mlxsw to support receiving notifier events targeting
SWITCHDEV_PORT_ATTR_SET and utilize the switchdev_handle_port_attr_set()
to handle stacking of devices.
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/mellanox/mlxsw/spectrum.c: In function 'mlxsw_sp_port_get_link_ksettings':
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3062:5: warning:
variable 'autoneg_status' set but not used [-Wunused-but-set-variable]
It's not used since commit 475b33cb66 ("mlxsw: spectrum: Remove unsupported
eth_proto_lp_advertise field in PTYS")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Being able to build devlink as a module causes growing pains.
First all drivers had to add a meta dependency to make sure
they are not built in when devlink is built as a module. Now
we are struggling to invoke ethtool compat code reliably.
Make devlink code built-in, users can still not build it at
all but the dynamically loadable module option is removed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kzalloc(size, GFP_KERNEL)
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL)
Notice that, in this case, variable alloc_size is not necessary, hence
it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hit the new tracepoint once the vregion migration ends.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other mutexes are taking care of proper locking for this, no longer
needed to take RTNL mutex here.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer require RTNL lock in this code. Newly introduced mutexes take
care of guarding objagg and bloom filter. There is no need to guard
gen_pool_alloc()/gen_pool_free() as they are fine to be called lockless.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Relax dependency on rtnl mutex during vregion_rehash_intrvl_set(). The
vregion list is protected with newly introduced mutex.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protect objagg structures by adding a mutex to ERP code and take it
during the structure manipulation.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For MR ACL profile is does not make sense to do periodical rehashes, as
there is only one mask in use during the whole vregion lifetime.
Therefore periodical work is scheduled but the rehash never happens.
So allow to enable/disable rehash for the whole group, which is added
per-profile. Disable rehashing for MR profile.
Addition to the vregion list is done only in case the rehash is enable
on the particular vregion. Also, the addition is moved after delayed
work init to avoid schedule of uninitialized work
from vregion_rehash_intrvl_set(). Symmetrically, deletion from
the list is done before canceling the delayed work so it is
not scheduled by vregion_rehash_intrvl_set() again.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bloom filter is shared within multiple regions. For updates, it needs to
be guarded by a separate mutex. Do that in order to not rely on RTNL
mutex.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on RTNL, introduce a mutex
to guard vregion structure, list of chunks and list of entries in
chunks.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor existing _vchunk_assoc/_vchunk_deassoc() functions into
_vregion_get()/_vregion_put() to make the code simpler and prepared for
vregion locking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove RTNL lock dependency, it is needed to protect
the regions list in a group. Introduce a mutex to do the job.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the existing group structure to contain fields needed for HW region
list manipulations. Move the rest of the fields into new vgroup struct.
This makes layering cleaner as the vgroup struct is on higher level than
low-level group struct. Also, this makes it possible to introduce
fine-grained locking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Never used, remove it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series adds some misc updates to mlx5 driver,
1) Eli Britstein, Introduces tunnel entropy control from PCMR register
and fixes GRE key by controlling port tunnel entropy calculation.
2) Eran Ben Elisha, provides some mlx5 fixes to the latest tx devlink health
reporting mechanism.
3) Huy Nguyen, Added the support for ndo bridge_setlink to allow
VEPA/VEB E-Switch legacy mode configurations.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJccGvRAAoJEEg/ir3gV/o+6kIH/1Y23h0OZrKkkfELujtzoOjx
64BIRhW8Tr+WOXZNKlSvdRkPj0g1yG3B0NO7wChXZDl8dgqBp/Jx5Wxcvb5DKkz/
wK07oKB6KqgcMILzUe5G4iOHw/atMldzSA6sNPh8SDKVKKkXjMKz8ocQZZQRsAaX
1im2D77pX8zE/OhG86Fn23Fed1G6Lg++faPC8GEOxv9LkwaAARa5NfJ6g7AnXOGg
ijmM4R2Hhu0V0YaKXS2qJFb6HQaNO9JKYR1C7wF7z9yqnsn14cdxrAWkE4eA2JnP
zKnDt+ThAKAoX9mn5QHzD8uye56Jeksvo8br4dCH4iDZZSSZT+lo456Dfe+tXpc=
=ef79
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-02-21
This series adds some misc updates to mlx5 driver,
1) Eli Britstein, Introduces tunnel entropy control from PCMR register
and fixes GRE key by controlling port tunnel entropy calculation.
2) Eran Ben Elisha, provides some mlx5 fixes to the latest tx devlink health
reporting mechanism.
3) Huy Nguyen, Added the support for ndo bridge_setlink to allow
VEPA/VEB E-Switch legacy mode configurations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Spectrum-2 ASIC support for the following new port types and speeds:
* 50Gbps 1-lane
* 100Gbps 2-lanes
* 200Gbps 4-lanes
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Spectrum-2 ASIC port type-speed operations.
Since multiple ethtool link modes are represented using a single bit in the
ASIC, the driver forces the user to configure all types per a specific
speed. For example, if the user wants to advertise 100Gbps 4-lanes speed,
he should advertise all the types of 100Gbps 4-lanes speed that are
supported by the ASIC as shown below:
Supported ethtool bits for 100Gbps 4-lanes:
0x1000000000 100000baseKR4 Full
0x2000000000 100000baseSR4 Full
0x4000000000 100000baseCR4 Full
0x8000000000 100000baseLR4_ER4 Full
Command for advertising 100Gbps 4-lanes:
ethtool -s enp3s0np1 advertise 0xF000000000
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PTYS register introduces a new layout for port type-speed fields. These
fields extend the existing ones in order to handle more types and speeds.
For example, the new 200Gbps speed.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename p_eth_proto_adm to p_eth_proto_admin in mlxsw_reg_ptys_eth_unpack
function.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add port type-speed operations in order to have different operations for
different ASICs. For now, both ASICs use the same pointer.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename port speed-type functions to be Spectrum-1 ASIC specific.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of deriving the port connector type from port admin state, query it
from firmware.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove eth_proto_lp_advertise field in PTYS register since it is not
supported by the firmware.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicate port link mode entry from mlxsw_sp_port_link_mode.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Give precision identifiers to the two snprintf() formatting the priority
and TC strings to avoid producing these two warnings:
drivers/net/ethernet/mellanox/mlxsw/spectrum.c: In function
'mlxsw_sp_port_get_prio_strings':
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2132:37: warning: '%d'
directive output may be truncated writing between 1 and 3 bytes into a
region of size between 0 and 31 [-Wformat-truncation=]
snprintf(*p, ETH_GSTRING_LEN, "%s_%d",
^~
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2132:3: note: 'snprintf'
output between 3 and 36 bytes into a destination of size 32
snprintf(*p, ETH_GSTRING_LEN, "%s_%d",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mlxsw_sp_port_hw_prio_stats[i].str, prio);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlxsw/spectrum.c: In function
'mlxsw_sp_port_get_tc_strings':
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2143:37: warning: '%d'
directive output may be truncated writing between 1 and 11 bytes into a
region of size between 0 and 31 [-Wformat-truncation=]
snprintf(*p, ETH_GSTRING_LEN, "%s_%d",
^~
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2143:3: note: 'snprintf'
output between 3 and 44 bytes into a destination of size 32
snprintf(*p, ETH_GSTRING_LEN, "%s_%d",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mlxsw_sp_port_hw_tc_stats[i].str, tc);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow enabling VEPA mode on the HCA's port in legacy devlink mode.
Example:
bridge link set dev ens1f0 hwmode vepa
will turn on VEPA mode on the netdev ens1f0.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In Virtual Ethernet Port Aggregator (VEPA) mode, the packet skips
the system internal virtual switch and forwards to external network
switch. In Mellanox HCA case, the virtual switch is the HCA's Eswitch.
To support this, an new FDB flow table are created with level 0 and
linked to the existing FDB flow table in legacy mode. By default,
VEPA is turned off and this FDB flow table is empty. When VEPA is
turned on, two rules are created. One rule to forward on uplink vport
traffic to the legacy FDB. The other rule forward all other traffic
to uplink vport.
Other design alternatives were not chosen as explained below:
1. Create a forward rule in ACL flow table (most efficient design).
This approach is the not chosen because firmware does not support
forward rule to uplink vport (0xffff) for ACL flow table.
2. Add additional source port criteria in all the FDB rules to make the
FDB rules to be received rules only. This approach is not chosen because
it is not efficient as there can many rules in the FDB and VEPA mode
cannot be controlled per vport.
3. Add a highest prioirty flow group in the existing legacy FDB Flow
Table instead of a new flow table. This approoach does not work because the
new flow group has the same match criteria as the promiscuous flow group
and mlx5_add_flow_rules does not allow specifying flow group.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
If reporter is ERR_PTR or NULL, error code shall be returned. At all other
cases it shall return success. Fix that.
Fixes: de8650a820 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In case of lost interrupt recover, we shall return success. Fix that.
Fixes: 7d91126b1a ("net/mlx5e: Add tx timeout support for mlx5e tx reporter")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When TX reporter was introduced, it took ownership over TX timeout error
handling. this introduced a regression in case TX reporter is not valid
(NET_DEVLINK is not set, or devlink_health_reporter_create failure).
Fix mlx5e_tx_reporter_timeout function so it can be called at all times.
In addition, remove a warning print that indicates that a TX timeout won't
be handled in case of no valid TX reporter.
Fixes: 7d91126b1a ("net/mlx5e: Add tx timeout support for mlx5e tx reporter")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Print warning message in case of TX reporter creation failure, only if the
return value is ERR_PTR type. NULL pointer return indicates that
NET_DEVLINK is not set, and the warning print can be skipped.
Fixes: de8650a820 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Flow entropy is calculated on the inner packet headers and used for
flow distribution in processing, routing etc. For GRE-type
encapsulations the entropy value is placed in the eight LSB of the key
field in the GRE header as defined in NVGRE RFC 7637. For UDP based
encapsulations the entropy value is placed in the source port of the
UDP header.
The hardware may support entropy calculation specifically for GRE and
for all tunneling protocols. With commit df2ef3bff1 ("net/mlx5e: Add
GRE protocol offloading") GRE is offloaded, but the hardware is
configured by default to calculate flow entropy so packets transmitted
on the wire have a wrong key. To support UDP based tunnels (i.e VXLAN),
GRE (i.e. no flow entropy) and NVGRE (i.e. with flow entropy) the
hardware behaviour must be controlled by the driver.
Ensure port entropy calculation is enabled for offloaded VXLAN tunnels
and disable port entropy calculation in the presence of offloaded GRE
tunnels by monitoring the presence of entropy enabling tunnels (i.e
VXLAN) and entropy disabing tunnels (i.e GRE).
Fixes: df2ef3bff1 ("net/mlx5e: Add GRE protocol offloading")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently changing a PCMR field is done by setting the field in a
zeroed buffer, zeroing other unrelated fields.
Fix this behaviour by modifying only the required field after first
reading the current register values, as a pre-step towards using more
fields in PCMR register.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
After AF_PACKET is fixed to calculate the transport header offset
correctly, trust the value set by the kernel. If the offset wasn't set,
it means there is no transport header in the packet.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_transport_offset() == 0 is not a special value. The only special
value is when skb->transport_header is ~0U, and it's checked by
skb_transport_header_was_set().
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cap_max_headroom_size holds maximum headroom size supported.
Overstepping that limit might under certain conditions lead to ASIC
freeze.
Query and store the value, and add mlxsw_sp_sb_max_headroom_cells() for
obtaining the stored value. In __mlxsw_sp_port_headroom_set(), reject
requests where the total port buffer is larger than the advertised
maximum.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recommendation for headroom size for 100Gbps port and 100m cable is
101.6KB, reduced accordingly for split ports. The closest higher number
evenly divisible by cell size for both Spectrum-1 and Spectrum-2, and
such that the number of cells can be further divided by maximum split
factor of 4, is 102528 bytes, or 25632 bytes per lane.
Update mlxsw_sp_port_pb_init() to compute the headroom taking into
account this recommended per-lane value and number of lanes actually
dedicated to a given port.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Customize the tables related to shared buffer configuration to match the
current recommendation for Spectrum-2 systems.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SBMM register configures the shared buffer quota for MC packets
according to Switch-Priority. The default configuration depends on the
chip type. Therefore keep the table and length in struct
mlxsw_sp_sb_vals. Redirect the references from the global definitions to
the fields.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SBCM register configures shared buffer quota according to
port-priority resp. port-TC. The default configuration depends on the
chip type. Therefore keep the tables and their lengths in struct
mlxsw_sp_sb_vals. Redirect the references from the global definitions to
the fields.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SBPR register configures shared buffer pools. The default
configuration depends on the chip type. Therefore keep it in struct
mlxsw_sp_sb_vals. Redirect the one reference from the global array to
the field.
Because the pool descriptor ID is implicit in the ordering of array
members, both this array and the pool descriptor array have the same
length. Therefore reuse mlxsw_sp_sb.pool_dess_len for the purpose of
determining the length of SBPR array.
Drop the now useless MLXSW_SP_SB_PRS_LEN.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SBPM register can be used to configure quotas for packets ingressing
from a certain pool to a certain port, and egressing from a certain pool
to a certain port. The default configuration depends on the chip type.
Therefore keep it in struct mlxsw_sp_sb_vals. Redirect the one reference
from the global array to the field.
Because the pool descriptor ID is implicit in the ordering of array
members, both this array and the pool descriptor array have the same
length. Therefore reuse mlxsw_sp_sb.pool_dess_len for the purpose of
determining the length of SBPM array.
Drop the now useless MLXSW_SP_SB_PMS_LEN.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep the table of pool descriptors and its length in struct
mlxsw_sp_sb_vals so that it can be specialized per chip type. Redirect
all users from the global definitions to the mlxsw_sp_sb fields.
Give mlxsw_sp_pool_count() an extra mlxsw_sp parameter so that it can
access the descriptor table.
Drop the now unnecessary MLXSW_SP_SB_POOL_DESS_LEN.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 will be configured with a different set of pools than
Spectrum-1. The size of prs and pms buffers will therefore depend on the
chip type of the device.
Therefore, instead of reserving an array directly in a structure
definition, allocate the buffer in mlxsw_sp_sb_port{,s}_init().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 will be configured with a different shared buffer
configuration than Spectrum-1. Therefore introduce a structure for
keeping the chip-specific default and immutable configuration.
Configuration mutable in runtime will still be kept in struct
mlxsw_sp_sb.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the bridge no longer calling switchdev_port_attr_get() to obtain
the supported bridge port flags from a driver but instead trying to set
the bridge port flags directly and relying on driver to reject
unsupported configurations, we can effectively get rid of
switchdev_port_attr_get() entirely since this was the only place where
it was called.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have converted the bridge code and the drivers to check for
bridge port(s) flags at the time we try to set them, there is no need
for a get() -> set() sequence anymore and
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT therefore becomes unused.
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for getting rid of switchdev_port_attr_get(), have mlxsw
check for the bridge flags being set through switchdev_port_attr_set()
when the SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS attribute identifier is
used.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
To resolve conflicts with net-next and pick up the first patch.
* branch 'mlx5-next':
net/mlx5: Factor out HCA capabilities functions
IB/mlx5: Add support for 50Gbps per lane link modes
net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
net/mlx5: Add new fields to Port Type and Speed register
net/mlx5: Refactor queries to speed fields in Port Type and Speed register
net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
net/mlx5: Relocate vport macros to the vport header file
net/mlx5: E-Switch, Normalize the name of uplink vport number
net/mlx5: Provide an alternative VF upper bound for ECPF
net/mlx5: Add host params change event
net/mlx5: Add query host params command
net/mlx5: Update enable HCA dependency
net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
IB/mlx5: Use unified register/load function for uplink and VF vports
net/mlx5: Use consistent vport num argument type
net/mlx5: Use void pointer as the type in address_of macro
net/mlx5: Align ODP capability function with netdev coding style
mlx5: use RCU lock in mlx5_eq_cq_get()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This series includes misc updates to mlx5 drivers and one ethtool update.
1) From Aya Levin:
- ethtool: Define 50Gbps per lane link modes
- add support for 50Gbps per lane link modes in mlx5 driver
2) From Tariq Toukan,
- Add a helper function to unify mlx5 resource reloading
3) From Vlad Buslov,
- Remove wrong and superfluous tc pedit header type check
4) From Tonghao Zhang,
- Some refactoring in en_tc.c to simplify the mlx5e_tc_add_fdb_flow
5) From Leon Romanovsky & Saeed,
- Compilation warning fixes
6) From Bodong wang,
- E-Switch fixes that are related to the SmarNIC series
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcbH/oAAoJEEg/ir3gV/o+bJQIAKIbILkQsAIn3B3U1Y7WgE8l
4pXdcYe6A3Mr9ZOi+U2ovJqVEHKPc7ASNfBfK5zsRayxhDQmoNhaloZSlwkgxenN
+guf86OBqTmXq9vzk3AFfPbiceQ+ENkM8ZHsv0yc9q8ZgZX5KR3ghG0j7wONDdgv
UOGuHNbO2QPA2d38V+xdwGkr/M8eep2xjYwDqT8+s7F+7BZeRThF2s9NBhE4rgob
KlPXwYTqkx5fiODtlFE9igryi/aSr9iaVIVCOs9ZBwQrETc/WLttWbfMgrMWFW5z
r4sa/84oVe0GVwZ+KPJKjUpXR4x1C3O+vSIAREsZfBDfjADHvGgXyRCYw+xrnq4=
=9kOH
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-02-19
This series includes misc updates to mlx5 drivers and one ethtool update.
1) From Aya Levin:
- ethtool: Define 50Gbps per lane link modes
- add support for 50Gbps per lane link modes in mlx5 driver
2) From Tariq Toukan,
- Add a helper function to unify mlx5 resource reloading
3) From Vlad Buslov,
- Remove wrong and superfluous tc pedit header type check
4) From Tonghao Zhang,
- Some refactoring in en_tc.c to simplify the mlx5e_tc_add_fdb_flow
5) From Leon Romanovsky & Saeed,
- Compilation warning fixes
6) From Bodong wang,
- E-Switch fixes that are related to the SmarNIC series
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When disabling vport, relevant vport configurations will be cleaned
up. These cleanups should be done to the vports which had these configs
applied at vport enablement. As esw manager vport didn't have such
vport config applied, cleanup should not touch it.
Fixes: de9e6a8136c5 ("net/mlx5: E-Switch, Properly refer to host PF vport as other vport")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reported-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When eswitch gets vport data structure, the index should not be out
of the range of the vport array. Driver mistakenly used vport number
to check the range.
Fixes: 22b8ddc86bf4 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
fpga_qpn was assigned but never used and compilation with W=1
produced the following warning:
drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c: In function _mlx5_fpga_event_:
drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c:320:6: warning:
variable _fpga_qpn_ set but not used [-Wunused-but-set-variable]
u32 fpga_qpn;
^~~~~~~~
Fixes: 98db16bab5 ("net/mlx5: FPGA, Handle QP error event")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Compilation with W=1 produces following warning:
drivers/net/ethernet/mellanox/mlx5/core/en/monitor_stats.c:69:6:
warning: no previous prototype for _mlx5e_monitor_counter_start_ [-Wmissing-prototypes]
void mlx5e_monitor_counter_start(struct mlx5e_priv *priv)
^~~~~~~~~~~~~~~~~~~~~~~~~~~
Avoid it by declaring mlx5e_monitor_counter_start() as a static function.
Fixes: 5c7e8bbb02 ("net/mlx5e: Use monitor counters for update stats")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch is a little improvement. Simplify the mlx5e_tc_add_fdb_flow().
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With recent introduction of flow_rule infrastructure drivers no longer
directly include action headers, so it is no longer possible to use
constants defined in them. Instead, one of flow_rule patches substituted
pedit action header constant with hardcoded value '2' in mlx5
set_pedit_val() function conditional which verifies that header type is in
range of values allowed by pedit action. That conditional is now both
wrong (hardcoded value is '2' but __PEDIT_HDR_TYPE_MAX is 6 in current
version) and superfluous (pedit action already verifies that header type is
in allowed range during init). Remove the described check from mlx5 code.
Fixes: 7386788175 ("drivers: net: use flow action infrastructure")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Dmytro Linkin <dmitrolin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Take into a function the common code structure of opening
a side set of channels followed by a call to apply them.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In previous patch, driver added new speed modes: 50Gbps per lane support
for 50G/100G/200G. This patch modifies mlx5e_get_link_ksettings and
mlx5e_set_link_ksettings to set and get these link modes via ethtool.
In order to do so, added mapping of new HW bits to ethtool bitmap and
enforce mutual exclusion between extended link modes and previously
defined link modes.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Combine all HCA capabilities setters under one function
and compile out the ODP related function in case kernel
was compiled without ODP support.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The IP2ME packet trap is triggered by packets hitting local routes.
After evaluating current defaults used by the driver it was decided to
reduce the amount of traffic generated by this trap to 1Kpps and
increase the burst size. This is inline with similarly deployed systems.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a spelling mistake in a en_err error message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a dedicated thermal zone for each QSFP/SFP module. The current
temperature is obtained from the module's temperature sensor and the
trip points are set based on the warning and critical thresholds
read from the module.
A cooling device (fan) is bound to all the thermal zones. The
thermal zone governor is set to user space in order to avoid
collisions between thermal zones.
For example, one thermal zone might want to increase the speed of
the fan, whereas another one would like to decrease it.
Deferring this decision to user space allows the user to the take
the most suitable decision.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function-local variable "delay" enters the loop interpreted as delay
in bits. However, inside the loop it gets overwritten by the result of
mlxsw_sp_pg_buf_delay_get(), and thus leaves the loop as quantity in
cells. Thus on second and further loop iterations, the headroom for a
given priority is configured with a wrong size.
Fix by introducing a loop-local variable, delay_cells. Rename thres to
thres_cells for consistency.
Fixes: f417f04da5 ("mlxsw: spectrum: Refactor port buffer configuration")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bodong Wang says,
BlueField device is a multi-core ARM processor in a highly integrated
system on chip coupled with the ConnectX interconnect controller.
BlueField device can be presented in one out of two modes:
- SEPARATED_HOST: ARM processors as a separated and orthogonal host
like any other external host in the multi-host virtualization model.
- EMBEDDED_CPU: ARM processors as Embedded CPU (EC) and part of the
external hosts virtualization model.
While existing driver already supports the device on separated_host
mode, this patch series focus on the functionalities of embedded_cpu
mode.
On embedded_cpu mode, BlueField device exposes regular network
controller PCI function in the BlueField host(e.g, x86). However, a
separate PCI function called Embedded CPU Physical Function(ECPF) is
also added to the ARM host side, where standard Linux distributions is
able to run on the ARM cores. Depends on the NV configuration from
firmware, ECPF can be the e-switch manager and firmware pages supplier.
If ECPF is configured as e-switch manager and page supplier, it will
take over the responsibilities from the PF on BlueField host includes:
- Owns, controls and manages all e-switch parts, and takes e-switch
traffic by default. It also should perform ENABLE_HCA for the host
PF just like a PF does for its VFs.
- Provides and manages the ICM host memory required for the HCA to
store various contexts for itself, the PF and VFs belong the
e-switch it manages.
The PF on BlueField host side is still responsible for:
- Control its own permanent MAC.
- PCI and SRIOV configurations and perform ENABLE_HCA for its VFs.
The ECPF can also retrieve information about the external host it
controls, like host identifier, PCI BDF and number of virtual functions.
As these parameters may be changed dynamically, an event will be triggered
to the driver on ECPF side.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcZ2amAAoJEEg/ir3gV/o+lzMIALvLNUoD6pXi41MWsOwvmAHg
07mzg1N80Z66MCcFau40I8T3h9NLiRMzNtFrBNtxx+ruwKFUpAjJHjaU0sms0yQH
WVjr35vk4XsZyDPSCJ4g/hCQVlgCT/1tIUvPO0YM9hjhDuVa9mT4wEpucQRDu8bO
KfeXNXLDFnlWxjokhpSVj369ozh+LTv4Kzy0MBBbji97bG6MktGAT8uCimUy7wG0
7dlYimnZ1+iUD1/DZQadiLCUHUu/rTcnvF2+DcdG/nbSU8ydVLgj6vgtIfCYt4e5
kcQO5hmatnl0iJUA8GNpdHQwGjYytKneoGmIfbMAK4KjvFNF+3/8N4Bytoa5kzk=
=DgTl
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Support Mellanox BlueField SmartNIC (mlx5-updates-2019-02-15)
Bodong Wang says,
BlueField device is a multi-core ARM processor in a highly integrated
system on chip coupled with the ConnectX interconnect controller.
BlueField device can be presented in one out of two modes:
- SEPARATED_HOST: ARM processors as a separated and orthogonal host
like any other external host in the multi-host virtualization model.
- EMBEDDED_CPU: ARM processors as Embedded CPU (EC) and part of the
external hosts virtualization model.
While existing driver already supports the device on separated_host
mode, this patch series focus on the functionalities of embedded_cpu
mode.
On embedded_cpu mode, BlueField device exposes regular network
controller PCI function in the BlueField host(e.g, x86). However, a
separate PCI function called Embedded CPU Physical Function(ECPF) is
also added to the ARM host side, where standard Linux distributions is
able to run on the ARM cores. Depends on the NV configuration from
firmware, ECPF can be the e-switch manager and firmware pages supplier.
If ECPF is configured as e-switch manager and page supplier, it will
take over the responsibilities from the PF on BlueField host includes:
- Owns, controls and manages all e-switch parts, and takes e-switch
traffic by default. It also should perform ENABLE_HCA for the host
PF just like a PF does for its VFs.
- Provides and manages the ICM host memory required for the HCA to
store various contexts for itself, the PF and VFs belong the
e-switch it manages.
The PF on BlueField host side is still responsible for:
- Control its own permanent MAC.
- PCI and SRIOV configurations and perform ENABLE_HCA for its VFs.
The ECPF can also retrieve information about the external host it
controls, like host identifier, PCI BDF and number of virtual functions.
As these parameters may be changed dynamically, an event will be triggered
to the driver on ECPF side.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a spelling mistake in several dev_err messages, fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the e-switch driver requires going to legacy mode before
changing to the offloads mode. This makes sense for regular case as
the legacy mode is done by creating VFs.
However, it's problematic when ECPF is the eswitch manager. In such
case, ECPF will control the vports on peer host including the peer
PF and VFs. But ECPF doesn't need and shall not create VFs as the
VFs are created in the peer PF host.
Grant ECPF the ability to change from none to the offloads mode. Note
that currently the only way to go back to none mode is by unloading
the ECPF driver.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When host PF changes the number of VFs, the ECPF esw driver will get
a FW event. It should query the number of VFs enabled by host PF and
update the VF reps accordingly. Note that host PF can't change the
number of VFs dynamically, it has to reset the number of VFs to 0
before changing to a new positive number.
The host event is registered when driver is moving to switchdev mode,
and it's the last step to do in esw_offloads_init. It's unregistered
and the work queue is flushed when driver quits from switchdev mode.
In this way, the host event and devlink command are serialized.
When driver is enabling switchdev mode, pay attention to the following
two facts:
1. Host PF must not have VF initialized as the flow table in ECPF has
ENCAP enabled as default. Such flow table can't be created with
existing initialized VFs.
2. ECPF doesn't know how many VFs the host PF will enable, ECPF
offloads flow steering shall create the flow table/groups based on
the max number of VFs possibly supported by host PF.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
ECPF connects to the eswitch through vport 0xfffe. ECPF may or may
not be the eswitch manager depending on firmware configuration.
1. If ECPF is eswitch manager: ECPF will take over the eswitch manager
responsibility. A rep of the host PF shall be created at the ECPF
side for the eswitch manager to control.
2. If ECPF is not eswitch manager: host PF will be the eswitch manager,
ECPF acts similar as a VF to the host PF. Host PF will be aware
of the ECPF vport presence and control it's rep.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In offloads mode, the current implementation puts the uplink
representor at index zero of the vport reps array. It is not "natural"
to place it at index 0 since we want to put the representor for vport
0 at index 0 with the introduction of SmartNIC. A separate patch will
handle the case whether a rep is needed for vport 0 (PF vport).
So, we want to have a different placeholder for uplink vport and
representor. It was placed at the end of vport and rep array. Since
vport number can no longer act as an index into the vport or
representors arrays, use functions to map vport numbers to indices
when accessing the vports or representors arrays, and vice versa.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eswitch has two users: IB and ETH. They both register repersentors
when mlx5 interface is added, and unregister the repersentors when
mlx5 interface is removed. Ideally, each driver should only deal with
the entities which are unique to itself. However, current IB and ETH
drivers have to perform the following eswitch operations:
1. When registering, specify how many vports to register. This number
is the same for both drivers which is the total available vport
numbers.
2. When unregistering, specify the number of registered vports to do
unregister. Also, unload the repersentors which are already loaded.
It's unnecessary for eswitch driver to hands out the control of above
operations to individual driver users, as they're not unique to each
driver. Instead, such operations should be centralized to eswitch
driver. This consolidates eswitch control flow, and simplified IB and
ETH driver.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the driver loads and unloads all reps in an unbreakable
group. However, with ECPF, the reps of special vports such as uplink
and host PF should always be loaded in switchdev mode where the reps
for VFs will be loaded on-demand and unloaded on no-demand. This is
a pre-step for that change.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the eswitch vport reps have a valid indicator, which is
set on register and unset on unregister. However, a rep can be loaded
or not loaded when doing unregister, current driver checks if the
vport of that rep is enabled as a flag to imply the rep is loaded.
However, for ECPF, this is not valid as the host PF will enable the
vports for its VFs instead.
Add three states: {unregistered, registered, loaded}, with the
following state changes across different operations:
create: (none) -> unregistered
reg: unregistered -> registered
load: registered -> loaded
unload: loaded -> registered
unreg: registered -> unregistered
Note that the state shall only be updated inside eswitch driver rather
than individual drivers such as ETH or IB.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Suggested-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With only PF and VF, it is sufficient to have the vport/rep array
index as the vport number. This is because PF and VF vports numbers
are consecutive serial numbers. In downstream patches with
introducing of ECPF and UPLINK vports, it's not consecutive any more.
Use getter to get specific vport/rep, and use iterator to traversal
a list of vport/rep. This hides the translation between array index
and vport number, and provides flexibility of using different
translation mechanism in the future.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When driver is entering offloads mode, there are two major tasks to
do: initialize flow steering and create representors. Flow steering
should make sure enough flow table/group spaces are reserved for all
reps. Representors will be created in a group, all or none.
With the introduction of ECPF, flow steering should still reserve the
same spaces. But, the representors are not always loaded/unloaded in a
single piece. Once ECPF is in offloads mode, it will get the number
of VF changing event from host PF. In such scenario, only the VF reps
should be loaded/unloaded, not the reps for special vports (such as
the uplink vport).
Thus, when entering offloads mode, driver should specify the total
number of reps, and the number of VF reps separately. When leaving
offloads mode, the cleanup should use the information self-contained
in eswitch such as number of VFs.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
E-switch offloads mode initialize/cleanup multiple steering related
entities (flow table/group). Refactor these operations to internal
helper functions for better block design.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Commands referring to vports use the following scheme:
1. When referring to my own vport, put 0 in vport and 0 in other_vport.
2. When referring to another vport, put the vport number of the
referred vport and put 1 in other_vport. It was assumed that driver
is accessing other vport when vport number is greater than 0.
With the above scheme, the case that ECPF eswitch manager is trying
to access host PF vport will fall over with scheme 1 as the vport
number is 0. This is apparently wrong as driver is trying to refer
other vport.
As such usage can only happen in the eswitch context, change relevant
functions to provide other vport input properly.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In SmartNIC mode, the eswitch manager is not necessarily the PF
(vport 0). Use a helper function to get the correct eswitch manager
vport number and cache on the eswitch instance for fast reference.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When bonding is added, driver assumes that it's RoCE LAG if no VF is
enabled. This is not enough for ECPF as the VF is enabled in host PF
side. LAG should only choose RoCE mode when both slave devices meet
conditions below:
1. E-Switch offloads mode is NONE.
2. No VF is enabled.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Merge mlx5-next shared branched into net-next,
From Bodong Wang:
1) Introduction of ECPF (Embedded CPU Physical Function), and low level
bits for mlx5 SmartNic capabilities support.
2) Vport enumeration refactoring that affect mlx5_ib and mlx5_core
From Aya Levin,
3) Add support for 50Gbps per lane link modes in the Port Type and Speed
register (PTYS)
4) Refactor low level query functions for PTYS register
5) Add support for 50Gbps per lane link modes to mlx5_ib
Note: due to a change in API in mlx5/core and a later patch from net-next,
a fixup was squashed with this merge commit that replaces FDB_UPLINK_VPORT
with MLX5_VPORT_UPLINK which exists only in upstream net-next.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The netfilter conflicts were rather simple overlapping
changes.
However, the cls_tcindex.c stuff was a bit more complex.
On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.
What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.
Signed-off-by: David S. Miller <davem@davemloft.net>