Currently, the e-switch driver requires going to legacy mode before
changing to the offloads mode. This makes sense for regular case as
the legacy mode is done by creating VFs.
However, it's problematic when ECPF is the eswitch manager. In such
case, ECPF will control the vports on peer host including the peer
PF and VFs. But ECPF doesn't need and shall not create VFs as the
VFs are created in the peer PF host.
Grant ECPF the ability to change from none to the offloads mode. Note
that currently the only way to go back to none mode is by unloading
the ECPF driver.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When host PF changes the number of VFs, the ECPF esw driver will get
a FW event. It should query the number of VFs enabled by host PF and
update the VF reps accordingly. Note that host PF can't change the
number of VFs dynamically, it has to reset the number of VFs to 0
before changing to a new positive number.
The host event is registered when driver is moving to switchdev mode,
and it's the last step to do in esw_offloads_init. It's unregistered
and the work queue is flushed when driver quits from switchdev mode.
In this way, the host event and devlink command are serialized.
When driver is enabling switchdev mode, pay attention to the following
two facts:
1. Host PF must not have VF initialized as the flow table in ECPF has
ENCAP enabled as default. Such flow table can't be created with
existing initialized VFs.
2. ECPF doesn't know how many VFs the host PF will enable, ECPF
offloads flow steering shall create the flow table/groups based on
the max number of VFs possibly supported by host PF.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
ECPF connects to the eswitch through vport 0xfffe. ECPF may or may
not be the eswitch manager depending on firmware configuration.
1. If ECPF is eswitch manager: ECPF will take over the eswitch manager
responsibility. A rep of the host PF shall be created at the ECPF
side for the eswitch manager to control.
2. If ECPF is not eswitch manager: host PF will be the eswitch manager,
ECPF acts similar as a VF to the host PF. Host PF will be aware
of the ECPF vport presence and control it's rep.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In offloads mode, the current implementation puts the uplink
representor at index zero of the vport reps array. It is not "natural"
to place it at index 0 since we want to put the representor for vport
0 at index 0 with the introduction of SmartNIC. A separate patch will
handle the case whether a rep is needed for vport 0 (PF vport).
So, we want to have a different placeholder for uplink vport and
representor. It was placed at the end of vport and rep array. Since
vport number can no longer act as an index into the vport or
representors arrays, use functions to map vport numbers to indices
when accessing the vports or representors arrays, and vice versa.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eswitch has two users: IB and ETH. They both register repersentors
when mlx5 interface is added, and unregister the repersentors when
mlx5 interface is removed. Ideally, each driver should only deal with
the entities which are unique to itself. However, current IB and ETH
drivers have to perform the following eswitch operations:
1. When registering, specify how many vports to register. This number
is the same for both drivers which is the total available vport
numbers.
2. When unregistering, specify the number of registered vports to do
unregister. Also, unload the repersentors which are already loaded.
It's unnecessary for eswitch driver to hands out the control of above
operations to individual driver users, as they're not unique to each
driver. Instead, such operations should be centralized to eswitch
driver. This consolidates eswitch control flow, and simplified IB and
ETH driver.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the driver loads and unloads all reps in an unbreakable
group. However, with ECPF, the reps of special vports such as uplink
and host PF should always be loaded in switchdev mode where the reps
for VFs will be loaded on-demand and unloaded on no-demand. This is
a pre-step for that change.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the eswitch vport reps have a valid indicator, which is
set on register and unset on unregister. However, a rep can be loaded
or not loaded when doing unregister, current driver checks if the
vport of that rep is enabled as a flag to imply the rep is loaded.
However, for ECPF, this is not valid as the host PF will enable the
vports for its VFs instead.
Add three states: {unregistered, registered, loaded}, with the
following state changes across different operations:
create: (none) -> unregistered
reg: unregistered -> registered
load: registered -> loaded
unload: loaded -> registered
unreg: registered -> unregistered
Note that the state shall only be updated inside eswitch driver rather
than individual drivers such as ETH or IB.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Suggested-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With only PF and VF, it is sufficient to have the vport/rep array
index as the vport number. This is because PF and VF vports numbers
are consecutive serial numbers. In downstream patches with
introducing of ECPF and UPLINK vports, it's not consecutive any more.
Use getter to get specific vport/rep, and use iterator to traversal
a list of vport/rep. This hides the translation between array index
and vport number, and provides flexibility of using different
translation mechanism in the future.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When driver is entering offloads mode, there are two major tasks to
do: initialize flow steering and create representors. Flow steering
should make sure enough flow table/group spaces are reserved for all
reps. Representors will be created in a group, all or none.
With the introduction of ECPF, flow steering should still reserve the
same spaces. But, the representors are not always loaded/unloaded in a
single piece. Once ECPF is in offloads mode, it will get the number
of VF changing event from host PF. In such scenario, only the VF reps
should be loaded/unloaded, not the reps for special vports (such as
the uplink vport).
Thus, when entering offloads mode, driver should specify the total
number of reps, and the number of VF reps separately. When leaving
offloads mode, the cleanup should use the information self-contained
in eswitch such as number of VFs.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
E-switch offloads mode initialize/cleanup multiple steering related
entities (flow table/group). Refactor these operations to internal
helper functions for better block design.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Commands referring to vports use the following scheme:
1. When referring to my own vport, put 0 in vport and 0 in other_vport.
2. When referring to another vport, put the vport number of the
referred vport and put 1 in other_vport. It was assumed that driver
is accessing other vport when vport number is greater than 0.
With the above scheme, the case that ECPF eswitch manager is trying
to access host PF vport will fall over with scheme 1 as the vport
number is 0. This is apparently wrong as driver is trying to refer
other vport.
As such usage can only happen in the eswitch context, change relevant
functions to provide other vport input properly.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In SmartNIC mode, the eswitch manager is not necessarily the PF
(vport 0). Use a helper function to get the correct eswitch manager
vport number and cache on the eswitch instance for fast reference.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When bonding is added, driver assumes that it's RoCE LAG if no VF is
enabled. This is not enough for ECPF as the VF is enabled in host PF
side. LAG should only choose RoCE mode when both slave devices meet
conditions below:
1. E-Switch offloads mode is NONE.
2. No VF is enabled.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Merge mlx5-next shared branched into net-next,
From Bodong Wang:
1) Introduction of ECPF (Embedded CPU Physical Function), and low level
bits for mlx5 SmartNic capabilities support.
2) Vport enumeration refactoring that affect mlx5_ib and mlx5_core
From Aya Levin,
3) Add support for 50Gbps per lane link modes in the Port Type and Speed
register (PTYS)
4) Refactor low level query functions for PTYS register
5) Add support for 50Gbps per lane link modes to mlx5_ib
Note: due to a change in API in mlx5/core and a later patch from net-next,
a fixup was squashed with this merge commit that replaces FDB_UPLINK_VPORT
with MLX5_VPORT_UPLINK which exists only in upstream net-next.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The netfilter conflicts were rather simple overlapping
changes.
However, the cls_tcindex.c stuff was a bit more complex.
On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.
What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exposes new link modes (including 50Gbps per lane), and ext_*
fields which describes the new link modes in Port Type and Speed
register (PTYS).
Access functions, translation functions (speed <-> HW bits) and
link max speed function were modified.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch fascicles queries to speed related fields in Port Type and
Speed register (PTYS) into a single API. I addition, this patch
refactors functions which serves only Ethernet driver: remove the
protocol type as an input parameter, move code from 'core' directory
into 'en' directory and add 'eth' prefix to the function's name. The
patch also encapsulates functions that are not used outside the Ethernet
driver removes redundant include files.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When dealing with the offloads mode initialization, driver refers to
the number of VFs and add magic number one (1) to take account of the
uplink. This is not clear and will make the code less readable after
adding other vports (e.g. host PF). As these are special vports
compared to VF vports, add a helper macro to denote such special
vports and eliminate the use of magic number.
Moreover, when creating offloads flow table and groups, the driver
reserves two more slots for UC and MC miss rules. Replace this magic
number with a helper macro as well.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
These are two macros in the driver general header which deal with the
number of total vports and if a vport is vport manager. Such macros
are vport entities, better to place them at the vport header file.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Driver used to name uplink vport as FDB_UPLINK_VPORT, it's hard to
comply with the same naming convention along with the introduction of
other vports. Use MLX5_VPORT as the prefix for such vports and
relocate the uplink vport definition to public header file for the
benefits of both net and IB drivers.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In Embedded CPU (EC) configurations, the EC driver needs to know when
the number of virtual functions change on the corresponding PF at the
host side. This is required so the EC driver can create or destroy
representor net devices that represent the VFs ports.
Whenever a change in the number of VFs occurs, firmware will generate an
event towards the EC which will trigger a work to complete the rest of
the handling. The specifics of the handling will be introduced in a
downstream patch.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The QUERY_HOST_PARAMS command is used by an Embedded CPU Physical
Function (ECPF) driver to identify and retrieve information about the
PF on the host side. E.g, number of virtual functions and PCI BDF.
The number of VFs can be changed on the fly, a function is added to
query current number of VFs and will be used in downstream patches.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With the introduction of ECPF, we require that the ECPF driver will
aways call enable/disable HCA for that PF in the same way a PF does
this for its VFs. The PF is still responsible for calling enable and
disable HCA for its VFs.
To distinguish between the ECPF executing enable/disable HCA for
itself or for the PF, it sets the embedded CPU function bit in the
input params struct of these commands. When the bit is cleared and
function ID is zero, it refers to the peer PF.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mellanox's SmartNIC combines embedded CPU(e.g, ARM) processing power
with advanced network offloads to accelerate a multitude of security,
networking and storage applications.
With the introduction of the SmartNIC, there is a new PCI function
called Embedded CPU Physical Function(ECPF). And it's possible for a
PF to get its ICM pages from the ECPF PCI function. Driver shall
identify if it is running on such a function by reading a bit in
the initialization segment.
When firmware asks for pages, it would issue a page request event
specifying how many pages it requests and for which function. That
driver responds with a manage_pages command providing the requested
pages along with an indication for which function it is providing these
pages.
The encoding before this patch was as follows:
function_id == 0: pages are requested for the function receiving
the EQE.
function_id != 0: pages are requested for VF identified by the
function_id value
A new one bit field in the EQE identifies that pages are requested for
the ECPF.
The notion of page_supplier can be introduced here and to support that,
manage pages and query pages were modified so firmware can distinguish
the following cases:
1. Function provides pages for itself
2. PF provides pages for its VF
3. ECPF provides pages to itself
4. ECPF provides pages for another function
This distinction is possible through the introduction of the bit
"embedded_cpu_function" in query_pages, manage_pages and page request
EQE.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Use u16 for vport number, which matches how hardware refers to this
argument throughout commands.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Allow thermal zone binding to an external cooling device from the
cooling devices white list.
It provides support for Mellanox next generation systems on which
cooling device logic is not controlled through the switch registers.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add label attribute to hwmon object for exposing QSFP module's
temperature sensor name. Modules are labeled as "front panel xxx". The
label is used by utilities such as "sensors":
front panel 001: +0.0C (crit = +0.0C, emerg = +0.0C)
..
front panel 020: +31.0C (crit = +70.0C, emerg = +80.0C)
..
front panel 056: +41.0C (crit = +70.0C, emerg = +80.0C)
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new attributes to hwmon object for exposing QSFP module temperature
input, fault indication, critical and emergency thresholds. Temperature
input and fault indication are read from Management Temperature Bulk
Register. Temperature thresholds are read from Management Cable Info
Access Register.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new fan hwmon attribute for exposing fan faults (fault indication is
read from Fan Out of Range Event Register).
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename cooling device from "Fan" to "mlxsw_fan". Name "Fan" is too
common name, and such name is misleading, while it's interpreted by
user. For example name "Fan" could be used by ACPI.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace thermal hardcoded temperature trip values with defines.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify thermal zone trip points setting for better alignment with system
thermal requirement.
Add hysteresis thresholds for thermal trips in order to avoid throttling
around thermal trip point. If hysteresis temperature is not considered,
PWM can have side effect of flip up/down on thermal trip point boundary.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add low frequency bus capability in order to allow core functionality
separation based on bus type. Driver could run over PCIe, which is
considered as high frequency bus or I2C, which is considered as low
frequency bus. In the last case time setting, for example, for thermal
polling interval, should be increased.
Use different thermal monitoring based on bus type. For I2C bus time is
set to 20 seconds, while for PCIe 1 second polling interval is used.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new API to read QSFP module's temperature thresholds - warning and
critical.
New internal API reads the temperature thresholds from the modules,
which are equipped with the thermal sensor. These thresholds will be
exposed via hwmon subsystem.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add FORE (Fan Out of Range Event Register), which is used for fan fault
reading.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add MTBR (Management Temperature Bulk Register), which is used for port
temperature reading in a bulk mode.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move QSFP EEPROM definitions to common location from the spectrum driver
in order to make them available for other mlxsw modules. They are common
for all kind of chips and have relation to SFF specifications 8024,
8436, 8472, 8636, rather than to chip type.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mlx5 driver creates xdp redirect hw queues unconditionally on
netdevice open, This is great until someone starts redirecting XDP traffic
via ndo_xdp_xmit on mlx5 device and changes the device configuration at
the same time, this might cause crashes, since the other device's napi
is not aware of the mlx5 state change (resources un-availability).
To fix this we must synchronize with other devices napi's on the system.
Added a new flag under mlx5e_priv to determine XDP TX resources are
available, set/clear it up when necessary and use synchronize_rcu()
when the flag is turned off, so other napi's are in-sync with it, before
we actually cleanup the hw resources.
The flag is tested prior to committing to transmit on mlx5e_xdp_xmit, and
it is sufficient to determine if it safe to transmit or not. The other
two internal flags (MLX5E_STATE_OPENED and MLX5E_SQ_STATE_ENABLED) become
unnecessary. Thus, they are removed from data path.
Fixes: 58b99ee3e3 ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eliminate the following compilation warning:
drivers/net/ethernet/mellanox/mlx5/core/events.c: warning: 'error_str'
may be used uninitialized in this function [-Wuninitialized]: => 238:3
Fixes: c2fb3db22d ("net/mlx5: Rework handling of port module events")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When EEH is injected and PCI bus stalls, mlx5's pci error detect
function is called to deactivate the command interface and tear down
the device. The issue is that there can be a thread that already
passed MLX5_DEVICE_STATE_INTERNAL_ERROR check, it will send the command
and stuck in the wait_func.
Solution:
Add function mlx5_cmd_flush to disable command interface and clear all
the pending commands. When device state is set to
MLX5_DEVICE_STATE_INTERNAL_ERROR, call mlx5_cmd_flush to ensure all
pending threads waiting for firmware commands completion are terminated.
Fixes: c1d4d2e92a ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
New channels are applied to the priv channels only after they
are successfully opened. Then, the indirection table should be built
according to the new number of channels.
Currently, such build is preformed independently of whether the
channels opening is successful, and is not reverted on failure.
The bug is caused due to removal of rss params from channels struct
and moving it to priv struct. That change cause to independency between
channels and rss params.
This causes a crash on a later point, when accessing rqn of a non
existing channel.
This patch fixes it by moving the indirection table build right before
switching the priv channels to new channels struct, after the new set of
channels was successfully opened.
Fixes: bbeb53b8b2 ("net/mlx5e: Move RSS params to a dedicated struct")
Signed-off-by: Maria Pasechnik <mariap@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Update newly introduced function to be aligned to netdev coding style.
Fixes: 46861e3e88 ("net/mlx5: Set ODP SRQ support in firmware")
Reported-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
After 1ecb195753 ("mlxsw: spectrum_switchdev: Remove getting
PORT_BRIDGE_FLAGS") we are not accessing any driver private data
structure, so the mlxsw_sp_port and mlxsw_sp variables are unused.
Fixes: 1ecb195753 ("mlxsw: spectrum_switchdev: Remove getting PORT_BRIDGE_FLAGS")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, it is not guaranteed. For example, switches might
choose to make other use of these octets.
This repeatedly causes kernel hardware checksum fault.
Prior to the cited commit below, skb checksum was forced to be
CHECKSUM_NONE when padding is detected. After it, we need to keep
skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
verify and parse IP headers, it does not worth the effort as the packets
are so small that CHECKSUM_COMPLETE has no significant advantage.
Future work: when reporting checksum complete is not an option for
IP non-TCP/UDP packets, we can actually fallback to report checksum
unnecessary, by looking at cqe IPOK bit.
Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no code that will query the SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS
attribute remove support for that.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, it is not guaranteed. For example, switches might
choose to make other use of these octets.
This repeatedly causes kernel hardware checksum fault.
Prior to the cited commit below, skb checksum was forced to be
CHECKSUM_NONE when padding is detected. After it, we need to keep
skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
verify and parse IP headers, it does not worth the effort as the packets
are so small that CHECKSUM_COMPLETE has no significant advantage.
Future work: when reporting checksum complete is not an option for
IP non-TCP/UDP packets, we can actually fallback to report checksum
unnecessary, by looking at cqe IPOK bit.
Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver does not support VLAN push and pop, but only VLAN modify.
Fixes: 7386788175 ("drivers: net: use flow action infrastructure")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case the register access failed an error would be logged anyway, so
we can drop the warning.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LAG port collecting (receive) function was mistakenly set when the
port was registered as a LAG member, while it should be set only when
the port collection state is set to true. Set LAG port to collecting
when it is set to distributing, as described in the IEEE link
aggregation standard coupled control mux machine state diagram.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
gets a lot of contentions when we test some heavy workload
with 60 RX queues and 80 CPU's, and it is clearly shown in the
flame graph.
In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
we don't have to take a spinlock on this hot path. This is pretty
much similar to commit 291c566a28
("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
are still serialized with the spinlock, and with synchronize_irq()
it should be safe to just move the fast path to RCU read lock.
This patch itself reduces the latency by about 50% for our memcached
workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
this spinlock gets some rework in commit 02d92f7903
("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kzalloc(size, GFP_KERNEL)
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL)
Notice that, in this case, variable alloc_size is not necessary, hence
it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As vregion rehash is happening in delayed work, add some visibility to
the process using a few tracepoints.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose new driver-specific "acl_region_rehash_interval" devlink param
which would allow user to alter default ACL region rehash interval.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the hints are returned, the migration should be started. For that to
happen, there is a need to create a second physical region in TCAM with
new ERP set by passing the hints and then move chunk by chunk,
entry by entry.
During the transition, two lookups will occur. One in old region and
another in new region. The highest priority rule will be chosen.
In an unlikely case that the migration will fail and also rollback to
original region will fail the vregion will become in bad state.
Everything will work, only no future rehash will be possible. In a
follow-up work, this can be resolved by trying to resume the rollback
in delayed work and repair the vregion.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For Spectrum-2 this allows parallel lookups in multiple regions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hints priv comes from ERP code and it is possible to obtain it from
TCAM code. Add arg to appropriate functions so the hints
priv could be passed back down to ERP code. Pass NULL now as the
follow-up patches would pass an actual hints priv pointer.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an initial implementation of rehash logic in ERP code.
Currently, the rehash is considered as needed only in case number of
roots in the hints is smaller than the number of roots actually in use.
In that case return hints pointer and let it be obtained through the
callpath through the Spectrum-2 TCAM op.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the split of entry struct so the new entry struct is related to the
actual HW entry, whereas ventry struct is a SW abstration of that.
This split prepares possibility for ventry to hold 2 HW entries which
is needed for region ERP rehash flow.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the split of chunk struct so the new chunk struct is related to the
actual HW chunk (differs between Spectrum and Spectrum-2),
whereas vchunk struct is a SW abstraction of that. This split
prepares possibility for vchunk to hold 2 HW chunks which is needed
for region ERP rehash flow.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the split of region struct so the new region struct is related to the
actual HW region, whereas vregion struct is a SW abstration of that.
This split prepares possibility for vregion to hold 2 HW regions which
is needed for region ERP rehash flow.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement simple greedy algo to find more optimized root-delta tree for
a given objagg instance. This "hints" can be used by a driver to:
1) check if the hints are better (driver's choice) than the original
objagg tree. Driver does comparison of objagg stats and hints stats.
2) use the hints to create a new objagg instance which will construct
the root-delta tree according to the passed hints. Currently, only a
simple greedy algorithm is implemented. Basically it finds the roots
according to the maximal possible user count including deltas.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, user can do dump or get of param values right after the
devlink params are registered. However the driver may not be initialized
which is an issue. The same problem happens during notification
upon param registration. Allow driver to publish devlink params
whenever it is ready to handle get() ops. Note that this cannot
be resolved by init reordering, as the "driverinit" params have
to be available before the driver is initialized (it needs the param
values there).
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.
Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, ndo_tx_timeout callback will be redirected to the tx
reporter in order to detect a tx timeout error and report it to the
devlink health. (The watchdog detects tx timeouts, but the driver verify
the issue still exists before launching any recover method).
In addition, recover from tx timeout in case of lost interrupt was added
to the tx reporter recover method. The tx timeout recover from lost
interrupt is not a new feature in the driver, this patch re-organize the
functionality and move it to the tx reporter recovery flow.
tx timeout example:
(with auto_recover set to false, if set to true, the manual recover and
diagnose sections are irrelevant)
$cat /sys/kernel/debug/tracing/trace
...
devlink_health_report: bus_name=pci dev_name=0000:00:09.0
driver_name=mlx5_core reporter_name=tx: TX timeout on queue: 0, SQ: 0x8a,
CQ: 0x35, SQ Cons: 0x2 SQ Prod: 0x2, usecs since last trans: 14912000
$devlink health show
pci/0000:00:09.0:
name tx
state healthy #err 1 #recover 0 last_dump_ts N/A
parameters:
grace_period 500 auto_recover false
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": true
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: true
sqn: 142 HW state: 1 stopped: false
$devlink health recover pci/0000:00:09 reporter tx
$devlink health show
pci/0000:00:09.0:
name tx
state healthy #err 1 #recover 1 last_dump_ts N/A
parameters:
grace_period 500 auto_recover false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add mlx5e tx reporter to devlink health reporters. This reporter will be
responsible for diagnosing, reporting and recovering of tx errors.
This patch declares the TX reporter operations and creates it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQE only. In addition, it adds diagnose
information for the open SQs.
For a local SQ recover (due to driver error report), in case of SQ recover
failure, the recover operation will be considered as a failure.
For a full tx recover, an attempt to close and open the channels will be
done. If this one passed successfully, it will be considered as a
successful recover.
The SQ recover from error CQE flow is not a new feature in the driver,
this patch re-organize the functions and adapt them for the devlink
health API. For this purpose, move code from en_main.c to a new file
named reporter_tx.c.
Diagnose output:
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": false
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: false
sqn: 142 HW state: 1 stopped: false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some case, we may use multiple pedit actions to modify packets.
The command shown as below: the last pedit action is effective.
$ tc filter add dev netdev_rep parent ffff: protocol ip prio 1 \
flower skip_sw ip_proto icmp dst_ip 3.3.3.3 \
action pedit ex munge ip dst set 192.168.1.100 pipe \
action pedit ex munge eth src set 00:00:00:00:00:01 pipe \
action pedit ex munge eth dst set 00:00:00:00:00:02 pipe \
action csum ip pipe \
action tunnel_key set src_ip 1.1.1.100 dst_ip 1.1.1.200 dst_port 4789 id 100 \
action mirred egress redirect dev vxlan0
To fix it, we add max_mod_hdr_actions to mlx5e_tc_flow_parse_attr struction,
max_mod_hdr_actions will store the max pedit action number we support and
num_mod_hdr_actions indicates how many pedit action we used, and store all
pedit action to mod_hdr_actions.
Fixes: d79b6df6b1 ("net/mlx5e: Add parsing of TC pedit actions to HW format")
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we offload tc filters to hardware, hardware flows can
be updated when mac of encap destination ip is changed.
But we ignore one case, that the mac of local encap ip can
be changed too, so we should also update them.
To fix it, add route_dev in mlx5e_encap_entry struct to save
the local encap netdevice, and when mac changed, kernel will
flush all the neighbour on the netdevice and send NETEVENT_NEIGH_UPDATE
event. The mlx5 driver will delete the flows and add them when neighbour
available again.
Fixes: 232c001398 ("net/mlx5e: Add support to neighbour update flow")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new FIB entry type for blackhole routes and set it in case the
type of the notified route is 'RTN_BLACKHOLE'.
Program such routes with a discard action and mark them as offloaded
since the device is dropping the packets instead of the kernel.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw implements SWITCHDEV_ATTR_ID_PORT_PARENT_ID and we want to get rid
of switchdev_ops eventually, ease that migration by implementing a
ndo_get_port_parent_id() function which returns what
switchdev_port_attr_get() would do.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5e only supports SWITCHDEV_ATTR_ID_PORT_PARENT_ID, which makes it a
great candidate to be converted to use the ndo_get_port_parent_id() NDO
instead of implementing switchdev_port_attr_get().
Since mlx5e makes use of switchdev_port_parent_id() convert it to use
netdev_port_same_parent_id().
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trace EMAD errors returned from HW.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates drivers to use the new flow action infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides the flow_stats structure that acts as container for
tc_cls_flower_offload, then we can use to restore the statistics on the
existing TC actions. Hence, tcf_exts_stats_update() is not used from
drivers anymore.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds pedit_headers_action structure to store the result of
parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
populate the mlx5e hardware intermediate representation once all actions
have been parsed.
This patch comes in preparation for the new flow_action infrastructure,
where each packet mangling comes in an separated action, ie. not packed
as in tc pedit.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch wraps the dissector key and mask - that flower uses to
represent the matching side - around the flow_match structure.
To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.
This introduces two new interfaces:
bool flow_rule_match_key(rule, dissector_id)
that returns true if a given matching key is set on, and:
flow_rule_match_XYZ(rule, &match);
To fetch the matching side XYZ into the match container structure, to
retrieve the key and the mask with one single call.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In packets that need to be decaped the internal headers
have to be checked, not the external ones.
Fixes: bdd66ac0ae ("net/mlx5e: Disallow TC offloading of unsupported match/action combinations")
Signed-off-by: Guy Shattah <sguy@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The match level computed by the driver gets to be wrong for decap
rules with wildcarded inner packet match such as:
tc filter add dev vxlan_sys_4789 protocol all parent ffff: prio 2 flower
enc_dst_ip 192.168.0.9 enc_key_id 100 enc_dst_port 4789
action tunnel_key unset
action mirred egress redirect dev eth1
The FW errs for a missing matching meta-data indicator for the outer
headers (where we do have a match), and a wrong matching meta-data
indicator for the inner headers (where we don't have a match).
Fix that by taking into account the matching on the tunnel info and
relating the match level of the encapsulated packet to the firmware
inner headers indicator in case of decap.
As for vxlan we mandate a match on the tunnel udp dst port, and in general
we practically madndate a match on the source or dest ip for any IP tunnel,
the fix was done in a minimal manner around the tunnel match parsing code.
Fixes: d708f90298 ('net/mlx5e: Get the required HW match level while parsing TC flow matches')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Slava Ovsiienko <viacheslavo@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
At Innova IPsec TX offload data path a special software parser metadata
is used to pass some packet attributes to the hardware, this metadata
is passed using the Ethernet control segment of a WQE (a HW descriptor)
header.
The cited commit might nullify this header, hence the metadata is lost,
this caused a significant performance drop during hw offloading
operation.
Fix by restoring the metadata at the Ethernet control segment in case
it was nullified.
Fixes: 37fdffb217 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")
Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add the tab before '}' and keep the code style consistent.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxXYaEeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkSQH/2yrfnviNPFYpZOR
QQdc71Bfhkd8m85SmWIsSebkxmi3hKFVj15sGbWXd6+0/VxjEEGvQCZpvVwJceke
LwDxtkKGg/74wAqJvlSAWxFNZ+Had4jDeoSoeQChddsBVXBBCxQx2v6ECg3o2x7W
k8Z8t4+3RijDf8fYXY9ETyO2zW8R/wgT+dnl+DPgUH7u4dxh7FzAUfc4bgZIDg+i
FzBQfbTJuz4BU7uRZ9IJiwhWKv0Iyi2DR3BY8Z1pqEpRaUMJMrCs2WGytHbTgt9e
0EtO1airbVneU4eumU/ZaF9cyEbah9HousEPnP7J09WG4s/Odxc4zE+uK1QqS2im
5Xv88is=
=dVd1
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc5' into rdma.git for-next
Linux 5.0-rc5
Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
Shared buffer allocation is usually done in cell increments.
Drivers will either round up the allocation or refuse the
configuration if it's not an exact multiple of cell size.
Drivers know exactly the cell size of shared buffer, so help
out users by providing this information in dumps.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid compatibility issue with older kernels the firmware doesn't
allow SRQ to work with ODP unless kernel asks for it.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Add some visibility to the rule addition process and trace whenever rule
spilled into C-TCAM.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently only ERP mask masked bits in key are considered for
the hashtable key. That leads to false negative collisions
and fallbacks to C-TCAM in case two keys differ only in delta bits.
Fix this by taking full encoded key as a hashtable key,
including delta bits.
Reported-by: Nir Dotan <nird@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add more extack messages that let the user know why VXLAN offload
failed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: David Ahern <dsahern@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the struct to the place where they belong, alongside with the rest
of the MR code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to pass ruleset/group and chunk pointers on action_replace call
path, nobody uses them.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcS2rdAAoJEEg/ir3gV/o+CZ8IAMX4yWUWjzgrIDdOc1OEgHJQ
FDCMtTDm/I+m1UInAgoigRia2RyIL99N81Vrx+p+otCOml/U5IXPt5tuJfSNQYZ3
p9MibQSXtKJ9jhowOVZdYlTx9CXti75BS45wHKZPlL7V+gCC2Q7/8aviuRlkQKAN
4pQNDreNiiB8PznXG/yCyzEu0Kc9M/Bv89xdLynVpDTxBfEfI8xSv4PttUcAxaQ5
Fs/489wuODyhAczOquQj1dvn61zUamaPp6z0h+GbZOCQv0AO7ABRZl+7o5C5r+rA
i7R4ak6WIgW3snIzhCdpynPfas8kUIvxJjkNCZuoFsefrBRsNv5RvtfCkuzyms4=
=3epF
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2019-01-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2019-01-25
This series introduces some fixes to mlx5 driver.
For more information please see tag log below.
Please pull and let me know if there is any problem.
For -stable v4.13
('net/mlx5e: Allow MAC invalidation while spoofchk is ON')
For -stable v4.18
('Revert "net/mlx5e: E-Switch, Initialize eswitch only if eswitch manager"')
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Representors software stats are basic, this patch is reusing the
mlx5e_fold_sw_stats in representors, which sums up the basic stats64 for a
mlx5e netdevice.
Fixes: 8bfaf07f78 ("net/mlx5e: Present SW stats when state is not opened")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This behavior is already adopted for all other cases in the cited patch.
The representor's functions were missed, here we modify the them to
behave similarly.
Fixes: 8bfaf07f78 ("net/mlx5e: Present SW stats when state is not opened")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5e_grp_sw_update_stats can be called from two threads,
1) ndo_get_stats64
2) get_ethtool_stats
For this reason and to minimize concurrency issue impact on 64bit machines
mlx5e_grp_sw_update_stats folds the software stats into a temporary
variable then copies it to the global driver stats, both ethtool and ndo
statistics callbacks will use the global software stats variable to report
whatever stats they need.
Actually ndo_get_stats64 doesn't need to fold the whole software stats
(mlx5e_grp_sw_update_stats), all it needs is five counters to fill the
rtnl_link_stats64 relevant stats parameter.
Hence this patch introduces a simpler helper function to fold software
stats for ndo_get_stats64 which will work directly on rtnl_link_stats64
stats parameter and not on the global or even temporary mlx5e_sw_stats
variable.
Since now mlx5e_grp_sw_update_stats is not called by ndo_get_stats64 we
can make it static and remove the temp var.
Unlike mlx5e_grp_sw_update_stats the new fold stats function doesn't
need to zero out the output statistics parameter since it is already
done by the stack @dev_get_stats().
This patch is fixing stack usage of mlx5e_grp_sw_update_stats on
x86 gcc-4.9 and higher, the concurrency issue between mlx5's
ndo_get_stats64 and get_ethtool_stats is resolved as well.
Fixes: 8bfaf07f78 ("net/mlx5e: Present SW stats when state is not opened")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
We were not tracking flow tables so far, add it up.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This confusing construction confuses the compiler which can't see
that flow is initialized if !err:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c: In function `mlx5e_configure_flower`
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:2727:28: warning:
`flow` may be used uninitialized in this function
[-Wmaybe-uninitialized]
There is no reason for two function outputs, just return the
pointer directly and use ERR_PTR to encode a failure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently we have one cpu in XPS cpumask per tx queue, this is good
enough for default configuration where there is a tx queue per cpu.
However, once configuration changes to use less tx queues, part of the
cpus are not XPS-mapped and so the select queue decision falls back to
hash calculation and balancing is not guaranteed.
Expand XPS cpumask to enable using all cpus even when number of tx
queues is smaller than number of cpus.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Only the Receive CQ makes use of these fields. Take them
out into a separate struct and save space in the generic
CQ structure.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross page boundary. This requires special care in fast path
that costs LoC, additional runtime instructions and branches.
It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased form 64B to 256B.
Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.
Configuration:
Set Striding RQ and LRO ON (to enabled the non-linear SKB scheme).
GRO OFF, early drop by TC rule.
64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
branches tests.
192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.
256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
It turns out that libvirt uses 0-vid as a default if no vlan was
set for the guest (which is the case for switchdev mode) and errs
if we disallow that:
error: Failed to start domain vm75
error: Cannot set interface MAC/vlanid to 6a:66:2d:48:92:c2/0 \
for ifname enp59s0f0 vf 0: Operation not supported
So allow this in order not to break existing systems.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Maor Dickman <maord@mellanox.com>
Reviewed-by: Gavi Teitz <gavi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With VF LAG commit 491c37e49b "net/mlx5e: In case of LAG, one switch
parent id is used for all representors", both uplinks and all the VFs
(on both of them) get the same switchdev id.
This cause the provisioning system method to identify the rep of a given
VF from the parent PF PCI device using switchev id and physical port
name to break, since VFm of PF0 will have the (id, name) as VFm of PF1.
To fix that, we align to use the framework agreed upstream and set by
nfp commit 168c478e10 "nfp: wire get_phys_port_name on representors":
$ cat /sys/class/net/eth4_*/phys_port_name
p0
pf0vf0
pf0vf1
Now, the names will be different, e.g. pf0vf0 vs. pf1vf0.
Fixes: 491c37e49b ("net/mlx5e: In case of LAG, one switch parent id is used for all representors")
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Waleed Musa <waleedm@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Prior to this patch the driver prohibited spoof checking on invalid MAC.
Now the user can set this configuration if it wishes to.
This is required since libvirt might invalidate the VF Mac by setting it
to zero, while spoofcheck is ON.
Fixes: 1ab2068a4c ("net/mlx5: Implement vports admin state backup/restore")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The lock in qp_table might be taken from process context or from
interrupt context. This may lead to a deadlock unless it is taken with
IRQs disabled.
Discovered by lockdep
================================
WARNING: inconsistent lock state
4.20.0-rc6
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W}
python/12572 [HC1[1]:SC0[0]:HE0:SE1] takes:
00000000052a4df4 (&(&table->lock)->rlock#2){?.+.}, /0x50 [mlx5_core]
{HARDIRQ-ON-W} state was registered at:
_raw_spin_lock+0x33/0x70
mlx5_get_rsc+0x1a/0x50 [mlx5_core]
mlx5_ib_eqe_pf_action+0x493/0x1be0 [mlx5_ib]
process_one_work+0x90c/0x1820
worker_thread+0x87/0xbb0
kthread+0x320/0x3e0
ret_from_fork+0x24/0x30
irq event stamp: 103928
hardirqs last enabled at (103927): [] nk+0x1a/0x1c
hardirqs last disabled at (103928): [] unk+0x1a/0x1c
softirqs last enabled at (103924): [] tcp_sendmsg+0x31/0x40
softirqs last disabled at (103922): [] 80
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&table->lock)->rlock#2);
lock(&(&table->lock)->rlock#2);
*** DEADLOCK ***
Fixes: 032080ab43 ("IB/mlx5: Lock QP during page fault handling")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
MLX5E_PFLAG_* definitions were changed from bitmask to enumerated
values. However, in mlx5e_open_rq(), the proper API (MLX5E_GET_PFLAG macro)
was not used to read the flag value of MLX5E_PFLAG_RX_NO_CSUM_COMPLETE.
Fixed it.
Fixes: 8ff57c18e9 ("net/mlx5e: Improve ethtool private-flags code structure")
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This reverts commit 5f5991f36d.
With the original commit, eswitch instance will not be initialized for
a function which is vport group manager but not eswitch manager such as
host PF on SmartNIC (BlueField) card. This will result in a kernel crash
when such a vport group manager is trying to access vports in its group.
E.g, PF vport manager (not eswitch manager) tries to configure the MAC
of its VF vport, a kernel trace will happen similar as bellow:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
RIP: 0010:mlx5_eswitch_get_vport_config+0xc/0x180 [mlx5_core]
...
Fixes: 5f5991f36d ("net/mlx5e: E-Switch, Initialize eswitch only if eswitch manager")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reported-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Soften the memory barrier call of mb() by a sufficient wmb() in the
consumer index update of the event queues.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Procedure mlx4_init_user_cqes() handles returns by copy_to_user
incorrectly. copy_to_user() returns the number of bytes not copied.
Thus, a non-zero return should be treated as a -EFAULT error
(as is done elsewhere in the kernel). However, mlx4_init_user_cqes()
error handling simply returns the number of bytes not copied
(instead of -EFAULT).
Note, though, that this is a harmless bug: procedure mlx4_alloc_cq()
(which is the only caller of mlx4_init_user_cqes()) treats any
non-zero return as an error, but that returned error value is processed
internally, and not passed further up the call stack.
In addition, fixes the following sparse warning:
warning: incorrect type in argument 1 (different address spaces)
expected void [noderef] <asn:1>*to
got void *buf
Fixes: e45678973d ("{net, IB}/mlx4: Initialize CQ buffers in the driver when possible")
Reported by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Driver reads the query HCA capabilities without the corresponding masks.
Without the correct masks, the base addresses of the queues are
unaligned. In addition some reserved bits were wrongly read. Using the
correct masks, ensures alignment of the base addresses and allows future
firmware versions safe use of the reserved bits.
Fixes: ab9c17a009 ("mlx4_core: Modify driver initialization flow to accommodate SRIOV for Ethernet")
Fixes: 0ff1fb654b ("{NET, IB}/mlx4: Add device managed flow steering firmware API")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling pci_enable_atomic_ops_to_root enables AtomicOp requests to pci
root port.
AtomicOp requests will be enabled only if the completer and all
intermediate pci bridges support PCI atomic operations.
This, together with appropriate settings in the NVCONFIG should enable
PCI atomic operations on the device.
PCI atomic operations were first introduced in PCI Express Base Specification
2.1. The Supported operations are Swap (Unconditional Swap), CAS (Compare and
Swap) and FetchAdd (Fetch and Add).
Unlike other atomic operation modes PCI atomic operations gives the user
the option to do atomic operations on local memory, without involving verbs
api, without it compromising the operation's atomicity.
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
APIs that have deferred callbacks should have some kind of cleanup
function that callers can use to fence the callbacks. Otherwise things
like module unloading can lead to dangling function pointers, or worse.
The IB MR code is the only place that calls this function and had a
really poor attempt at creating this fence. Provide a good version in
the core code as future patches will add more places that need this
fence.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Enable VXLAN on Spectrum-2 as previous patches added the required
functionality.
Note that for now Spectrum-1 and Spectrum-2 use the same function to
determine whether the VXLAN configuration is valid or not. In the
future, when the driver will be extended to support features not present
in Spectrum-1, two different functions will be needed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-1 and Spectrum-2 are largely backward compatible with regards
to VXLAN. One difference - as explained in previous patch - is that an
underlay RIF needs to be specified instead of an underlay VR during NVE
initialization. This is accomplished by calling the relevant function
that returns the index of such a RIF based on the table ID
(RT_TABLE_MAIN) where underlay look up occurs.
The second difference is that VXLAN learning (snooping) is controlled
via a different register (TNPC).
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The configuration of a VXLAN tunnel in Spectrum-1 and Spectrum-2 is
largely the same. To avoid code duplication, breakout the common parts
to a common function that can be invoked from the ASIC-specific code.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In Spectrum-2, instead of providing the ID of the virtual router (VR)
where NVE underlay lookups will occur as in Spectrum-1, the ID of a
router interface (RIF) in this VR is required.
Expose functions to create and destroy such a RIF.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch fixes the following warning:
drivers/net/ethernet/mellanox/mlx4/eq.c: In function ‘mlx4_eq_int’:
drivers/net/ethernet/mellanox/mlx4/mlx4.h:219:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (mlx4_debug_level) \
^
drivers/net/ethernet/mellanox/mlx4/eq.c:558:4: note: in expansion of macro ‘mlx4_dbg’
mlx4_dbg(dev, "%s: MLX4_EVENT_TYPE_SRQ_LIMIT. srq_no=0x%x, eq 0x%x\n",
^~~~~~~~
drivers/net/ethernet/mellanox/mlx4/eq.c:561:3: note: here
case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 GRE tunnel implementation requires a specific underlay RIF that
points to the virtual router used for forwarding the encapsulated packet.
Add Spectrum-2 specific loopback router interface creation methods which
may create or reuse the dedicated underlay RIF.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 requires to specify the egress RIF when setting tunnel decap
properties. Add a method for accessing the underlay RIF index and then use
it when setting decap properties.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 underlay RIF is merely an auxiliary RIF that points to the
virtual router used for encapsulated packets lookup. It exists only when
its overlay RIF exists but may be shared with other overlay RIFs.
Hence it is undesired to mark any device as related to it.
Therefore allow usage of NULL device when allocating RIF.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the sake of Spectrum-2 GRE support, as ul_vr_id field is reserved for
Spectrum-2, Change mlxsw_sp_ipip_lb_ul_vr_id() implementation not to use
the reserved field.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 GRE tunnels underlay should be given not only the virtual router
information for an encapsulated packet lookup, but also an underlay RIF
object which belongs to a virtual router.
Therefore add ul_rif_id field in struct mlxsw_sp_rif_ipip_lb, to be used
later in Spectrum-2 underlay RIF implementation. This field complements
ul_vr_id field, already present and defined as reserved for Spectrum-2.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The presence of an allocated RIF in mlxsw_sp->router->rifs[rif_index] marks
that rif_index as taken.
Set the marking of a taken RIF to happen before calling ops->create in
order to allow creation of a GRE underlay RIF, which may be allocated and
created as part of an overlay RIF creation.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In Spectrum-2, the underlay routing table is pointed by an underlay router
interface in contrary to Spectrum where only an underlay virtual router
should be set. That makes the underlay virtual router field in RITR
reserved for Spectrum-2.
Change loopback RIF creation function to support the new underlay RIF
field, however leave this field reserved for Spectrum-1 updates.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set RIF ops array as member of mlxsw_sp in order to control which RIF
operations callbacks are called per ASIC type. This is needed to control
per ASIC handling of loopback RIF configurations.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split RIF ops array for Spectrum-1 and Spectrum-2 callbacks in order to
support different sets of operations for loopback RIF handling, as
underlying implementation differs between the ASICs.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In Spectrum-2 we need to specify the underlay egress router interface
when performing IP-in-IP and NVE packet decapsulation in the underlay
router.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fields relevant for Spectrum-2 Loopback IPinIP router interface
creation. Add additional Loopback RIF protocol value - Generic, used for
creation of an explicit underlay RIF, and also add a field named
underlay_rif used for specifying the underlay RIF of a tunnel.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcQmwjAAoJEEg/ir3gV/o+aBEIAIE8q3gEonMEPLnRoYSBlgUy
ZBl7/51yzC8CVdsdO7hx5dyyiI+gYq10Jp6TVpqA2i/V8P871T/u6+xRhP3T7dOc
Fq2YE70NdkCqFkpyc0QONzMC7ypwVqIEJrX7KnuLi5Ybsm9+tDia8AgC0NX/0H3U
bRgfzpupfK0TiLgtBe7kqC2WTo+bPzu0cEUu7xjQqUgZZie5QPyzW/6cEtn/V75+
A3btMNS5w0cfYK4mR4MMuc+UT4rSbsdyIZQrzICo75C2UnVMlojDf90+RuW4JFVw
tuOMA1LCB5l3lphgEEEerV2dZvs0ZJEuoYoFUkfgu2QEFSOMhod71t8/rurih0s=
=f23b
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2019-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2019-01-18
This series introduces some fixes to mlx5 driver.
Please pull and let me know if there is any problem.
For -stable v4.18
('net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames')
The patch doesn't apply cleanly to 4.18.y, but it is very simple to
resolve, what should be the procedure here ?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously the identifier used for indirect block callback registry
and for block rule cb registry (when done via indirect blocks) was the
pointer to the tunnel netdev we were interested in receiving updates on.
This worked fine if a single PF existed that registered one callback for
the tunnel netdev of interest. However, if multiple PFs are in place then
the 2nd PF tries to register with the same tunnel netdev identifier. This
leads to EEXIST errors and/or incorrect cb deletions.
Prevent this conflict by using the rpriv pointer as the identifier for
netdev indirect block cb registry, allowing each PF to register a unique
callback per tunnel netdev. For block cb registry, the same PF may
register multiple cbs to the same block if using TC shared blocks.
Instead of the rpriv, use the pointer to the allocated indr_priv data as
the identifier here. This means that there can be a unique block callback
for each PF/tunnel netdev combo.
Fixes: f5bc2c5de1 ("net/mlx5e: Support TC indirect block notifications
for eswitch uplink reprs")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
For representors, the TX dropped counter is not folded from the
per-ring counters. Fix it.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, we have a switch which pads non-zero octets, this
causes kernel hardware checksum fault repeatedly.
Prior to:
commit '88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ...")'
skb checksum was forced to be CHECKSUM_NONE when padding is detected.
After it, we need to keep skb->csum updated, like what we do for RXFCS.
However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
headers, it is not worthy the effort as the packets are so small that
CHECKSUM_COMPLETE can't save anything.
Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"),
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The driver currently treats static FDB entries as both static and
sticky. This is incorrect and prevents such entries from being roamed to
a different port via learning.
Fix this by configuring static entries with ageing disabled and roaming
enabled.
In net-next we can add proper support for the newly introduced 'sticky'
flag.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using a tc flower action of egress mirred redirect, the driver adds
an implicit FID setting action. This implicit action sets a dummy FID to
the packet and is used as part of a design for trapping unmatched flows
in OVS. While this implicit FID setting action is supposed to be a NOP
when a redirect action is added, in Spectrum-2 the FID record is
consulted as the dummy FID index is an 802.1D FID index and the packet
is dropped instead of being redirected.
Set the dummy FID index value to be within 802.1Q range. This satisfies
both Spectrum-1 which ignores the FID and Spectrum-2 which identifies it
as an 802.1Q FID and will then follow the redirect action.
Fixes: c3ab435466 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an appropriate error in the case when the driver timeouts on waiting
for firmware to go out of PCI reset.
Fixes: 233fa44bd6 ("mlxsw: pci: Implement reset done check")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 PHY layer introduces a calibration period which is a part of the
Spectrum-2 firmware boot process. Hence increase the SW timeout waiting for
the firmware to come out of boot. This does not increase system boot time
in cases where the firmware PHY calibration process is done quickly.
Fixes: c3ab435466 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet should be trapped to the CPU the device consumes a WQE
(work queue element) from an RDQ (receive descriptor queue) and copies
the packet to the address specified in the WQE. The device then tries to
post a CQE (completion queue element) that contains various metadata
(e.g., ingress port) about the packet to a CQ (completion queue).
In case the device managed to consume a WQE, but did not manage to post
the corresponding CQE, it will get stuck. This unlikely situation can be
triggered due to the scheme the driver is currently using to process
CQEs.
The driver will consume up to 512 CQEs at a time and after processing
each corresponding WQE it will ring the RDQ's doorbell, letting the
device know that a new WQE was posted for it to consume. Only after
processing all the CQEs (up to 512), the driver will ring the CQ's
doorbell, letting the device know that new ones can be posted.
Fix this by having the driver ring the CQ's doorbell for every processed
CQE, but before ringing the RDQ's doorbell. This guarantees that
whenever we post a new WQE, there is a corresponding CQE available. Copy
the currently processed CQE to prevent the device from overwriting it
with a new CQE after ringing the doorbell.
Note that the driver still arms the CQ only after processing all the
pending CQEs, so that interrupts for this CQ will only be delivered
after the driver finished its processing.
Before commit 8404f6f2e8 ("mlxsw: pci: Allow to use CQEs of version 1
and version 2") the issue was virtually impossible to trigger since the
number of CQEs was twice the number of WQEs and the number of CQEs
processed at a time was equal to the number of available WQEs.
Fixes: 8404f6f2e8 ("mlxsw: pci: Allow to use CQEs of version 1 and version 2")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Semion Lisyansky <semionl@mellanox.com>
Tested-by: Semion Lisyansky <semionl@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, ndo_tx_timeout callback will be redirected to the TX
reporter in order to detect a TX timeout error and report it to the
devlink health. (The watchdog detects TX timeouts, but the driver verify
the issue still exists before launching any recover method).
In addition, recover from TX timeout in case of lost interrupt was added
to the TX reporter recover method. The TX timeout recover from lost
interrupt is not a new feature in the driver, this patch re-organize the
functionality and move it to the TX reporter recovery flow.
TX timeout example:
(with auto_recover set to false, if set to true, the manual recover and
diagnose sections are irrelevant)
$cat /sys/kernel/debug/tracing/trace
...
devlink_health_report: bus_name=pci dev_name=0000:00:09.0
driver_name=mlx5_core reporter_name=TX: TX timeout on queue: 0, SQ: 0xd8a, CQ:
0x406, SQ Cons: 0x2 SQ Prod: 0x2, usecs since last trans: 13972000
$devlink health diagnose pci/0000:00:09 reporter TX
SQ 0xd8a: HW state: 1, stopped: 1
SQ 0xe44: HW state: 1, stopped: 0
SQ 0xeb4: HW state: 1, stopped: 0
SQ 0xf1f: HW state: 1, stopped: 0
SQ 0xf80: HW state: 1, stopped: 0
SQ 0xfe5: HW state: 1, stopped: 0
$devlink health recover pci/0000:00:09 reporter TX
$devlink health show
pci/0000:00:09.0:
name TX state healthy #err 1 #recover 1 last_dump_ts N/A dump_available false
attributes:
grace_period 500 auto_recover false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add mlx5e tx reporter to devlink health reporters. This reporter will be
responsible for diagnosing, reporting and recovering of TX errors.
This patch declares the TX reporter operations and allocate it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQE only. In addition, it adds diagnose
information for the open SQs.
For a local SQ recover (due to driver error report), in case of SQ recover
failure, the recover operation will be considered as a failure.
For a full TX recover, an attempt to close and open the channels will be
done. If this one passed successfully, it will be considered as a
successful recover.
The SQ recover from error CQE flow is not a new feature in the driver,
this patch re-organize the functions and adapt them for the devlink
health API. For this purpose, move code from en_main.c to a new file
named reporter_tx.c.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unneeded semicolon.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Annotate the rejections in mlxsw_sp_switchdev_vxlan_work_prepare() with
textual reasons.
Because this code ends up being invoked for FDB replay as well, drop the
default message from there, so that the more accurate error message is
not overwritten.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are four sources of VXLAN switchdev notifier calls:
- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().
Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.
To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A follow-up patch will extend vxlan_fdb_replay() with an extack
argument. Extend the fdb_replay callback in mlxsw likewise so that the
argument is ready for the vxlan conversion.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This issue was detected with the help of Coccinelle.
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
Gautier.
2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.
3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
Brivio.
4) Use after free in hns driver, from Yonglong Liu.
5) icmp6_send() needs to handle the case of NULL skb, from Eric
Dumazet.
6) Missing rcu read lock in __inet6_bind() when operating on mapped
addresses, from David Ahern.
7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva.
8) Fix PHY vs r8169 module loading ordering issues, from Heiner
Kallweit.
9) Fix bridge vlan memory leak, from Ido Schimmel.
10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.
11) Infoleak in ipv6_local_error(), flow label isn't completely
initialized. From Eric Dumazet.
12) Handle mv88e6390 errata, from Andrew Lunn.
13) Making vhost/vsock CID hashing consistent, from Zha Bin.
14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.
15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
bnxt_en: Fix context memory allocation.
bnxt_en: Fix ring checking logic on 57500 chips.
mISDN: hfcsusb: Use struct_size() in kzalloc()
net: clear skb->tstamp in bridge forwarding path
net: bpfilter: disallow to remove bpfilter module while being used
net: bpfilter: restart bpfilter_umh when error occurred
net: bpfilter: use cleanup callback to release umh_info
umh: add exit routine for UMH process
isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
vhost/vsock: fix vhost vsock cid hashing inconsistent
net: stmmac: Prevent RX starvation in stmmac_napi_poll()
net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
net: stmmac: Check if CBS is supported before configuring
net: stmmac: dwxgmac2: Only clear interrupts that are active
net: stmmac: Fix PCI module removal leak
tools/bpf: fix bpftool map dump with bitfields
tools/bpf: test btf bitfield with >=256 struct member offset
bpf: fix bpffs bitfield pretty print
net: ethernet: mediatek: fix warning in phy_start_aneg
tcp: change txhash on SYN-data timeout
...
Management Datagram Interface (MAD) is applicable
only when physical port is Infiniband. It makes MAD
command logic to be completely unrelated to eth/core
parts of mlx5.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
When a VLAN is deleted from a bridge port we should not change the PVID
unless the deleted VLAN is the PVID.
Fixes: fe9ccc785d ("mlxsw: spectrum_switchdev: Don't batch VLAN operations")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a VLAN on a port can trigger the offload of a VXLAN tunnel which
is already a member in the VLAN. In case the configuration of the VXLAN
is not supported, the driver would return -EOPNOTSUPP.
This is problematic since bridge code does not interpret this as error,
but rather that it should try to setup the VLAN using the 8021q driver
instead of switchdev.
Fixes: d70e42b22d ("mlxsw: spectrum: Enable VxLAN enslavement to VLAN-aware bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers are not supposed to return errors in switchdev commit phase if
they returned OK in prepare phase. Otherwise, a WARNING is emitted.
However, when the offloading of a VXLAN tunnel is triggered by the
addition of a VLAN on a local port, it is not possible to guarantee that
the commit phase will succeed without doing a lot of work.
In these cases, the artificial division between prepare and commit phase
does not make sense, so simply do the work in the prepare phase.
Fixes: d70e42b22d ("mlxsw: spectrum: Enable VxLAN enslavement to VLAN-aware bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VXLAN is a loadable module, MLXSW_SPECTRUM must not be built-in:
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:2547: undefined
reference to `vxlan_fdb_find_uc'
Add Kconfig dependency to enforce usable configurations.
Fixes: 1231e04f5b ("mlxsw: spectrum_switchdev: Add support for VxLAN encapsulation")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that lag port TX is disabled before mlxsw_sp_port_lag_leave()
is called and prevent from possible EMAD error.
Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removal of the mlxsw driver on Spectrum-2 platforms hits an ASSERT_RTNL()
in Spectrum-2 ACL Bloom filter and in ERP removal paths. This happens
because the multicast router implementation in Spectrum-2 relies on ACLs.
Taking the RTNL lock upon driver removal is useless since the driver first
removes its ports and unregisters from notifiers so concurrent writes
cannot happen at that time. The assertions were originally put as a
reminder for future work involving ERP background optimization, but having
these assertions only during addition serves this purpose as well.
Therefore remove the ASSERT_RTNL() in both places related to ERP and Bloom
filter removal.
Fixes: cf7221a4f5 ("mlxsw: spectrum_router: Add Multicast routing support for Spectrum-2")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When writing to C-TCAM, mlxsw driver uses cregion->ops->entry_insert().
In case of C-TCAM HW insertion error, the opposite action should take
place.
Add error handling case in which the C-TCAM region entry is removed, by
calling cregion->ops->entry_remove().
Fixes: a0a777b940 ("mlxsw: spectrum_acl: Start using A-TCAM")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already need to zero out memory for dma_alloc_coherent(), as such
using dma_zalloc_coherent() is superflous. Phase it out.
This change was generated with the following Coccinelle SmPL patch:
@ replace_dma_zalloc_coherent @
expression dev, size, data, handle, flags;
@@
-dma_zalloc_coherent(dev, size, handle, flags)
+dma_alloc_coherent(dev, size, handle, flags)
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: re-ran the script on the latest tree]
Signed-off-by: Christoph Hellwig <hch@lst.de>
pci_{,un}map_sg are deprecated and replaced by dma_{,un}map_sg. This is
especially relevant since the rest of the driver uses the DMA API. Fix
the driver to use the replacement APIs.
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch solves a crash at the time of mlx4 driver unload or system
shutdown. The crash occurs because dma_alloc_coherent() returns one
value in mlx4_alloc_icm_coherent(), but a different value is passed to
dma_free_coherent() in mlx4_free_icm_coherent(). In turn this is because
when allocated, that pointer is passed to sg_set_buf() to record it,
then when freed it is re-calculated by calling
lowmem_page_address(sg_page()) which returns a different value. Solve
this by recording the value that dma_alloc_coherent() returns, and
passing this to dma_free_coherent().
This patch is roughly equivalent to commit 378efe798e ("RDMA/hns: Get
rid of page operation after dma_alloc_coherent").
Based-on-code-from: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting to
get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5,
qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use write()
for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into
a big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and sysfs
files
- First steps toward removing 'uobject' from the view of the drivers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlwhV2oACgkQOG33FX4g
mxpF8A/9EkRCg6wCDC59maA53b5PjuNmD//9hXbycQPQSlxntI2PyYtxrzBqc0+2
yIaFFMehL41XNN6y1zfkl7ndl62McCH2TpiidU8RyTxVw/e3KsDD5sU6++atfHRo
M82RNfedDtxPG8TcCPKVLof6JHADApGSR1r4dCYfAnu7KFMyvlLmeYyx4r/2E6yC
iQPmtKVOdbGkuWGeX+brGEA0vg7FUOAvaysnxddjyh9hyem4h0SUR3Af/Ik0N5ME
PYzC+hMKbkPVBLoCWyg7QwUaqK37uWwguMQLtI2byF7FgbiK/lBQt6TsidR4Fw3p
EalL7uqxgCTtLYh918vxLFjdYt6laka9j7xKCX8M8d06sy/Lo8iV4hWjiTESfMFG
usqs7D6p09gA/y1KISji81j6BI7C92CPVK2drKIEnfyLgY5dBNFcv9m2H12lUCH2
NGbfCNVaTQVX6bFWPpy2Bt2y/Litsfxw5RviehD7jlG0lQjsXGDkZzsDxrMSSlNU
S79iiTJyK4kUZkXzrSSlN58pLBlbupJwm5MDjKmM+irsrsCHjGIULvc902qtnC3/
8ImiTtW6XvqLbgWXyy2Th8/ZgRY234p1ybhog+DFaGKUch0XqB7VXTV2OZm0GjcN
Fp4PUeBt+/gBgYqjpuffqQc1rI4uwXYSoz7wq9RBiOpw5zBFT1E=
=T0p1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting
to get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
mlx5, qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use
write() for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into a
big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page
tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and
sysfs files
- First steps toward removing 'uobject' from the view of the drivers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
RDMA/srpt: Use kmem_cache_free() instead of kfree()
RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
IB/uverbs: Signedness bug in UVERBS_HANDLER()
IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
IB/umad: Start using dev_groups of class
IB/umad: Use class_groups and let core create class file
IB/umad: Refactor code to use cdev_device_add()
IB/umad: Avoid destroying device while it is accessed
IB/umad: Simplify and avoid dynamic allocation of class
IB/mlx5: Fix wrong error unwind
IB/mlx4: Remove set but not used variable 'pd'
RDMA/iwcm: Don't copy past the end of dev_name() string
IB/mlx5: Fix long EEH recover time with NVMe offloads
IB/mlx5: Simplify netdev unbinding
IB/core: Move query port to ioctl
RDMA/nldev: Expose port_cap_flags2
IB/core: uverbs copy to struct or zero helper
IB/rxe: Reuse code which sets port state
IB/rxe: Make counters thread safe
IB/mlx5: Use the correct commands for UMEM and UCTX allocation
...
Drop LIST_HEAD where the variable it declares has never
been used.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: c82e9aa0a8 ("mlx4_core: resource tracking for HCA resources used by guests")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop LIST_HEAD where the variable it declares is never used.
The uses were removed in 244cd96adb ("net_sched: remove
list_head from tc_action"), but not the declaration.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 244cd96adb ("net_sched: remove list_head from tc_action")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop LIST_HEAD where the variable it declares is never used.
These became useless in 244cd96adb ("net_sched: remove list_head
from tc_action")
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 244cd96adb ("net_sched: remove list_head from tc_action")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ethtool private flag 'xdp_tx_mpwqe' to control the feature
from userspace.
Feature is set ON by default, if supported.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add support for the HW feature of multi-packet WQE in XDP
xmit flow.
The conventional TX descriptor (WQE, Work Queue Element) serves
a single packet. Our HW has support for multi-packet WQE (MPWQE)
in which a single descriptor serves multiple TX packets.
This reduces both the PCI overhead and the CPU cycles wasted on
writing them.
In this patch we add support for the HW feature, which is supported
starting from ConnectX-5.
Performance:
Tested packet rate for UDP 64Byte multi-stream over ConnectX-5 NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
XDP_TX:
We see a huge gain on single port ConnectX-5, and reach the 100 Mpps
milestone.
* Single-port HCA:
Before: 70 Mpps
After: 100 Mpps (+42.8%)
* Dual-port HCA:
Before: 51.7 Mpps
After: 57.3 Mpps (+10.8%)
* In both cases we tested traffic on one port and for now On Dual-port HCAs
we see only small gain, we are working to overcome this bottleneck, but
for the moment only with experimental firmware on dual port HCAs we can
reach the wanted numbers as seen on Single-port HCAs.
XDP_REDIRECT:
Redirect from (A) ConnectX-5 to (B) ConnectX-5.
Due to a setup limitation, (A) and (B) are on different NUMA nodes,
so absolute performance numbers are not optimal.
Note:
Below is the transmit rate of (B), not the redirect rate of (A)
which is in some cases higher.
* (B) is single-port:
Before: 77 Mpps
After: 90 Mpps (+16.8%)
* (B) is dual-port:
Before: 61 Mpps
After: 72 Mpps (+18%)
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Each xdp_wqe_info instance describes the number of data-segments
and WQEBBs of the WQE.
This is useful for a downstream patch that adds support for
Multi-Packet TX WQE feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This provides infrastructure to have multiple xdp_info instances
for the same consumer index.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of calculating the control segment to be used upon an
XDP xmit doorbell, save it in SQ structure.
Nullify when no pending doorbell.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Do not ignore the CQE opcode.
This helps expose issues and debug them.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Do not maintain an SQ state bit to indicate whether an
XDP SQ serves redirect operations.
Instead, rely on the fact that such an XDP SQ doesn't reside
in an RQ instance, while the others do.
This info is not known to the XDP SQ functions themselves,
and they rely on their callers to distinguish between the cases.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
At the end of the RQ polling loop, some XDP-related operations
might be required. Before checking them one by one, check if
an XDP program is even loaded.
Combine all the checks and operations in a single function in xdp files.
This saves unnecessary checks for non-XDP flows.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The opcode indicates about the error reason.
Printing it helps in debug.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This series adds some misc updates and the support for tunnels over VLAN
tc offloads.
From Miroslav Lichvar, patches #1,2
1) Update timecounter at least twice per counter overflow
2) Extend PTP gettime function to read system clock
From Gavi Teitz, patch #3
3) Increase VF representors' SQ size to 128
From Eli Britstein and Or Gerlitz, patches #4-10
4) Adds the capability to support tunnels over VLAN device.
Patch 4 avoids crash for TC flow with egress upper devices
Patch 5 refactors tunnel routing devs into a helper function
Patch 6 avoids crash for TC encap flows with vlan on underlay
Patches 7-8 refactor encap tunnel header preparing code.
Patch 9 adds support for building VLAN tagged ETH header.
Patch 10 adds support for tunnel routing to VLAN device.
From Aviv, patches 11,12 to fix earlier VF lag series
5) Fix query_nic_sys_image_guid() error during init
6) Fix LAG requirement when CONFIG_MLX5_ESWITCH is off
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcG5O8AAoJEEg/ir3gV/o+v0AH/ja1bJ6PkAlw2bCRMIuI6HK/
1vh9n8D74tIOAlvUi6QiLOgJ7CLdAgiAVFppDWzJSBUsY2XNAtRdYvLDDt9hGafO
QGysBWNVcX2aUVp+pLDCCVEYBWyyIzW416CWHx2IUgdAg9S6cvJK6P/81wd+l4Zp
8YlPstEUANP/JKZxHHjSelMcnY+Bj0JrDzuyyyaQdwmcHo5I7Ht0tNex14yFfFbl
gd7YvfPr1PtovPX2w9hMt3y3ml4mommB+jtxc0+59D+A7680hBOpAnH5pONms9rv
OKKKV8KqpxL1m32/TGbd7XSG+llLJwsSM6RtD4oUdiPA4iCiVDVdreP5vac52C4=
=0QQ5
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2018-12-19
This series adds some misc updates and the support for tunnels over VLAN
tc offloads.
From Miroslav Lichvar, patches #1,2
1) Update timecounter at least twice per counter overflow
2) Extend PTP gettime function to read system clock
From Gavi Teitz, patch #3
3) Increase VF representors' SQ size to 128
From Eli Britstein and Or Gerlitz, patches #4-10
4) Adds the capability to support tunnels over VLAN device.
Patch 4 avoids crash for TC flow with egress upper devices
Patch 5 refactors tunnel routing devs into a helper function
Patch 6 avoids crash for TC encap flows with vlan on underlay
Patches 7-8 refactor encap tunnel header preparing code.
Patch 9 adds support for building VLAN tagged ETH header.
Patch 10 adds support for tunnel routing to VLAN device.
From Aviv, patches 11,12 to fix earlier VF lag series
5) Fix query_nic_sys_image_guid() error during init
6) Fix LAG requirement when CONFIG_MLX5_ESWITCH is off
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
VID 1 is not reserved anymore, so remove the check that prevented the
creation of VLAN devices with this VID over mlxsw ports.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to abuse VID 1 anymore and we can instead use VID 4095
as the default VLAN, which will be configured on the port throughout its
lifetime.
The OVS join / leave functions are changed to enable VIDs 1-4094
(inclusive) instead of 2-4095. This because VID 4095 is now the default
VLAN instead of 1.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN entries on a port can be associated with either a bridge VLAN or a
router port. Before the VLAN entry is destroyed these associations need
to be cleaned up.
Currently, this is always invoked from the function which destroys the
VLAN entry, but next patch is going to skip the destruction of the
default entry when a port in unlinked from a LAG.
The above does not mean that the associations should not be cleaned up,
so add a helper that will be invoked from both call sites.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subsequent patches will need to access the default port VLAN. Since this
VLAN will exist throughout the lifetime of the port, simply store it in
the port's struct.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function allows flushing all the existing VLAN entries on a port. It
is invoked when a port is destroyed and when it is unlinked from a LAG.
In the latter case, when moving to the new default VLAN, there will not
be a need to destroy the default VLAN entry.
Therefore, add an argument that allows to control whether the default
port VLAN should be destroyed or not. Currently it is always set to
'true'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the driver does not set the port's PVID when initializing a
new port. This is because the driver is using VID 1 as PVID which is the
firmware default.
Subsequent patches are going to change the PVID the driver is setting
when initializing a new port.
Prepare for that by explicitly setting the port's PVID.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subsequent patches are going to replace the current default VID (1) with
VLAN_N_VID - 1 (4095).
Prepare for this conversion by replacing the hard-coded '1' with a
define.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In symmetric routing, the only two members in the VLAN corresponding to
the L3 VNI are the router port and the VXLAN tunnel.
In case the VXLAN device is already enslaved to the bridge and only
later the VLAN interface is configured, the tunnel will not be
offloaded.
The reason for this is that when the router interface (RIF)
corresponding to the VLAN interface is configured, it calls the core
fid_get() API which does not check if NVE should be enabled on the FID.
Instead, call into the bridge code which will check if NVE should be
enabled on the FID.
This effectively means that the same code path is used to retrieve a FID
when either a local port or a router port joins the FID.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
mlx5 updates taken for dependencies on following patches.
* branche 'mlx5-next': (23 commits)
IB/mlx5: Introduce uid as part of alloc/dealloc transport domain
net/mlx5: Add shared Q counter bits
net/mlx5: Continue driver initialization despite debugfs failure
net/mlx5: Fold the modify lag code into function
net/mlx5: Add lag affinity info to log
net/mlx5: Split the activate lag function into two routines
net/mlx5: E-Switch, Introduce flow counter affinity
IB/mlx5: Unify e-switch representors load approach between uplink and VFs
net/mlx5: Use lowercase 'X' for hex values
net/mlx5: Remove duplicated include from eswitch.c
net/mlx5: Remove the get protocol device interface entry
net/mlx5: Support extended destination format in flow steering command
net/mlx5: E-Switch, Change vhca id valid bool field to bit flag
net/mlx5: Introduce extended destination fields
net/mlx5: Revise gre and nvgre key formats
net/mlx5: Add monitor commands layout and event data
net/mlx5: Add support for plugged-disabled cable status in PME
net/mlx5: Add support for PCIe power slot exceeded error in PME
net/mlx5: Rework handling of port module events
net/mlx5: Move flow counters data structures from flow steering header
...
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_MLX5_ESWITCH is not defined, test for SR-IOV being disabled,
instead of calling e-switch LAG prereq routine.
Since LAG with SRIOV is allowed only when switchdev mode is on.
Fixes: eff849b2c6 ("net/mlx5: Allow/disallow LAG according to pre-req only")
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
vport system image guid should be queried using vport nic API for
Ethernet ports, and vport hca API for Infiniband ports.
Fixes: fadd59fc50 ("net/mlx5: Introduce inter-device communication mechanism")
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Generate encap header depending on the routed device to support
native/tagged Ethernet header.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Support generation of native or tagged Ethernet header for encap
header, depending on provided net device.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Change the order to first route IPv4/6 and return if error. Only after
successful route continue to allocate an encap header, with no
functional change.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In tunnel encap we prepare the encap header for IPv4/6 cases, in two
separate functions. For ETH header generation the code is almost
duplicated.
Move the ETH header generation code from IPv4/6 functions to a helper
function, with no functional change.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently we don't support nor fail attempts to offload encap flows routed
to vlan device on the underlay network. We wrongly consider a vlan underlay
device to be on the same e-switch b/c the switchdev ID is retrieved recursively.
Add explicit check for that and fail such attempts.
Also align to a more strict check for the ingress and the underlay devices
to practically be on the same eswitch.
Fixes: ce99f6b97f ('net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels')
Fixes: 3e621b19b0 ('net/mlx5e: Support TC encapsulation offloads with upper devices')
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
For tunnel we determine the output devs for IPv4/6 cases, in two
separate functions, with a duplicated code.
Move that code from IPv4/6 functions to a helper function, with no
functional change.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
We use the switchdev parent HW id helper to identify if the mirred device
shares the same ASIC/port with the ingress device. This can get us wrong
in the presence of upper devices such as vlan or bridge set over the HW
devices (VF or uplink representors), b/c the switchdev ID is retrieved
recursively.
To fail offload attempts in such cases, we condition the check on the
egress device to have not only the same switchdev ID but also the relevant
mlx5 netdev ops.
Fixes: 03a9d11e6e ('net/mlx5e: Add TC drop and mirred/redirect action parsing for SRIOV offloads')
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
There are cases (e.g tunneling with vlan on underlay and potentially
more) where this makes sense, so allow that.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The default size for the VF representors' SQ was too small to handle high
packet rates. Doubling the size from 64 to 128 drastically improves the
packet rate under stress (by about 50%), whereas increasing the size
beyond 128 has not shown to make any further difference.
The impact of the SQ size was measured with UDP traffic, in the following
topology: TG <-> PF <-> TC forwarding <-> VF representor <-> VF in VM
over a single core processing bi-directional traffic, with the following
results:
SQ size of 64: SQ size of 128:
Packet rate for 64B UDP packets: 860 [Kpps] 1280 [Kpps]
Packet rate for 114B VxLan
encapsulated UDP packets: 320 [Kpps] 500 [Kpps]
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Read the system time right before and immediately after reading the low
register of the internal timer. This adds support for the
PTP_SYS_OFFSET_EXTENDED ioctl.
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The timecounter needs to be updated at least once in half of the
cyclecounter interval to prevent timecounter_cyc2time() interpreting a
new timestamp as an old value and causing a backward jump.
This would be an issue if the timecounter multiplier was so small that
the update interval would not be limited by the 64-bit overflow in
multiplication.
Shorten the calculated interval to make sure the timecounter is updated
in time even when the system clock is slowed down by up to 10%, the
multiplier is increased by up to 10%, and the scheduled overflow check
is late by 15%.
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ariel Levkovich <lariel@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Expression terminated with "," instead of ";", resulted in
set_fte getting bad value for modify_enable_mask field.
Fixes: bd5251dbf1 ("net/mlx5_core: Introduce flow steering destination of type counter")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>