Avoid exiting the function with a lingering sysfs file (if the first
call to device_create_file() fails while the second succeeds), and avoid
calling devlink_port_unregister() twice.
In other words, either mlx4_init_port_info() succeeds and returns zero, or
it fails, returns non-zero, and requires no cleanup.
Fixes: 096335b3f9 ("mlx4_core: Allow dynamic MTU configuration for IB ports")
Signed-off-by: Tarick Bedeir <tarick@google.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJa9NFvAAoJEEg/ir3gV/o+vNoIAM/5zfT9f2iu6uNADcrFFfYY
bdHY/psg9qDAjqZxmFYezcwdOrPY5GxLn+1VVZPLfwCir/qupTOO2skLRyAKEau0
uKSP45LD6E+M0Sew+15//sEB3J2JzcjJsNd61lzdl+3GKT/Nr/ZGY0K8iFXItdc3
Ye/vsL1IRNaosl4dnAGzOylGeit2VeUkmS/JrFRVqFjVLu78zxEuLHdnIZApt+4W
lwpLnsplhUbPk6lwHNNureSuzQq4SXMLWIB+v1uxzHOSSZT8nkrr4/ew/BYHp1oo
EteRykl4x3SkDbPcTBeElvpb52nduC6jgn8auVXOY9XZDmqX+rxhfBh3fBVFYB8=
=FwjF
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2018-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2018-05-10
the following series includes some fixes for mlx5 core driver.
Please pull and let me know if there's any problem.
For -stable v4.5
("net/mlx5: E-Switch, Include VF RDMA stats in vport statistics")
For -stable v4.10
("net/mlx5e: Err if asked to offload TC match on frag being first")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Resources are not freed in the reverse order of the allocation.
Labels are also mixed-up.
Fix it and reorder code and labels in the error handling path of
'mlxsw_core_bus_device_register()'
Fixes: ef3116e540 ("mlxsw: spectrum: Register KVD resources with devlink")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HW doesn't support matching on frag first/later, return error if we are
asked to offload that.
Fixes: 3f7d0eb42d ("net/mlx5e: Offload TC matching on packets being IP fragments")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The host side reporting of VF vport statistics didn't include the VF
RDMA traffic.
Fixes: 3b751a2a41 ("net/mlx5: E-Switch, Introduce get vf statistics")
Signed-off-by: Adi Nissim <adin@mellanox.com>
Reported-by: Ariel Almog <ariela@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Some platforms require IRQs to be free'd in the shutdown path. Otherwise
they will fail to be reallocated after a kexec.
Fixes: 8812c24d28 ("net/mlx5: Add fast unload support in shutdown flow")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.
So, doing some explicit kfree in the error handling path would lead to
some double kfree.
Simplify code to avoid such a case.
Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add check of coalescing parameters received through ethtool are within
range of values supported by the HW.
Driver gets the coalescing rx/tx-usecs and rx/tx-frames as set by the
users through ethtool. The ethtool support up to 32 bit value for each.
However, mlx4 modify cq limits the coalescing time parameter and
coalescing frames parameters to 16 bits.
Return out of range error if user tries to set these parameters to
higher values.
Change type of sample-interval and adaptive_rx_coal parameters in mlx4
driver to u32 as the ethtool holds them as u32 and these parameters are
not limited due to mlx4 HW.
Fixes: c27a02cd94 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial fix to spelling mistake in netdev_err error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trivial fix to spelling mistake in mlx4_warn message.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJa4ixLAAoJEEg/ir3gV/o+PB0IAKlexiYEyID3N3cvG/I0m4Zh
yxIN4H8sUu3kz7dG6dkvpL5Oo5c43/huE+tspOYfFmydMWvOV+DakwqAhkE+KUfe
VIpF0cM7+T8Els8e7OzuT0Zu5ggeN1wU0uPRhAn1F592BH0ppXGBn8WIKTffYb8c
5XeB2JSZyw4yMgk1zurm4tVtFvHbYO7SAkLZZG5E0m7EeIujVWi1lTnhXNi9zdm/
48LoLZ/1Rmx0e/Qpey2fm9HEPRPTgCNSBLEsx2hIDiJG56YyWPH6+N7U9Acf2PaI
lKu2JMqLYe3du8hhPtCbPYH0i74af/LbNCQgXXgPstAI49v+MydlBcv2NX9J6NY=
=nw/Q
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2018-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2018-04-26
This pull request includes fixes for mlx5 core and netdev driver.
Please pull and let me know if there's any problems.
For -stable v4.12
net/mlx5e: TX, Use correct counter in dma_map error flow
For -stable v4.13
net/mlx5: Avoid cleaning flow steering table twice during error flow
For -stable v4.14
net/mlx5e: Allow offloading ipv4 header re-write for icmp
For -stable v4.15
net/mlx5e: DCBNL fix min inline header size for dscp
For -stable v4.16
net/mlx5: Fix mlx5_get_vector_affinity function
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When IGMP snooping is enabled on a bridge, traffic forwarded by an MDB
entry should be sent to both ports member in the MDB's ports list and
mrouter ports.
In case a port needs to be removed from an MDB's ports list, but this
port is also configured as an mrouter port, then do not update the
device so that it will continue to forward traffic through that port.
Fix a copy-paste error that checked that IGMP snooping is enabled twice
instead of checking the port's mrouter state.
Fixes: ded711c87a ("mlxsw: spectrum_switchdev: Consider mrouter status for mdb changes")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Colin King <colin.king@canonical.com>
Reviewed-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deleting a flow counter, the modify mask should be the action and
the flow counter. Otherwise the flow counter is not deleted and we'll
get a firmware warning when deleting the remaining destinations on the
same FTE.
It only happens in the presence of flow counter and multiple vport
destinations. If there is only one vport destination, there is no
need to update the FTE when deleting the only vport destination,
we just delete the FTE.
Fixes: ae05831424 ("net/mlx5: Add option to add fwd rule with counter")
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
After the cited commit, WQE RQ size is calculated based on sw_mtu but it
was not set for representors. This commit fixes that.
Fixes: 472a1e44b3 ("net/mlx5e: Save MTU in channels params")
Signed-off-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When we fail to initialize the RX root namespace, we need
to clean only that and not the entire flow steering.
Currently the code may try to clean the flow steering twice
on error witch leads to null pointer deference.
Make sure we clean correctly.
Fixes: fba53f7b57 ("net/mlx5: Introduce mlx5_flow_steering structure")
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In case of a dma_mapping_error, do not use wi->num_dma
as a parameter for dma unmap function because it's yet
to be set, and holds an out-of-date value.
Use actual value (local variable num_dma) instead.
Fixes: 34802a42b3 ("net/mlx5e: Do not modify the TX SKB")
Fixes: e586b3b0ba ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the trust state is set to dscp and the netdev is down, the inline
header size is not updated. When netdev is up, the inline header size
stays at L2 instead of IP.
Fix this issue by updating the private parameter when the netdev is in
down so that when netdev is up, it picks up the right header size.
Fixes: fbcb127e89 ("net/mlx5e: Support DSCP trust state ...")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
For ICMPv4, the checksum is calculated from the ICMP headers and data.
Since the ICMPv4 checksum doesn't cover the IP header, we can allow to
do L3 header re-write for this protocol.
Fixes: bdd66ac0ae ('net/mlx5e: Disallow TC offloading of unsupported match/action combinations')
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Pull networking fixes from David Miller:
1) The sockmap code has to free socket memory on close if there is
corked data, from John Fastabend.
2) Tunnel names coming from userspace need to be length validated. From
Eric Dumazet.
3) arp_filter() has to take VRFs properly into account, from Miguel
Fadon Perlines.
4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.
5) Missing idr_remove() in u32_delete_key(), from Cong Wang.
6) More syzbot stuff. Several use of uninitialized value fixes all
over, from Eric Dumazet.
7) Do not leak kernel memory to userspace in sctp, also from Eric
Dumazet.
8) Discard frames from unused ports in DSA, from Andrew Lunn.
9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
Falcon.
10) Do not access dp83640 PHY registers prematurely after reset, from
Esben Haabendal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vhost-net: set packet weight of tx polling to 2 * vq size
net: thunderx: rework mac addresses list to u64 array
inetpeer: fix uninit-value in inet_getpeer
dp83640: Ensure against premature access to PHY registers after reset
devlink: convert occ_get op to separate registration
ARM: dts: ls1021a: Specify TBIPA register address
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
ibmvnic: Do not reset CRQ for Mobility driver resets
ibmvnic: Fix failover case for non-redundant configuration
ibmvnic: Fix reset scheduler error handling
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix DMA mapping mistakes
tipc: use the right skb in tipc_sk_fill_sock_diag()
sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
net: dsa: Discard frames from unused ports
sctp: do not leak kernel memory to user space
soreuseport: initialise timewait reuseport field
ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
dccp: initialize ireq->ir_mark
net: fix uninit-value in __hw_addr_add_ex()
...
This resolves race during initialization where the resources with
ops are registered before driver and the structures used by occ_get
op is initialized. So keep occ_get callbacks registered only when
all structs are initialized.
The example flows, as it is in mlxsw:
1) driver load/asic probe:
mlxsw_core
-> mlxsw_sp_resources_register
-> mlxsw_sp_kvdl_resources_register
-> devlink_resource_register IDX
mlxsw_spectrum
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
2) reload triggered by devlink command:
-> mlxsw_devlink_core_bus_device_reload
-> mlxsw_sp_fini
-> mlxsw_sp_kvdl_fini
-> devlink_resource_occ_get_unregister IDX
(struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
which is using mlxsw_sp would cause use-after free)
-> mlxsw_sp_init
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
Fixes: d9f9b9a4d0 ("devlink: Add support for resource abstraction")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlrHeY8UHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxhLRAAndV/0NDyWZU0eZNM6twri2SEFnF7
E4ar+YthxDxxJG4TLJbIA12jc5NgHZy4WuttDa6Jb99KreBXIHJFlNi/V/tme6zf
+yXUuxWae7wJzBiaay57VqLGSc80gt/LTgjLa1siwQqjTbO3wSXR6JJXNaE9FtQ4
/jL61t8bD1Peb5cWTpt9p0hrnKI0/pHwASdReyFS4F/HDKdvpof7BxE/OU3HSxxA
XKC2v6RjY4S93vkzvApDXQ+vhKquVRK7/ojyTXQUO/GIzcARprO7H4k62N4ar0x/
qbXLkR8IMkwA8ecsNmcL92ftb/cXoHfd+wdK8WpijqzF4kW4SdteVWbIhUzI0gbr
0gjDYIzjplvH3pZGv/qvx+8sFtAP95OdPjuAAW2qJ9TCVfmiS8naNFCvcxg87RhD
gjyQD3If1X7F8wy309lhq7VNyRexTHgIMgTXHyFvuZMzn/Qe1huL2XCwDcEAg/OX
AvU2iuSE5tWAh7gIUMF/aWi3uoeJUyyoru5ZR//gqdFfx9YxpSimO1UDXnpPi8SR
Iz/jzHJc0aWGYdQ9l6HiSbJF3P/QQcWYs9igt0A7BRGB05SPdWCh7sSO70FJa8ME
f4WID5/qEiaH26kiSRX4cUqpc8Amk8bT0DXw2OT57qy3JM0ZdV5ENQX11pSpr9hv
uLEf0DU7AEmdvzQ=
=T++R
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- move pci_uevent_ers() out of pci.h (Michael Ellerman)
- skip ASPM common clock warning if BIOS already configured it (Sinan
Kaya)
- fix ASPM Coverity warning about threshold_ns (Gustavo A. R. Silva)
- remove last user of pci_get_bus_and_slot() and the function itself
(Sinan Kaya)
- add decoding for 16 GT/s link speed (Jay Fang)
- add interfaces to get max link speed and width (Tal Gilboa)
- add pcie_bandwidth_capable() to compute max supported link bandwidth
(Tal Gilboa)
- add pcie_bandwidth_available() to compute bandwidth available to
device (Tal Gilboa)
- add pcie_print_link_status() to log link speed and whether it's
limited (Tal Gilboa)
- use PCI core interfaces to report when device performance may be
limited by its slot instead of doing it in each driver (Tal Gilboa)
- fix possible cpqphp NULL pointer dereference (Shawn Lin)
- rescan more of the hierarchy on ACPI hotplug to fix Thunderbolt/xHCI
hotplug (Mika Westerberg)
- add support for PCI I/O port space that's neither directly accessible
via CPU in/out instructions nor directly mapped into CPU physical
memory space. This is fairly intrusive and includes minor changes to
interfaces used for I/O space on most platforms (Zhichang Yuan, John
Garry)
- add support for HiSilicon Hip06/Hip07 LPC I/O space (Zhichang Yuan,
John Garry)
- use PCI_EXP_DEVCTL2_COMP_TIMEOUT in rapidio/tsi721 (Bjorn Helgaas)
- remove possible NULL pointer dereference in of_pci_bus_find_domain_nr()
(Shawn Lin)
- report quirk timings with dev_info (Bjorn Helgaas)
- report quirks that take longer than 10ms (Bjorn Helgaas)
- add and use Altera Vendor ID (Johannes Thumshirn)
- tidy Makefiles and comments (Bjorn Helgaas)
- don't set up INTx if MSI or MSI-X is enabled to align cris, frv,
ia64, and mn10300 with x86 (Bjorn Helgaas)
- move pcieport_if.h to drivers/pci/pcie/ to encapsulate it (Frederick
Lawler)
- merge pcieport_if.h into portdrv.h (Bjorn Helgaas)
- move workaround for BIOS PME issue from portdrv to PCI core (Bjorn
Helgaas)
- completely disable portdrv with "pcie_ports=compat" (Bjorn Helgaas)
- remove portdrv link order dependency (Bjorn Helgaas)
- remove support for unused VC portdrv service (Bjorn Helgaas)
- simplify portdrv feature permission checking (Bjorn Helgaas)
- remove "pcie_hp=nomsi" parameter (use "pci=nomsi" instead) (Bjorn
Helgaas)
- remove unnecessary "pcie_ports=auto" parameter (Bjorn Helgaas)
- use cached AER capability offset (Frederick Lawler)
- don't enable DPC if BIOS hasn't granted AER control (Mika Westerberg)
- rename pcie-dpc.c to dpc.c (Bjorn Helgaas)
- use generic pci_mmap_resource_range() instead of powerpc and xtensa
arch-specific versions (David Woodhouse)
- support arbitrary PCI host bridge offsets on sparc (Yinghai Lu)
- remove System and Video ROM reservations on sparc (Bjorn Helgaas)
- probe for device reset support during enumeration instead of runtime
(Bjorn Helgaas)
- add ACS quirk for Ampere (née APM) root ports (Feng Kan)
- add function 1 DMA alias quirk for Marvell 88SE9220 (Thomas
Vincent-Cross)
- protect device restore with device lock (Sinan Kaya)
- handle failure of FLR gracefully (Sinan Kaya)
- handle CRS (config retry status) after device resets (Sinan Kaya)
- skip various config reads for SR-IOV VFs as an optimization
(KarimAllah Ahmed)
- consolidate VPD code in vpd.c (Bjorn Helgaas)
- add Tegra dependency on PCI_MSI_IRQ_DOMAIN (Arnd Bergmann)
- add DT support for R-Car r8a7743 (Biju Das)
- fix a PCI_EJECT vs PCI_BUS_RELATIONS race condition in Hyper-V host
bridge driver that causes a general protection fault (Dexuan Cui)
- fix Hyper-V host bridge hang in MSI setup on 1-vCPU VMs with SR-IOV
(Dexuan Cui)
- fix Hyper-V host bridge hang when ejecting a VF before setting up MSI
(Dexuan Cui)
- make several structures static (Fengguang Wu)
- increase number of MSI IRQs supported by Synopsys DesignWare bridges
from 32 to 256 (Gustavo Pimentel)
- implemented multiplexed IRQ domain API and remove obsolete MSI IRQ
API from DesignWare drivers (Gustavo Pimentel)
- add Tegra power management support (Manikanta Maddireddy)
- add Tegra loadable module support (Manikanta Maddireddy)
- handle 64-bit BARs correctly in endpoint support (Niklas Cassel)
- support optional regulator for HiSilicon STB (Shawn Guo)
- use regulator bulk API for Qualcomm apq8064 (Srinivas Kandagatla)
- support power supplies for Qualcomm msm8996 (Srinivas Kandagatla)
* tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (123 commits)
MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver
HISI LPC: Add ACPI support
ACPI / scan: Do not enumerate Indirect IO host children
ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use
HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings
of: Add missing I/O range exception for indirect-IO devices
PCI: Apply the new generic I/O management on PCI IO hosts
PCI: Add fwnode handler as input param of pci_register_io_range()
PCI: Remove __weak tag from pci_register_io_range()
MAINTAINERS: Add missing /drivers/pci/cadence directory entry
fm10k: Report PCIe link properties with pcie_print_link_status()
net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth
net/mlx5: Report PCIe link properties with pcie_print_link_status()
net/mlx4_core: Report PCIe link properties with pcie_print_link_status()
PCI: Add pcie_print_link_status() to log link speed and whether it's limited
PCI: Add pcie_bandwidth_available() to compute bandwidth available to device
misc: pci_endpoint_test: Handle 64-bit BARs properly
PCI: designware-ep: Make dw_pcie_ep_reset_bar() handle 64-bit BARs properly
PCI: endpoint: Make sure that BAR_5 does not have 64-bit flag set when clearing
PCI: endpoint: Make epc->ops->clear_bar()/pci_epc_clear_bar() take struct *epf_bar
...
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back). This
series addresses an IRQ performance issue (patch 1), cleanups related to
the fix for the IRQ performance problem (patches 2-6), and then extends
the fragmented completion queue support that already exists in the net
side of the driver to the ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the remaining
10 patches to the IB subsystem. This series extends the current
'representor' framework when the mlx5 driver is in switchdev mode from
being a netdev only construct to being a netdev/IB dev construct. The IB
dev is limited to raw Eth queue pairs only, but by having an IB dev of
this type attached to the representor for a switchdev port, it enables
DPDK to work on the switchdev device.
* All net related, but needed as infrastructure for the rdma driver
- Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users need to
set the port for ib_srpt to listen on in configfs in order for it to be
enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while working on new
code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
user API features
- Infrastructure updates for the new IOCTL interface, based on increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJax5Z0AAoJEDht9xV+IJsacCwQAJBIgmLCvVp5fBu2kJcXMMVI
y3l2YNzAUJvDDKv1r5yTC9ugBXEkDtgzi/W/C2/5es2yUG/QeT/zzQ3YPrtsnN68
5FkiXQ35Tt7+PBHMr0cacGRmF4M3Td3MeW0X5aJaBKhqlNKwA+aF18pjGWBmpVYx
URYCwLb5BZBKVh4+1Leebsk4i0/7jSauAqE5M+9notuAUfBCoY1/Eve3DipEIBBp
EyrEnMDIdujYRsg4KHlxFKKJ1EFGItknLQbNL1+SEa0Oe0SnEl5Bd53Yxfz7ekNP
oOWQe5csTcs3Yr4Ob0TC+69CzI71zKbz6qPDILTwXmsPFZJ9ipJs4S8D6F7ra8tb
D5aT1EdRzh/vAORPC9T3DQ3VsHdvhwpUMG7knnKrVT9X/g7E+gSji1BqaQaTr/xs
i40GepHT7lM/TWEuee/6LRpqdhuOhud7vfaRFwn2JGRX9suqTcvwhkBkPUDGV5XX
5RkHcWOb/7KvmpG7S1gaRGK5kO208LgmAZi7REaJFoZB74FqSneMR6NHIH07ha41
Zou7rnxV68CT2bgu27m+72EsprgmBkVDeEzXgKxVI/+PZ1oadUFpgcZ3pRLOPWVx
rEqjHu65rlA/YPog4iXQaMfSwt/oRD3cVJS/n8EdJKXi4Qt2RDDGdyOmt74w4prM
QuLEdvJIFmwrND1KDoqn
=Ku8g
-----END PGP SIGNATURE-----
Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Doug and I are at a conference next week so if another PR is sent I
expect it to only be bug fixes. Parav noted yesterday that there are
some fringe case behavior changes in his work that he would like to
fix, and I see that Intel has a number of rc looking patches for HFI1
they posted yesterday.
Parav is again the biggest contributor by patch count with his ongoing
work to enable container support in the RDMA stack, followed by Leon
doing syzkaller inspired cleanups, though most of the actual fixing
went to RC.
There is one uncomfortable series here fixing the user ABI to actually
work as intended in 32 bit mode. There are lots of notes in the commit
messages, but the basic summary is we don't think there is an actual
32 bit kernel user of drivers/infiniband for several good reasons.
However we are seeing people want to use a 32 bit user space with 64
bit kernel, which didn't completely work today. So in fixing it we
required a 32 bit rxe user to upgrade their userspace. rxe users are
still already quite rare and we think a 32 bit one is non-existing.
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back).
This series addresses an IRQ performance issue (patch 1),
cleanups related to the fix for the IRQ performance problem
(patches 2-6), and then extends the fragmented completion queue
support that already exists in the net side of the driver to the
ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the
remaining 10 patches to the IB subsystem. This series extends
the current 'representor' framework when the mlx5 driver is in
switchdev mode from being a netdev only construct to being a
netdev/IB dev construct. The IB dev is limited to raw Eth queue
pairs only, but by having an IB dev of this type attached to the
representor for a switchdev port, it enables DPDK to work on the
switchdev device.
* All net related, but needed as infrastructure for the rdma
driver
- Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users
need to set the port for ib_srpt to listen on in configfs in order
for it to be enabled
(/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while
working on new code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from
Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
memory' user API features
- Infrastructure updates for the new IOCTL interface, based on
increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64
bit kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them"
* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
IB/mlx5: Device memory mr registration support
net/mlx5: Mkey creation command adjustments
IB/mlx5: Device memory support in mlx5_ib
net/mlx5: Query device memory capabilities
IB/uverbs: Add device memory registration ioctl support
IB/uverbs: Add alloc/free dm uverbs ioctl support
IB/uverbs: Add device memory capabilities reporting
IB/uverbs: Expose device memory capabilities to user
RDMA/qedr: Fix wmb usage in qedr
IB/rxe: Removed GID add/del dummy routines
RDMA/qedr: Zero stack memory before copying to user space
IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
IB/mlx5: Add information for querying IPsec capabilities
IB/mlx5: Add IPsec support for egress and ingress
{net,IB}/mlx5: Add ipsec helper
IB/mlx5: Add modify_flow_action_esp verb
IB/mlx5: Add implementation for create and destroy action_xfrm
IB/uverbs: Introduce ESP steering match filter
IB/uverbs: Add modify ESP flow_action
...
This change updates the mlx5 interface to create mkey
on the device.
The updates in the command mailbox include increasing the
access mode type field to 5 bits in order to support additional
types such as MLX5_MKC_ACCESS_MODE_MEMIC which represents device
memory access type and will be used when registering MR on allocated
device memory.
All the places that use the old access mode format are adjusted as
well.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds querying of device memory capabilities by the mlx5_core
driver during initialization.
Device memory capabilities is a new capability type and structure
which contains the necessary data that is needed for future device
memory allocation.
The presence of this new capabilities struct is indicated in the
general capabilities struct which is queried first by the driver.
If the presence bit is set, the driver will also query the new
capabilities struct and save it in the device context.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the new pci_bandwidth_available() function to calculate maximum
available bandwidth through the PCI chain instead of computing it ourselves
with mlx5e_get_pci_bw().
This is used to detect when the device is capable of more bandwidth than is
available in the current slot. The driver may adjust compression settings
accordingly.
Note that pci_bandwidth_available() accounts for PCIe encoding overhead, so
it is more accurate than mlx5e_get_pci_bw() was.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
[bhelgaas: remove mlx5e_get_pci_bw() wrapper altogether]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Use pcie_print_link_status() to report PCIe link speed and possible
limitations.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The default TX moderation mode was mistakenly set to CQE based. The
intention was to add a control ability in order to improve some specific
use-cases. In general, we prefer to use EQE based moderation as it gives
much better numbers for the common cases.
CQE based causes a degradation in the common case since it resets the
moderation timer on CQE generation. This causes an issue when TSO is
well utilized (large TSO sessions). The timer is set to 16us so traffic
of ~64KB TSO sessions per second would mean timer reset (CQE per TSO
session -> long time between CQEs). In this case we quickly reach the
tcp_limit_output_bytes (256KB by default) and cause a halt in TX traffic.
By setting EQE based moderation we make sure timer would expire after
16us regardless of the packet rate.
This fixes an up to 40% packet rate and up to 23% bandwidth degradtions.
Fixes: 0088cbbc4b ("net/mlx5e: Enable CQE based moderation on TX CQ")
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Refine the RX check summing handling to propagate the
hardware provided checksum so that we do not have to
compute it later in software.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 145307460b ("devlink: Remove top_hierarchy arg to
devlink_resource_register"), the "top_hierarchy" parameter to
devlink_resource_register() was removed in favor of using the parameter
"parent_resource_id" exclusively to determine who the parent is. The
root node's resource ID for this purpose is
DEVLINK_RESOURCE_ID_PARENT_TOP with the value 0. It is therefore
problematic that the resource MLXSW_SP_RESOURCE_KVD has also ID of 0.
Fix this by numbering driver-specific resources from 1.
Fixes: 145307460b ("devlink: Remove top_hierarchy arg to devlink_resource_register")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass struct mlxsw_core instead of devlink since it is nicer within mlxsw
code and we need both structs in mlxsw_sp_kvdl_resources_register()
anyway.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As struct mlxsw_config_profile is mapped to the payload of the FW
command of the same name, resources_query_enable flag does not belong
there. Move it to struct mlxsw_driver.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check should be done directly in mlxsw_pci_config_profile, as for
other profile items. Also, be consistent in naming with the rest and
rename to "used_kvd_sizes".
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First arg of these helpers should be "mlxsw_core".
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should not be part of the struct, as the struct fields
are tightly coupled with the FW command payload of the same name.
Just use the "granularity" define directly, as in other places.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parts info is array. The parts copy this info array, yet they are a
list. So make the indexing according to the id and change the list of
parts into array of parts. This helps to eliminate lookups and
constructs like mlxsw_sp_kvdl_part_update() (took me some non-trivial
time to figure out what is going on there).
Alongside with that, introduce a helper macro to define the parts infos.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink_resource_ops should be const as the arg of register function is
also const.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code uses global variables, adjusts them and passes pointer down
to devlink. With every other mlxsw_core instance, the previously passed
pointer values are rewritten. Fix this by de-globalize the variables.
Fixes: 7f47b19bd7 ("mlxsw: spectrum_kvdl: Add support for per part occupancy")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon a new UMR post, check if the WQE buffer contains
a previous UMR WQE. If so, modify the dynamic fields
instead of a whole WQE overwrite. This saves a memcpy.
In current setting, after 2 WQ cycles (12 UMR posts),
this will always be the case.
No degradation sensed.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
All UMR WQEs of an RQ share many common fields. We use
pre-initialized structures to save calculations in datapath.
One field (xlt_offset) was the only reason we saved a pre-initialized
copy per WQE index.
Here we remove its initialization (move its calculation to datapath),
and reduce the number of copies to one-per-RQ.
A very small datapath calculation is added, it occurs once per a MPWQE
(i.e. once every 256KB), but reduces memory consumption and gives
better cache utilization.
Performance testing:
Tested packet rate, no degradation sensed.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add XDP support over Striding RQ.
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.
Upon a MPWQE free, do not release pages that are being
XDP xmit, they will be released upon completions.
Striding RQ is capable of a higher packet-rate than
conventional RQ.
A performance gain is expected for all cases that had
a HW packet-rate bottleneck. This is the case whenever
using many flows that distribute to many cores.
Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).
XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate | 100GbE linerate | pct% |
--------------------------------------------------
| 64byte | 126.2 Mpps | 148.0 Mpps | 85% |
| 128byte | 80.0 Mpps | 84.8 Mpps | 94% |
| 256byte | 42.7 Mpps | 42.7 Mpps | 100% |
| 512byte | 23.4 Mpps | 23.4 Mpps | 100% |
--------------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Make the xdp_xmit indication available for Striding RQ
by taking it out of the type-specific union.
This refactor is a preparation for a downstream patch that
adds XDP support over Striding RQ.
In addition, use a bitmap instead of a boolean for possible
future flags.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When modifying the page mapping of a HW memory region
(via a UMR post), post the new values inlined in WQE,
instead of using a data pointer.
This is a micro-optimization, inline UMR WQEs of different
rings scale better in HW.
In addition, this obsoletes a few control flows and helps
delete ~50 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Do not busy-wait a pending UMR completion. Under high HW load,
busy-waiting a delayed completion would fully utilize the CPU core
and mistakenly indicate a SW bottleneck.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gets the process of a UMR WQE post in one function,
in preparation for a downstream patch that inlines
the WQE data.
No functional change here.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In Striding RQ, each WQE serves multiple packets
(hence called Multi-Packet WQE, MPWQE).
The size of a MPWQE is constant (currently 256KB).
Upon a ringparam set operation, we calculate the number of
MPWQEs per RQ. For this, first it is needed to determine the
number of packets that can reside within a single MPWQE.
In this patch we use the actual MTU size instead of ETH_DATA_LEN
for this calculation.
This implies that a change in MTU might require a change
in Striding RQ ring size.
In addition, this obsoletes some WQEs-to-packets translation
functions and helps delete ~60 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>