This patch adds debug control parameters for congestion control which
can be read or written through debugfs. They are for reaction point and
notification point nodes.
These control parameters are as below:
+------------------------------+-----------------------------------------+
| Name | Description |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate | When set target rate is updated to |
| | current rate |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate_ati | When set update target rate based on |
| | timer as well |
|------------------------------+-----------------------------------------|
|rp_time_reset | time between rate increase if no |
| | CNP is received unit in usec |
|------------------------------+-----------------------------------------|
|rp_byte_reset | Number of bytes between rate inease if |
| | no CNP is received |
|------------------------------+-----------------------------------------|
|rp_threshold | Threshold for reaction point rate |
| | control |
|------------------------------+-----------------------------------------|
|rp_ai_rate | Rate for target rate, unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_hai_rate | Rate for hyper increase state |
| | unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_min_dec_fac | Minimum factor by which the current |
| | transmit rate can be changed when |
| | processing a CNP, unit is percerntage |
|------------------------------+-----------------------------------------|
|rp_min_rate | Minimum value for rate limit, |
| | unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_rate_to_set_on_first_cnp | Rate that is set when first CNP is |
| | received, unit is Mbps |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_g | Used to calculate alpha |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_rtt | Time between updates of alpha value, |
| | unit is usec |
|------------------------------+-----------------------------------------|
|rp_rate_reduce_monitor_period | Minimum time between consecutive rate |
| | reductions |
|------------------------------+-----------------------------------------|
|rp_initial_alpha_value | Initial value of alpha |
|------------------------------+-----------------------------------------|
|rp_gd | When CNP is received, flow rate is |
| | reduced based on gd, rp_gd is given as |
| | log2(rp_gd) |
|------------------------------+-----------------------------------------|
|np_cnp_dscp | dscp code point for generated cnp |
|------------------------------+-----------------------------------------|
|np_cnp_prio_mode | 802.1p priority for generated cnp |
|------------------------------+-----------------------------------------|
|np_cnp_prio | cnp priority mode |
+------------------------------+-----------------------------------------+
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The old logic ignored link state. This led to missing IB events like
when link goes down on the switch while admin state is up or to redundant
events like when admin state goes up while link is down.
To fix that, probe the port state on NETDEV events and compare to last
known state to decide if IB events needs to be dispatched.
FIxes: 5ec8c83e3a ("IB/mlx5: Port events in RoCE now rely on netdev events")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, unicast/multicast loopback raw ethernet
(non-RDMA) packets are sent back to the vport.
A unicast loopback packet is the packet with destination
MAC address the same as the source MAC address.
For multicast, the destination MAC address is in the
vport's multicast filter list.
Moreover, the local loopback is not needed if
there is one or none user space context.
After this patch, the raw ethernet unicast and multicast
local loopback are disabled by default. When there is more
than one user space context, the local loopback is enabled.
Note that when local loopback is disabled, raw ethernet
packets are not looped back to the vport and are forwarded
to the next routing level (eswitch, or multihost switch,
or out to the wire depending on the configuration).
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
"umem" is a valid pointer. We intended to print "*umem" or even just
"err" instead.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The failure in creation of debugfs entries for mr_cache left entries,
which were already created.
It caused to mismatch and misguiding for the end users. The solution
is to clean mr_cache debugfs root, so no leftovers will be in the
system. In addition, let's document why the error is not needed to be
forwarded to user in case of failure.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_map_mr_sg() can pass an SG-list to .map_mr_sg() that is larger
than what fits into a single MR. .map_mr_sg() must not attempt to
map more SG-list elements than what fits into a single MR.
Hence make sure that mlx5_ib_sg_to_klms() does not write outside
the MR klms[] array.
Fixes: b005d31647 ("mlx5: Add arbitrary sg list support")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: <stable@vger.kernel.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
- 2 Fixes for OPA found by debug kernel
- 1 Fix for user supplied input causing kernel problems
- 1 Fix for the IPoIB fixes submitted around -rc4
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXVfPAAoJELgmozMOVy/dRRcP/AiN4wyEQ897se1fKXAktL1g
a17tiSkK2MukAVHbM++9Ea/YXK66e2s7Ls8Pd230E85N3V48rSUhWZUIUQLOm+gS
b98z53uNs6KkdBCezXABsHIi4PB6u1CfzaFaUfN5WI3ymAgsYqpQWMtNyO6GNe/R
Dur3vDieXPNJ2x+F1jiNxHFBXLKofCG0y1FX88zqsQI5vVVq7ASKgaaSX3T1emQY
18l4Dd7pesrWj4QD9jaqQiYkruF5VC1NE8/he8Zzy6XjSgnUZZfjbjuMptbW4y3y
Tvvd5bjMAkJhCbK1mhe1dZHPlYJhAguUBZfThjVSKtiMGwRhGA4SYkRtek3nZOga
/OLhERgj0VomHx7o+Pwp74DWnsSv08EMoc4hXKHZPPyxok83r9czejqm7mC2VbGd
Sa8LmVeLQp79e9MbGAj+PbNRHf9CE9dnLeFUmbj+qptXUVGvT8j9U1a9iTjTz0+2
NX/O4iWjtnt/CIkH9dhN9aWolswbmO2jSmmzb/x2EuCLv94GNtTyZLSifvxSYMnN
IWO86aGQmuUkWJ3RI/5tzq+gVzI6bdKB9hG5DOPWN/uJVF9nWkq3c69Bv9djvUoM
xi/rI0grxTqYHelRx3ja4ZqaI43R6YwL928XdtZJKQ/uNanq65Lyd6KKz3W7hT0l
emCoqb2MjuzsNWIPkSgg
=JEor
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma update from Doug Ledford:
"This includes two bugs against the newly added opa vnic that were
found by turning on the debug kernel options:
- sleeping while holding a lock, so a one line fix where they
switched it from GFP_KERNEL allocation to a GFP_ATOMIC allocation
- a case where they had an isolated caller of their code that could
call them in an atomic context so they had to switch their use of a
mutex to a spinlock to be safe, so this was considerably more lines
of diff because all uses of that lock had to be switched
In addition, the bug that was discussed with you already about an out
of bounds array access in ib_uverbs_modify_qp and ib_uverbs_create_ah
and is only seven lines of diff.
And finally, one fix to an earlier fix in the -rc cycle that broke
hfi1 and qib in regards to IPoIB (this one is, unfortunately, larger
than I would like for a -rc7 submission, but fixing the problem
required that we not treat all devices as though they had allocated a
netdev universally because it isn't true, and it took 70 lines of diff
to resolve the issue, but the final patch has been vetted by Intel and
Mellanox and they've both given their approval to the fix).
Summary:
- Two fixes for OPA found by debug kernel
- Fix for user supplied input causing kernel problems
- Fix for the IPoIB fixes submitted around -rc4"
[ Doug sent this having not noticed the 4.12 release, so I guess I'll be
getting another rdma pull request with the actuakl merge window
updates and not just fixes.
Oh well - it would have been nice if this small update had been the
merge window one. - Linus ]
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/core, opa_vnic, hfi1, mlx5: Properly free rdma_netdev
RDMA/uverbs: Check port number supplied by user verbs cmds
IB/opa_vnic: Use spinlock instead of mutex for stats_lock
IB/opa_vnic: Use GFP_ATOMIC while sending trap
IPOIB is calling free_rdma_netdev even though alloc_rdma_netdev has
returned -EOPNOTSUPP.
Move free_rdma_netdev from ib_device structure to rdma_netdev structure
thus ensuring proper cleanup function is called for the rdma net device.
Fix the following trace:
ib0: Failed to modify QP to ERROR state
BUG: unable to handle kernel paging request at 0000000000001d20
IP: hfi1_vnic_free_rn+0x26/0xb0 [hfi1]
Call Trace:
ipoib_remove_one+0xbe/0x160 [ib_ipoib]
ib_unregister_device+0xd0/0x170 [ib_core]
rvt_unregister_device+0x29/0x90 [rdmavt]
hfi1_unregister_ib_device+0x1a/0x100 [hfi1]
remove_one+0x4b/0x220 [hfi1]
pci_device_remove+0x39/0xc0
device_release_driver_internal+0x141/0x200
driver_detach+0x3f/0x80
bus_remove_driver+0x55/0xd0
driver_unregister+0x2c/0x50
pci_unregister_driver+0x2a/0xa0
hfi1_mod_cleanup+0x10/0xf65 [hfi1]
SyS_delete_module+0x171/0x250
do_syscall_64+0x67/0x150
entry_SYSCALL64_slow_path+0x25/0x25
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reserved gids are taken by the mlx5_core, report smaller GID table
size to IB core.
Set mlx5_query_roce_port's return value back to int. In case of
error, return an indication. This rolls back some of the change
in commit 50f22fd8ec ("IB/mlx5: Set mlx5_query_roce_port's return value to void")
Change set_roce_addr to use gid_set function, instead of directly
sending the command.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed few places where endianness was misspelled and
one spot whwere output was:
CHECK: 'endianess' may be misspelled - perhaps 'endianness'?
CHECK: 'ouput' may be misspelled - perhaps 'output'?
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cache the needed umr_fence and set the wqe ctrl segmennt
accordingly.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit a7c3e901a4 ("mm: introduce kv[mz]alloc helpers") added
proper implementation of mlx5_vzalloc function to the MM core.
This made the mlx5_vzalloc function useless, so let's remove it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Enable mlx5 IPoIB acceleration by declaring
mlx5_ib_{alloc,free}_rdma_netdev and assigning the mlx5
IPoIB rdma_netdev callbacks.
In addition, this patch brings in sync mlx5's IPoIB parts for net and IB
trees. As a precaution, we disabled IPoIB acceleration by default (in
the mlx5_core Kconfig file).
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add port_xmit_wait to the error counters read by mlx5_ib_process_mad to
ensure sysfs port counter provides correct value for PortXmitWait.
Otherwise the sysfs port_xmit_wait file always contains zero.
The previous MAD_IFC implementation populated this counter, but it was
removed during the migration to PPCNT for error counters (32-bit only).
Signed-off-by: Tim Wright <tim@binbash.co.uk>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In case we got an initial sg_offset, we need to
account for it in the mr length.
Cc: stable@vger.kernel.org
Fixes: ff2ba99365 ("IB/core: Add passing an offset into the SG to
ib_map_mr_sg")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma_ah_attr can now be either ib or roce allowing
core components to use one type or the other and also
to define attributes unique to a specific type. struct
ib_ah is also initialized with the type when its first
created. This ensures that calls such as modify_ah
dont modify the type of the address handle attribute.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Modify core and driver components to use accessor functions
introduced to access individual fields of rdma_ah_attr
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
local function to_ib_ah_attr is renamed so it in
sync with the rename of the ib_ah_attr structure
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch simply renames struct ib_ah_attr to
rdma_ah_attr as these fields specify attributes that are
not necessarily specific to IB.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Internally MW implemented as KLM MKey and filled by userspace UMR
postsends. Handle pagefault trigered by operations on this MKeys.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To make page fault handling code more flexible
split pagefault_single_data_segment() function.
Keep MR resolution in pagefault_single_data_segment() and
move actual updates into pagefault_single_mr().
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currenlty ODP supports only regular MMU pages.
Add ODP support for regions consisting of physically contiguous chunks
of arbitrary order (huge pages for instance) to improve performance.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When implicit MR's leaf MKey becomes unused, i.e. when it's
last page being released my MMU invalidation it is marked as "dying"
and scheduled for release by garbage collector.
Currentle consequent page fault may remove "dying" flag.
Treat leaf MKey as non-existent once it was scheduled to removal
by GC.
Fixes: 81713d3788 ('IB/mlx5: Add implicit MR support')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Translation table updates of large UMR may require multiple post send
operations. The last operations can be in various lengths, but current
code set them to be the same length.
Fixes: 7d0cc6edcc ('IB/mlx5: Add MR cache for large UMR regions')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In memory shortage path we fall back to use spare buffer.
mlx5_ib_update_xlt() called from ib_uverbs_reg_mr when ibmr.ucontext
not initialized yet.
Scenario how to test it:
1. trigger memory exhaustion so __get_free_pages(GFP_KERNEL, 4) will fail
2. register MR
3. there should be no kernel oops
Fixes: 7d0cc6edcc ('IB/mlx5: Add MR cache for large UMR regions')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Size of pages are held by struct ib_umem in page_size field.
It is better to store it as an exponent, because page size by nature
is always power-of-two and used as a factor, divisor or ilog2's argument.
The conversion of page_size to be page_shift allows to have portable
code and avoid following error while compiling on ARM:
ERROR: "__aeabi_uldivmod" [drivers/infiniband/core/ib_core.ko] undefined!
CC: Selvin Xavier <selvin.xavier@broadcom.com>
CC: Steve Wise <swise@chelsio.com>
CC: Lijun Ou <oulijun@huawei.com>
CC: Shiraz Saleem <shiraz.saleem@intel.com>
CC: Adit Ranadive <aditr@vmware.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Ram Amrani <Ram.Amrani@Cavium.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Ram Amrani <Ram.Amrani@cavium.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add missing calculation and translation of active_width and
active_speed for RoCE.
Fixes: 3f89a643eb ('IB/mlx5: Extend query_device/port to ...')
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In case of an error, the properties reported to user
are zeroed out, so no need for a return value.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a difference when parsing a completion entry between Ethernet
and IB ports. When link layer is Ethernet the bits describe the type of
L3 header in the packet. In the case when link layer is Ethernet and VLAN
header is present the value of SL is equal to the 3 UP bits in the VLAN
header. If VLAN header is not present then the SL is undefined and consumer
of the completion should check if IB_WC_WITH_VLAN is set.
While that, this patch also fills the vlan_id field in the completion if
present.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds support to query the congestion related hardware counters
through new command and links them with other hw counters being available
in hw_counters sysfs location.
In order to reuse existing infrastructure it renames related q_counter
data structures to more generic counters to reflect q_counters and
congestion counters and maybe some other counters in the future.
New hardware counters:
* rp_cnp_handled - CNP packets handled by the reaction point
* rp_cnp_ignored - CNP packets ignored by the reaction point
* np_cnp_sent - CNP packets sent by notification point to respond to
CE marked RoCE packets
* np_ecn_marked_roce_packets - CE marked RoCE packets received by
notification point
It also avoids returning ENOSYS which is specific for invalid
system call and produces the following checkpatch.pl warning.
WARNING: ENOSYS means 'invalid syscall nr' and nothing else
+ return -ENOSYS;
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A drop rule is described by an action drop and no destination.
If a user specified IB_FLOW_SPEC_ACTION_DROP then set the action
to MLX5_FLOW_CONTEXT_ACTION_DROP and clear the destination.
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This change adds the ability for flow steering to classify IPv4/6
packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP
packets and hit IPv4/6 classifed steering rules.
When user added a flow rule with IP classification, driver was
implicitly adding ethertype matching to the created rule in order
to distinguish between IPv4 and IPv6 protocols.
Since IP packets with MPLS tag header have MPLS ethertype, they missed
the rule and ended up hitting the default filters.
Such behavior prevented from MPLS packets to undergo inbound traffic
load balancing flows (if such were defined by configuring RSS) to
achieve higher throughput - the way that non-MPLS IP packets performed.
Since our device is able to look past the MPLS tag and identify the
next protocol we introduce this solution which replaces Ethertype
matching by the device's capability to perform IP version parsing
and matching in order to distinguish between IPv4 and IPv6.
Therefore, whenever a flow with IP spec is added and device support IP
version matching, driver will implicitly add IP version matching to the
rule (Based on the IP spec type) without Ethertype matching which will
cause relevant MPLS tagged packets to hit this rule as well.
Otherwise (device doesn't support IP version matching), we fall back to
setting Ethertype matching.
If the user's filters specify an L2 ethertype and an IP spec
the rule will then match both the ethertype and the IP version.
The device's support for IP version matching is reported by the
device via dedicated capability bit in query_device_cap and named
outer/inner_ip_version.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This change fixes an incomplete validation of the user's
flow attributes list.
Previous implementation validated only matching of IPv4 Ethertype
to IPv4 spec of outer headers (in case both Ethernet with specified
Ethertype and IP specs were present) and lacked the validation of:
1. Matching of IPv6 Ethertype in Ethernet spec (if such exists) to an
IPv6 protocol spec (if such exists).
2. Validation of Ethertype to IP protocol matching on inner headers specs.
Which could cause some combinations of unmatching Ethernet and IP
protocols to pass validation and apply on the device.
The fix adds validation of IPv6 Ethertype and IP spec as well as
performing the scan on both outer and inner attributes.
Fixes: 038d2ef875 ("Add flow steering support")
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The kfree was called to free cqb, while it should free *cqb.
Fixes: 1cbe6fc86c ("IB/mlx5: Add support for CQE compressing")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to enlarge the flow group size to 8k, we decrease
the number of flow group types to 6 and increase the flow
table size to 64k.
Flow group size is calculated as follow:
group_size = table_size / (#group_types + 1)
Fixes: 038d2ef875 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Check that the required flow table size is supported
by device. Return ENOMEM error if no space left.
In addition change the create flow table routine
to return ENOMEM instead of ENOSPC.
Fixes: 038d2ef875 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Anonymous VMA (->vm_ops == NULL) cannot be shared, otherwise
it would lead to SIGBUS.
Remove the shared flags from the vma after we change it to be
anonymous.
This is easily reproduced by doing modprobe -r while running a
user-space application such as raw_ethernet_bw.
Fixes: 7c2344c3bb ('IB/mlx5: Implements disassociate_ucontext API')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When the driver disassociate user context, it changes the vma to
anonymous by setting the vm_ops to null and zap the vma ptes.
In order to avoid race in the kernel, we need to take write lock
before we change the vma entries.
Fixes: 7c2344c3bb ('IB/mlx5: Implements disassociate_ucontext API')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add check for bit IB_QP_CREATE_NETIF_QP while creating QP.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement mlx5e's IPoIB SKB transmit using the helper functions provided
by mlx5e ethernet tx flow, the only difference in the code between
mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
(UD datagram segment) in the TX descriptor (WQE) and it doesn't need to
have any vlan handling.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
But first update usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bart Van Assche noted that the ib DMA mapping code was significantly
similar enough to the core DMA mapping code that with a few changes
it was possible to remove the IB DMA mapping code entirely and
switch the RDMA stack to use the core DMA mapping code. This resulted
in a nice set of cleanups, but touched the entire tree. This branch
will be submitted separately to Linus at the end of the merge window
as per normal practice for tree wide changes like this.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYo06oAAoJELgmozMOVy/d9Z8QALedWHdu98St1L0u2c8sxnR9
2zo/4sF5Vb9u7FpmdIX32L4SQ9s9KhPE8Qp8NtZLf9v10zlDebIRJDpXknXtKooV
CAXxX4sxBXV27/UrhbZEfXiPrmm6ccJFyIfRnMU6NlMqh2AtAsRa5AC2/RMp8oUD
Med97PFiF0o6TD22/UH1VFbRpX1zjaKyqm7a3as5sJfzNA+UGIZAQ7Euz8000DKZ
xCgVLTEwS0FmOujtBkCst7xa9TjuqR1HLOB4DdGvAhP6BHdz2yamM7Qmh9NN+NEX
0BtjsuXomtn6j6AszGC+bpipCZh3NUigcwoFAARXCYFHibBvo4DPdFeGsraFgXdy
1+KyR8CCeQG3Aly5Vwr264RFPGkGpwMj8PsBlXgQVtrlg4rriaCzOJNmIIbfdADw
ftqhxBOzReZw77aH2s+9p2ILRfcAmPqhynLvFGFo9LBvsik8LVso7YgZN0xGxwcI
IjI/XGC8UskPVsIZBIYA6sl2bYzgOjtBIHiXjRrPlW3uhduIXLrvKFfLPP/5XLAG
ehLXK+J0bfsyY9ClmlNS8oH/WdLhXAyy/KNmnj5bRRm9qg6BRJR3bsOBhZJODuoC
XgEXFfF6/7roNESWxowff7pK0rTkRg/m/Pa4VQpeO+6NWHE7kgZhL6kyIp5nKcwS
3e7mgpcwC+3XfA/6vU3F
=e0Si
-----END PGP SIGNATURE-----
Merge tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma DMA mapping updates from Doug Ledford:
"Drop IB DMA mapping code and use core DMA code instead.
Bart Van Assche noted that the ib DMA mapping code was significantly
similar enough to the core DMA mapping code that with a few changes it
was possible to remove the IB DMA mapping code entirely and switch the
RDMA stack to use the core DMA mapping code.
This resulted in a nice set of cleanups, but touched the entire tree
and has been kept separate for that reason."
* tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits)
IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it
IB/core: Remove ib_device.dma_device
nvme-rdma: Switch from dma_device to dev.parent
RDS: net: Switch from dma_device to dev.parent
IB/srpt: Modify a debug statement
IB/srp: Switch from dma_device to dev.parent
IB/iser: Switch from dma_device to dev.parent
IB/IPoIB: Switch from dma_device to dev.parent
IB/rxe: Switch from dma_device to dev.parent
IB/vmw_pvrdma: Switch from dma_device to dev.parent
IB/usnic: Switch from dma_device to dev.parent
IB/qib: Switch from dma_device to dev.parent
IB/qedr: Switch from dma_device to dev.parent
IB/ocrdma: Switch from dma_device to dev.parent
IB/nes: Remove a superfluous assignment statement
IB/mthca: Switch from dma_device to dev.parent
IB/mlx5: Switch from dma_device to dev.parent
IB/mlx4: Switch from dma_device to dev.parent
IB/i40iw: Remove a superfluous assignment statement
IB/hns: Switch from dma_device to dev.parent
...
Because the Mellanox code required being based on a net-next tree,
I keept it separate from the remainder of the RDMA stack submission
that is based on 4.10-rc3.
This branch contains:
- Various mlx4 and mlx5 fixes and minor changes
- Support for adding a tag match rule to flow specs
- Support for cvlan offload operation for raw ethernet QPs
- A change to the core IB code to recognize raw eth capabilities and
enumerate them (touches non-Mellanox code)
- Implicit On-Demand Paging memory registration support
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYrx+WAAoJELgmozMOVy/du70P/1kpW2xY9Le04c3K7na2XOYl
AUVIDrW/8Go63tpOaM7jBT3k4GlwVFr3IOmBpS24KbW/THxjhyUeP5L5+z2x+go+
jkQOgtPWWEHr5zP3MzsNyB8fDx1YQOnJwEXxybQRW/cbw4CLjnhP+ezd6FdV/3Yy
pPEqDVlAErzvNweG+n2r1pjcUbR8uneC3inyMLnyzUBz4CHKmC8fgD3/qJIM+DNb
gtFT5xHFIXKCigWdQ/EwsTDcHub43V8OXlI5sO7loG6vToOUATMkjI4oOUNhDmYS
X7XLN3yRK9QHEfb5kutXIZEWzTGh7LiFtUYGaNNYqqzDfSiMRc9NC5kTOfplEXDV
Uo+AGb6Fh1zYIOzNk7o+tazIv3LaLv6+Fcm+9bbe0VUIqasaylsePqaTwMuIzx/I
xP5nitmd5lbYo8WdlasVdG6mH1DlJEUbU30v4DpmTpxCP6jGpog7lexyGyF3TgzS
NhnG0IiIClWh3WQ2/GdsFK/obIdFkpLeASli1hwD81vzPfly9zc2YpgqydZI3WCr
q6hTXYnANcP6+eciCpQPO7giRdXdiKey08Uoq/2jxb7Qbm4daG6UwopjvH9/lm1F
m6UDaDvzNYm+Rx+bL/+KSx9JO9+fJB1L51yCmvLGpWi6yJI4ZTfanHNMBsCua46N
Kev/DSpIAzX1WOBkte+a
=rspQ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull Mellanox rdma updates from Doug Ledford:
"Mellanox specific updates for 4.11 merge window
Because the Mellanox code required being based on a net-next tree, I
keept it separate from the remainder of the RDMA stack submission that
is based on 4.10-rc3.
This branch contains:
- Various mlx4 and mlx5 fixes and minor changes
- Support for adding a tag match rule to flow specs
- Support for cvlan offload operation for raw ethernet QPs
- A change to the core IB code to recognize raw eth capabilities and
enumerate them (touches non-Mellanox code)
- Implicit On-Demand Paging memory registration support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (40 commits)
IB/mlx5: Fix configuration of port capabilities
IB/mlx4: Take source GID by index from HW GID table
IB/mlx5: Fix blue flame buffer size calculation
IB/mlx4: Remove unused variable from function declaration
IB: Query ports via the core instead of direct into the driver
IB: Add protocol for USNIC
IB/mlx4: Support raw packet protocol
IB/mlx5: Support raw packet protocol
IB/core: Add raw packet protocol
IB/mlx5: Add implicit MR support
IB/mlx5: Expose MR cache for mlx5_ib
IB/mlx5: Add null_mkey access
IB/umem: Indicate that process is being terminated
IB/umem: Update on demand page (ODP) support
IB/core: Add implicit MR flag
IB/mlx5: Support creation of a WQ with scatter FCS offload
IB/mlx5: Enable QP creation with cvlan offload
IB/mlx5: Enable WQ creation and modification with cvlan offload
IB/mlx5: Expose vlan offloads capabilities
IB/uverbs: Enable QP creation with cvlan offload
...
When the "ib_virt" cap is set, configuration of port capabilities need
to be done through mlx5_core_modify_hca_vport_context.
Since modify_hca_vport_context accepts mask and value, there is no need
to read the port capabilities and calculate the new cap values so we
avoid the mutex when ib_virt is set.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>