linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
stephen hemminger	31975e27a4	mlx4: sizeof style usage The kernel coding style is to treat sizeof as a function (ie. with parenthesis) not as an operator. Also use kcalloc and kmalloc_array Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:01:57 -07:00
Leon Romanovsky	8900b894e7	{net, IB}/mlx4: Remove gfp flags argument The caller to the driver marks GFP_NOIO allocations with help of memalloc_noio-* calls now. This makes redundant to pass down to the driver gfp flags, which can be GFP_KERNEL only. The patch removes the gfp flags argument and updates all driver paths. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-07-17 21:21:24 -04:00
Inbar Karmy	ec327f7a43	net/mlx4_en: Do not allocate redundant TX queues when TC is disabled Currently the number of TX queues that are allocated doesn't depend on the number of TCs, the module always loads with max num of UP per channel. In order to prevent the allocation of unnecessary memory, the module will load with minimum number of UPs per channel, and the user will be able to control the number of TX queues per channel by changing the number of TC to 8 using the tc command. The variable num_up will hold the information about the current number of UPs. Due to the change, needed to remove the lines that set the value of UP to be different than zero in the func "mlx4_en_select_queue", since now the num of TX queues that are allocated is only one per channel in default. In order not to force the UP to be zero in case of only one TC, added a condition before forcing it in the func "mlx4_en_fill_qp_context". Tested: After the module is loaded with minimum number of UP per channel, to increase num of TCs to 8, use: tc qdisc add dev ens8 root mqprio num_tc 8 In order to decrease the number of TCs to minimum number of UP per channel, use: tc qdisc del dev ens8 root Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: Tarick Bedeir <tarick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-29 15:56:15 -04:00
Tariq Toukan	4c07c13240	net/mlx4_en: Refactor mlx4_en_free_tx_desc Some code re-ordering, functionally equivalent. - The !tx_info->inl check is evaluated anyway in both flows (common case/end case). Run it first, this might finish the flows earlier. - dma_unmap calls are identical in both flows, get it out of the if block into the common area. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Gain is too small to be measurable, no degradation sensed. Results are similar for IPv4 and IPv6. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	9573e0d39f	net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift operation instead of a multiplication with TXBB_SIZE. Operations are equivalent as TXBB_SIZE is a power of two. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Gain is too small to be measurable, no degradation sensed. Results are similar for IPv4 and IPv6. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	6c78511b05	net/mlx4_en: Poll XDP TX completion queue in RX NAPI Instead of having their own NAPIs, XDP TX completion queues get polled within the corresponding RX NAPI. This prevents any possible race on TX ring prod/cons indices, between the context that issues the transmits (RX NAPI) and the context that handles the completions (was previously done in a separate NAPI). This also improves performance, as it decreases the number of NAPIs running on a CPU, saving the overhead of syncing and switching between the contexts. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Single queue no-RSS optimization ON. XDP_TX packet rate: ------------------------------------- | Before | After | Gain | IPv4 | 12.0 Mpps | 13.8 Mpps | 15% | IPv6 | 12.0 Mpps | 13.8 Mpps | 15% | ------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	36ea796498	net/mlx4_en: Improve XDP xmit function Several performance improvements in XDP TX datapath, including: - Ring a single doorbell for XDP TX ring per NAPI budget, instead of doing it per a lower threshold (was 8). This includes removing the flow of immediate doorbell ringing in case of a full TX ring. - Compiler branch predictor hints. - Calculate values in compile time rather than in runtime. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Single queue no-RSS optimization ON. XDP_TX packet rate: ------------------------------------- | Before | After | Gain | IPv4 | 10.3 Mpps | 12.0 Mpps | 17% | IPv6 | 10.3 Mpps | 12.0 Mpps | 17% | ------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	f28186d6b5	net/mlx4_en: Improve stack xmit function Several small code and performance improvements in stack TX datapath, including: - Compiler branch predictor hints. - Minimize variables scope. - Move tx_info non-inline flow handling to a separate function. - Calculate data_offset in compile time rather than in runtime (for !lso_header_size branch). - Avoid trinary-operator ("?") when value can be preset in a matching branch. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Gain is too small to be measurable, no degradation sensed. Results are similar for IPv4 and IPv6. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	cc26a49086	net/mlx4_en: Improve transmit CQ polling Several small performance improvements in TX CQ polling, including: - Compiler branch predictor hints. - Minimize variables scope. - More proper check of cq type. - Use boolean instead of int for a binary indication. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet-rate tests for both regular stack and XDP use cases: No noticeable gain, no degradation. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	cf97050d54	net/mlx4_en: Remove unused argument in TX datapath function Remove owner argument, as it is obsolete and unused. This also saves the overhead of calculating its value in data-path. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:22 -04:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Eric Dumazet	75d04aa368	mlx4: trust shinfo->gso_segs mlx4 is the only driver in the tree making a point to recompute shinfo->gso_segs. Lets remove superfluous code. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-06 13:27:49 -07:00
Eric Dumazet	acd7628de0	mlx4: reduce rx ring page_cache size We only need to store the page and dma address. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	69ba943151	mlx4: dma_dir is a mlx4_en_priv attribute No need to duplicate it for all queues and frags. num_frags & log_rx_info become u8 to save space. u8 accesses are a bit faster than u16 anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Alaa Hleihel	4b5e5b7ece	net/mlx4_core: Get num_tc using netdev_get_num_tc Avoid reading num_tc directly from struct net_device, but use the helper function netdev_get_num_tc. Fixes: `bc6a4744b8` ("net/mlx4_en: num cores tx rings for every UP") Fixes: `f5b6345ba8` ("net/mlx4_en: User prio mapping gets corrupted when changing number of channels") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-30 15:26:42 -05:00
Martin KaFai Lau	ea3349a035	mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head() support. This patch only affects the code path when XDP is active. After testing, the tx_dropped counter is incremented if the xdp_prog sends more than wire MTU. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 14:25:13 -05:00
Eric Dumazet	e3f42f8453	mlx4: reorganize struct mlx4_en_tx_ring Goal is to reorganize this critical structure to increase performance. ndo_start_xmit() should only dirty one cache line, and access as few cache lines as possible. Add sp_ (Slow Path) prefix to fields that are not used in fast path, to make clear what is going on. After this patch pahole reports something much better, as all ndo_start_xmit() needed fields are packed into two cache lines instead of seven or eight struct mlx4_en_tx_ring { u32 last_nr_txbb; /* 0 0x4 */ u32 cons; /* 0x4 0x4 */ long unsigned int wake_queue; /* 0x8 0x8 */ struct netdev_queue * tx_queue; /* 0x10 0x8 */ u32 (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x18 0x8 */ struct mlx4_en_rx_ring * recycle_ring; /* 0x20 0x8 */ /* XXX 24 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ u32 prod; /* 0x40 0x4 */ unsigned int tx_dropped; /* 0x44 0x4 */ long unsigned int bytes; /* 0x48 0x8 */ long unsigned int packets; /* 0x50 0x8 */ long unsigned int tx_csum; /* 0x58 0x8 */ long unsigned int tso_packets; /* 0x60 0x8 */ long unsigned int xmit_more; /* 0x68 0x8 */ struct mlx4_bf bf; /* 0x70 0x18 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __be32 doorbell_qpn; /* 0x88 0x4 */ __be32 mr_key; /* 0x8c 0x4 */ u32 size; /* 0x90 0x4 */ u32 size_mask; /* 0x94 0x4 */ u32 full_size; /* 0x98 0x4 */ u32 buf_size; /* 0x9c 0x4 */ void * buf; /* 0xa0 0x8 */ struct mlx4_en_tx_info * tx_info; /* 0xa8 0x8 */ int qpn; /* 0xb0 0x4 */ u8 queue_index; /* 0xb4 0x1 */ bool bf_enabled; /* 0xb5 0x1 */ bool bf_alloced; /* 0xb6 0x1 */ u8 hwtstamp_tx_type; /* 0xb7 0x1 */ u8 * bounce_buf; /* 0xb8 0x8 */ /* --- cacheline 3 boundary (192 bytes) --- */ long unsigned int queue_stopped; /* 0xc0 0x8 */ struct mlx4_hwq_resources sp_wqres; /* 0xc8 0x58 */ /* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */ struct mlx4_qp sp_qp; /* 0x120 0x30 */ /* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */ struct mlx4_qp_context sp_context; /* 0x150 0xf8 */ /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */ cpumask_t sp_affinity_mask; /* 0x248 0x20 */ enum mlx4_qp_state sp_qp_state; /* 0x268 0x4 */ u16 sp_stride; /* 0x26c 0x2 */ u16 sp_cqn; /* 0x26e 0x2 */ /* size: 640, cachelines: 10, members: 36 */ /* sum members: 600, holes: 1, sum holes: 24 */ /* padding: 16 */ }; Instead of this silly placement : struct mlx4_en_tx_ring { u32 last_nr_txbb; /* 0 0x4 */ u32 cons; /* 0x4 0x4 */ long unsigned int wake_queue; /* 0x8 0x8 */ /* XXX 48 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ u32 prod; /* 0x40 0x4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int bytes; /* 0x48 0x8 */ long unsigned int packets; /* 0x50 0x8 */ long unsigned int tx_csum; /* 0x58 0x8 */ long unsigned int tso_packets; /* 0x60 0x8 */ long unsigned int xmit_more; /* 0x68 0x8 */ unsigned int tx_dropped; /* 0x70 0x4 */ /* XXX 4 bytes hole, try to pack */ struct mlx4_bf bf; /* 0x78 0x18 */ /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */ long unsigned int queue_stopped; /* 0x90 0x8 */ cpumask_t affinity_mask; /* 0x98 0x10 */ struct mlx4_qp qp; /* 0xa8 0x30 */ /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */ struct mlx4_hwq_resources wqres; /* 0xd8 0x58 */ /* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */ u32 size; /* 0x130 0x4 */ u32 size_mask; /* 0x134 0x4 */ u16 stride; /* 0x138 0x2 */ /* XXX 2 bytes hole, try to pack */ u32 full_size; /* 0x13c 0x4 */ /* --- cacheline 5 boundary (320 bytes) --- */ u16 cqn; /* 0x140 0x2 */ /* XXX 2 bytes hole, try to pack */ u32 buf_size; /* 0x144 0x4 */ __be32 doorbell_qpn; /* 0x148 0x4 */ __be32 mr_key; /* 0x14c 0x4 */ void * buf; /* 0x150 0x8 */ struct mlx4_en_tx_info * tx_info; /* 0x158 0x8 */ struct mlx4_en_rx_ring * recycle_ring; /* 0x160 0x8 */ u32 (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168 0x8 */ u8 * bounce_buf; /* 0x170 0x8 */ struct mlx4_qp_context context; /* 0x178 0xf8 */ /* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */ int qpn; /* 0x270 0x4 */ enum mlx4_qp_state qp_state; /* 0x274 0x4 */ u8 queue_index; /* 0x278 0x1 */ bool bf_enabled; /* 0x279 0x1 */ bool bf_alloced; /* 0x27a 0x1 */ /* XXX 5 bytes hole, try to pack */ /* --- cacheline 10 boundary (640 bytes) --- */ struct netdev_queue * tx_queue; /* 0x280 0x8 */ int hwtstamp_tx_type; /* 0x288 0x4 */ /* size: 704, cachelines: 11, members: 36 */ /* sum members: 587, holes: 6, sum holes: 65 */ /* padding: 52 */ }; Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-24 16:03:37 -05:00
Tariq Toukan	15fca2c8eb	net/mlx4_en: Add ethtool statistics for XDP cases XDP statistics are reported in ethtool, in total and per ring, as follows: - xdp_drop: the number of packets dropped by xdp. - xdp_tx: the number of packets forwarded by xdp. - xdp_tx_full: the number of times an xdp forward failed due to a full tx xdp ring. In addition, all packets that are dropped/forwarded by XDP are no longer accounted in rx_packets/rx_bytes of the ring, so that they count traffic that is passed to the stack. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:07:11 -04:00
Tariq Toukan	67f8b1dcb9	net/mlx4_en: Refactor the XDP forwarding rings scheme Separately manage the two types of TX rings: regular ones, and XDP. Upon an XDP set, do not borrow regular TX rings and convert them into XDP ones, but allocate new ones, unless we hit the max number of rings. Which means that in systems with smaller #cores we will not consume the current TX rings for XDP, while we are still in the num TX limit. XDP TX rings counters are not shown in ethtool statistics. Instead, XDP counters will be added to the respective RX rings in a downstream patch. This has no performance implications. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:07:11 -04:00
Moshe Shemesh	7a61fc86af	net/mlx4_en: Fix panic on xmit while port is down When port is down, tx drop counter update is not needed. Updating the counter in this case can cause a kernel panic as when the port is down, ring can be NULL. Fixes: `63a664b7e9` ("net/mlx4_en: fix tx_dropped bug") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-11 19:40:26 -07:00
Brenden Blanco	9ecc2d8617	net/mlx4_en: add xdp forwarding and data write support A user will now be able to loop packets back out of the same port using a bpf program attached to xdp hook. Updates to the packet contents from the bpf program is also supported. For the packet write feature to work, the rx buffers are now mapped as bidirectional when the page is allocated. This occurs only when the xdp hook is active. When the program returns a TX action, enqueue the packet directly to a dedicated tx ring, so as to avoid completely any locking. This requires the tx ring to be allocated 1:1 for each rx ring, as well as the tx completion running in the same softirq. Upon tx completion, this dedicated tx ring recycles pages without unmapping directly back to the original rx ring. In steady state tx/drop workload, effectively 0 page allocs/frees will occur. In order to separate out the paths between free and recycle, a free_tx_desc func pointer is introduced that is optionally updated whenever recycle_ring is activated. By default the original free function is always initialized. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 21:46:33 -07:00
Brenden Blanco	224e92e02a	net/mlx4_en: break out tx_desc write into separate function In preparation for writing the tx descriptor from multiple functions, create a helper for both normal and blueflame access. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 21:46:33 -07:00
Eric Dumazet	63a664b7e9	net/mlx4_en: fix tx_dropped bug 1) mlx4_en_xmit() can increment priv->stats.tx_dropped, but this variable is overwritten in mlx4_en_DUMP_ETH_STATS(). 2) This increment was not SMP safe, as a port might have many TX queues. Add a per TX ring tx_dropped to fix these issues. This is u32 as mlx4_en_DUMP_ETH_STATS() will add a 32bit field. So lets avoid bugs with SNMP agents having to cope with partial overwraps. (One of these agents being bond_fold_stats()) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-25 22:15:49 -07:00
Haggai Abramovsky	73898db043	net/mlx4: Avoid wrong virtual mappings The dma_alloc_coherent() function returns a virtual address which can be used for coherent access to the underlying memory. On some architectures, like arm64, undefined behavior results if this memory is also accessed via virtual mappings that are not coherent. Because of their undefined nature, operations like virt_to_page() return garbage when passed virtual addresses obtained from dma_alloc_coherent(). Any subsequent mappings via vmap() of the garbage page values are unusable and result in bad things like bus errors (synchronous aborts in ARM64 speak). The mlx4 driver contains code that does the equivalent of: vmap(virt_to_page(dma_alloc_coherent)), this results in an OOPs when the device is opened. Prevent Ethernet driver to run this problematic code by forcing it to allocate contiguous memory. As for the Infiniband driver, at first we are trying to allocate contiguous memory, but in case of failure roll back to work with fragmented memory. Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reported-by: David Daney <david.daney@cavium.com> Tested-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-05 23:23:05 -04:00
Alexander Duyck	09067122db	net/mlx4_en: Add support for inner IPv6 checksum offloads and TSO From what I can tell the ConnectX-3 will support an inner IPv6 checksum and segmentation offload, however it cannot support outer IPv6 headers. This assumption is based on the fact that I could see the checksum being offloaded for inner header on IPv4 tunnels, but not on IPv6 tunnels. For this reason I am adding the feature to the hw_enc_features and adding an extra check to the features_check call that will disable GSO and checksum offload in the case that the encapsulated frame has an outer IP version that is not 4. The check in mlx4_en_features_check could be removed if at some point in the future a fix is found that allows the hardware to offload segmentation/checksum on tunnels with an outer IPv6 header. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-04 13:32:27 -04:00
Eric Dumazet	fc96256c90	net/mlx4_en: fix spurious timestamping callbacks When multiple skb are TX-completed in a row, we might incorrectly keep a timestamp of a prior skb and cause extra work. Fixes: `ec693d4701` ("net/mlx4_en: Add HW timestamping (TS) support") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-26 01:13:18 -04:00
Jesper Dangaard Brouer	b4a53379a0	mlx4: use napi_consume_skb API to get bulk free operations Bulk free of SKBs happens transparently by the API call napi_consume_skb(). The napi budget parameter is usually needed by napi_consume_skb() to detect if called from netpoll. In this patch it has an extra meaning. For mlx4 driver, the mlx4_en_stop_port() call is done outside NAPI/softirq context, and cleans up the entire TX ring via mlx4_en_free_tx_buf(). The code mlx4_en_free_tx_desc() for freeing SKBs is shared with NAPI calls. To handle this shared use the zero budget indication is reused, and handled appropriately in napi_consume_skb(). To reflect this, variable is called napi_mode for the function call that needed this distinction. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-13 22:35:35 -04:00
Huy Nguyen	85743f1eb3	net/mlx4_core: Set UAR page size to 4KB regardless of system page size problem description: The current code sets UAR page size equal to system page size. The ConnectX-3 and ConnectX-3 Pro HWs require minimum 128 UAR pages. The mlx4 kernel drivers are not loaded if there is less than 128 UAR pages. solution: Always set UAR page to 4KB. This allows more UAR pages if the OS has PAGE_SIZE larger than 4KB. For example, PowerPC kernel uses 64KB system page size, with 4MB uar region, there are 4MB/2/64KB = 32 uars (half for uar, half for blueflame). This does not meet minimum 128 UAR pages requirement. With 4KB UAR page, there are 4MB/2/4KB = 512 uars which meet the minimum requirement. Note that only codes in mlx4_core that deal with firmware know that uar page size is 4KB. Codes that deal with usr page in cq and qp context (mlx4_ib, mlx4_en and part of mlx4_core) still have the same assumption that uar page size equals to system page size. Note that with this implementation, on 64KB system page size kernel, there are 16 uars per system page but only one uar is used. The other 15 uars are ignored because of the above assumption. Regarding SR-IOV, mlx4_core in hypervisor will set the uar page size to 4KB and mlx4_core code in virtual OS will obtain the uar page size from firmware. Regarding backward compatibility in SR-IOV, if hypervisor has this new code, the virtual OS must be updated. If hypervisor has old code, and the virtual OS has this new code, the new code will be backward compatible with the old code. If the uar size is big enough, this new code in VF continues to work with 64 KB uar page size (on PowerPC kernel). If the uar size does not meet 128 uars requirement, this new code is not loaded in VF and prints the same error message as the old code in Hypervisor. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 10:29:27 -05:00
Jack Morgenstein	092bf0fc80	net/mlx4_en: Explicitly set no vlan tags in WQE ctrl segment when no vlan is present We do not set the ins_vlan field to zero when no vlan id is present in the packet. Since WQEs in the TX ring are not zeroed out between uses, this oversight could result in having vlan flags present in the WQE ctrl segment when no vlan is present. Fixes: `e38af4faf0` ('net/mlx4_en: Add support for hardware accelerated 802.1ad vlan') Reported-by: Gideon Naim <gideonn@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-27 20:27:09 -07:00
Hadar Hen Zion	e38af4faf0	net/mlx4_en: Add support for hardware accelerated 802.1ad vlan To enable device support in accelerated 802.1ad vlan, the port capability "packet has vlan enable" (phv_en) should be set. Firmware won't work properly, in case phv_en is not set. The user can enable "phv_en" port capability with the new ethtool private flag phv-bit. The phv-bit private flag default value is OFF, users who are interested in 802.1ad hardware acceleration should turn ON the phv-bit private flag: $ ethtool --set-priv-flags eth1 phv-bit on Once the private flag is set, the device is ready for 802.1ad vlan acceleration. The user should also change the interface device features and turn on "tx-vlan-stag-hw-insert" which is off by default: $ ethtool -K eth1 tx-vlan-stag-hw-insert on "phv-bit" private flag setting is available only for Physical Functions(PF), the Virtual Function (VF) will be able to use the feature by setting "tx-vlan-stag-hw-insert" ethtool device feature only if the feature was enabled by the Hypervisor. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-27 15:00:37 -07:00
Hadar Hen Zion	e802f8e4c5	net/mlx4: Prepare VLAN macros for 802.1ad Hardware accelerated support To add Hardware accelerated support in 802.1ad vlan, replace Current VLAN macros to CVLAN. Replace: MLX4_WQE_CTRL_INS_VLAN MLX4_CQE_VLAN_PRESENT_MASK With: MLX4_WQE_CTRL_INS_CVLAN MLX4_CQE_CVLAN_PRESENT_MASK Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-27 15:00:37 -07:00
Ido Shamay	488a9b48e3	net/mlx4_en: Wake TX queues only when there's enough room Indication of a single completed packet, marked by txbbs_skipped being bigger than zero, is not enough in order to wake up a stopped TX queue. The completed packet may contain a single TXBB, while next packet to be sent (after the wake up) may have multiple TXBBs (LSO/TSO packets for example), causing overflow in queue followed by WQE corruption and TX queue timeout. Instead, wake the stopped queue only when there's enough room for the worst case (maximum sized WQE) packet that we should need to handle after the queue is opened again. Also created a helper routine - mlx4_en_is_tx_ring_full, which checks if the current TX ring is full or not. It provides better code readability and removes code duplication. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-25 02:06:27 -07:00
Eran Ben Elisha	0eb08514fd	net/mlx4_en: Release TX QP when destroying TX ring TX ring QP wasn't released at mlx4_en_destroy_tx_ring. Instead, the code used the deprecated base_tx_qpn field. Move TX QP release to mlx4_en_destroy_tx_ring and remove the base_tx_qpn field. Fixes: `ddae0349fd` ('net/mlx4: Change QP allocation scheme') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-25 02:06:26 -07:00
Rusty Russell	f36963c9d3	cpumask_set_cpu_local_first => cpumask_local_spread, lament `da91309e0a` (cpumask: Utility function to set n'th cpu...) created a genuinely weird function. I never saw it before, it went through DaveM. (He only does this to make us other maintainers feel better about our own mistakes.) cpumask_set_cpu_local_first's purpose is say "I need to spread things across N online cpus, choose the ones on this numa node first"; you call it in a loop. It can fail. One of the two callers ignores this, the other aborts and fails the device open. It can fail in two ways: allocating the off-stack cpumask, or through a convoluted codepath which AFAICT can only occur if cpu_online_mask changes. Which shouldn't happen, because if cpu_online_mask can change while you call this, it could return a now-offline cpu anyway. It contains a nonsensical test "!cpumask_of_node(numa_node)". This was drawn to my attention by Geert, who said this causes a warning on Sparc. It sets a single bit in a cpumask instead of returning a cpu number, because that's what the callers want. It could be made more efficient by passing the previous cpu rather than an index, but that would be more invasive to the callers. Fixes: `da91309e0a` Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (then rebased) Tested-by: Amir Vadai <amirv@mellanox.com> Acked-by: Amir Vadai <amirv@mellanox.com> Acked-by: David S. Miller <davem@davemloft.net>	2015-05-28 11:05:20 +09:30
Benjamin Poirier	42eab005a5	mlx4: Fix tx ring affinity_mask creation By default, the number of tx queues is limited by the number of online cpus in mlx4_en_get_profile(). However, this limit no longer holds after the ethtool .set_channels method has been called. In that situation, the driver may access invalid bits of certain cpumask variables when queue_index >= nr_cpu_ids. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Ido Shamay <idos@mellanox.com> Fixes: `d03a68f` ("net/mlx4_en: Configure the XPS queue mapping on driver load") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-29 15:16:57 -04:00
Alexander Duyck	12b3375f39	mlx4/mlx5: Use dma_wmb/rmb where appropriate This patch should help to improve the performance of the mlx4 and mlx5 on a number of architectures. For example, on x86 the dma_wmb/rmb equates out to a barrier() call as the architecture is already strongly ordered, and on PowerPC the call works out to a lwsync which is significantly less expensive than the sync call that was being used for wmb. I placed the new barriers between any spots that seemed to be trying to order memory/memory reads or writes, if there are any spots that involved MMIO I left the existing wmb in place as the new barriers cannot order transactions between coherent and non-coherent memories. v2: Reduced the replacements to just the spots where I could clearly identify the usage pattern. Cc: Amir Vadai <amirv@mellanox.com> Cc: Ido Shamay <idos@mellanox.com> Cc: Eli Cohen <eli@mellanox.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-09 14:25:25 -04:00
Yishai Hadas	872bf2fb69	net/mlx4_core: Maintain a persistent memory for mlx4 device Maintain a persistent memory that should survive reset flow/PCI error. This comes as a preparation for coming series to support above flows. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:43:13 -08:00
Jiri Pirko	df8a39defa	net: rename vlan_tx_* helpers since "tx" is misleading there The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-13 17:51:08 -05:00
Amir Vadai	492f5add4b	net/mlx4_en: Doorbell is byteswapped in Little Endian archs iowrite32() will byteswap its argument on big endian archs. iowrite32be() will byteswap on little endian archs. Since we don't want to do this unnecessary byteswap on the fast path, doorbell is stored in the NIC's native endianness. Using the right iowrite() according to the arch endianness. CC: Wei Yang <weiyang@linux.vnet.ibm.com> CC: David Laight <david.laight@aculab.com> Fixes: `6a4e812` ("net/mlx4_en: Avoid calling bswap in tx fast path") Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-22 16:33:10 -05:00
Eugenia Emantayev	ddae0349fd	net/mlx4: Change QP allocation scheme When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset. The current Ethernet driver code reserves a Tx QP range with 256b alignment. This is wrong because if there are more than 64 Tx QPs in use, QPNs >= base + 65 will have bits 6/7 set. This problem is not specific for the Ethernet driver, any entity that tries to reserve more than 64 BF-enabled QPs should fail. Also, using ranges is not necessary here and is wasteful. The new mechanism introduced here will support reservation for "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs (when hypervisors support WC in VMs). The flow we use is: 1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation, and request "BF enabled QPs" if BF is supported for the function 2. In the ALLOC_RES FW command, change param1 to: a. param1[23:0] - number of QPs b. param1[31-24] - flags controlling QPs reservation Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits 6 and 7 unset in order to be used in Ethernet. Bits 24-30 of the flags are currently reserved. When a function tries to allocate a QP, it states the required attributes for this QP. Those attributes are considered "best-effort". If an attribute, such as Ethernet BF enabled QP, is a must-have attribute, the function has to check that attribute is supported before trying to do the allocation. In a lower layer of the code, mlx4_qp_reserve_range masks out the bits which are unsupported. If SRIOV is used, the PF validates those attributes and masks out unsupported attributes as well. In order to notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's mailbox is filled by the PF, which notifies which QP allocation attributes it supports. 
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-11 14:47:35 -05:00
David S. Miller	55b42b5ca2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/phy/marvell.c Simple overlapping changes in drivers/net/phy/marvell.c Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-01 14:53:27 -04:00
Or Gerlitz	a4f2dacbf2	net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN For VXLAN/NVGRE encapsulation, the current HW doesn't support offloading both the outer UDP TX checksum and the inner TCP/UDP TX checksum. The driver doesn't advertize SKB_GSO_UDP_TUNNEL_CSUM, however we are wrongly telling the HW to offload the outer UDP checksum for encapsulated packets, fix that. Fixes: `837052d0cc` ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-30 19:48:58 -04:00
Eric Dumazet	477b35b44f	mlx4: use napi_schedule_irqoff() mlx4_en_rx_irq() and mlx4_en_tx_irq() run from hard interrupt context. They can use napi_schedule_irqoff() instead of napi_schedule() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-By: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-30 16:50:47 -04:00
Eric Dumazet	535114539b	net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers Add two helpers so that drivers do not have to care of BQL being available or not. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jim Davis <jim.epost@gmail.com> Fixes: `29d40c9032` ("net/mlx4_en: Use prefetch in tx path") Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-08 16:08:04 -04:00
Eric Dumazet	fe971b95c2	net/mlx4_en: remove NETDEV_TX_BUSY Drivers should avoid NETDEV_TX_BUSY as much as possible. They should stop the tx queue before qdisc even tries to push another packet, to avoid requeues. For a driver supporting skb->xmit_more, this is likely to be a prereq anyway, otherwise we could have a tx deadlock : We need to force a doorbell if TX ring is full. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-07 13:20:39 -04:00
Eric Dumazet	1556b8746e	net/mlx4_en: Use the new tx_copybreak to set inline threshold Instead of setting inline threshold using module parameter only on driver load, use set_tunable() to set it dynamically. No need to store the threshold per ring, using instead the netdev global priv->prof->inline_thold Initial value still is set using the module parameter, therefore backward compatibility is kept. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:16 -04:00
Eric Dumazet	acea73d671	net/mlx4_en: Enable the compiler to make is_inline() inlined Reorganize code to call is_inline() once, so compiler can inline it Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:16 -04:00
Eric Dumazet	e70602a8b8	net/mlx4_en: tx_info->ts_requested was not cleared Properly clear tx_info->ts_requested Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:16 -04:00
Eric Dumazet	e533ac7ea0	net/mlx4_en: Use local var for skb_headlen(skb) Access skb_headlen() once in tx flow Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:15 -04:00
Eric Dumazet	b9d8839a44	net/mlx4_en: Use local var in tx flow for skb_shinfo(skb) Access skb_shinfo(skb) once in tx flow. Also, rename @i variable to @i_frag to avoid confusion, as the "goto tx_drop_unmap;" relied on this @i variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:15 -04:00


121 Commits