linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
Eric Dumazet	e3f42f8453	mlx4: reorganize struct mlx4_en_tx_ring Goal is to reorganize this critical structure to increase performance. ndo_start_xmit() should only dirty one cache line, and access as few cache lines as possible. Add sp_ (Slow Path) prefix to fields that are not used in fast path, to make clear what is going on. After this patch pahole reports something much better, as all ndo_start_xmit() needed fields are packed into two cache lines instead of seven or eight: struct mlx4_en_tx_ring { u32 last_nr_txbb; /* 0 0x4 */ u32 cons; /* 0x4 0x4 */ long unsigned int wake_queue; /* 0x8 0x8 */ struct netdev_queue *tx_queue; /* 0x10 0x8 */ u32 (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x18 0x8 */ struct mlx4_en_rx_ring *recycle_ring; /* 0x20 0x8 */ /* XXX 24 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ u32 prod; /* 0x40 0x4 */ unsigned int tx_dropped; /* 0x44 0x4 */ long unsigned int bytes; /* 0x48 0x8 */ long unsigned int packets; /* 0x50 0x8 */ long unsigned int tx_csum; /* 0x58 0x8 */ long unsigned int tso_packets; /* 0x60 0x8 */ long unsigned int xmit_more; /* 0x68 0x8 */ struct mlx4_bf bf; /* 0x70 0x18 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __be32 doorbell_qpn; /* 0x88 0x4 */ __be32 mr_key; /* 0x8c 0x4 */ u32 size; /* 0x90 0x4 */ u32 size_mask; /* 0x94 0x4 */ u32 full_size; /* 0x98 0x4 */ u32 buf_size; /* 0x9c 0x4 */ void *buf; /* 0xa0 0x8 */ struct mlx4_en_tx_info *tx_info; /* 0xa8 0x8 */ int qpn; /* 0xb0 0x4 */ u8 queue_index; /* 0xb4 0x1 */ bool bf_enabled; /* 0xb5 0x1 */ bool bf_alloced; /* 0xb6 0x1 */ u8 hwtstamp_tx_type; /* 0xb7 0x1 */ u8 *bounce_buf; /* 0xb8 0x8 */ /* --- cacheline 3 boundary (192 bytes) --- */ long unsigned int queue_stopped; /* 0xc0 0x8 */ struct mlx4_hwq_resources sp_wqres; /* 0xc8 0x58 */ /* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */ struct mlx4_qp sp_qp; /* 0x120 0x30 */ /* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */ struct mlx4_qp_context sp_context; /* 0x150 0xf8 */ /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */ cpumask_t sp_affinity_mask; /* 0x248 0x20 */ enum mlx4_qp_state sp_qp_state; /* 0x268 0x4 */ u16 sp_stride; /* 0x26c 0x2 */ u16 sp_cqn; /* 0x26e 0x2 */ /* size: 640, cachelines: 10, members: 36 */ /* sum members: 600, holes: 1, sum holes: 24 */ /* padding: 16 */ }; Instead of this silly placement: struct mlx4_en_tx_ring { u32 last_nr_txbb; /* 0 0x4 */ u32 cons; /* 0x4 0x4 */ long unsigned int wake_queue; /* 0x8 0x8 */ /* XXX 48 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ u32 prod; /* 0x40 0x4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int bytes; /* 0x48 0x8 */ long unsigned int packets; /* 0x50 0x8 */ long unsigned int tx_csum; /* 0x58 0x8 */ long unsigned int tso_packets; /* 0x60 0x8 */ long unsigned int xmit_more; /* 0x68 0x8 */ unsigned int tx_dropped; /* 0x70 0x4 */ /* XXX 4 bytes hole, try to pack */ struct mlx4_bf bf; /* 0x78 0x18 */ /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */ long unsigned int queue_stopped; /* 0x90 0x8 */ cpumask_t affinity_mask; /* 0x98 0x10 */ struct mlx4_qp qp; /* 0xa8 0x30 */ /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */ struct mlx4_hwq_resources wqres; /* 0xd8 0x58 */ /* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */ u32 size; /* 0x130 0x4 */ u32 size_mask; /* 0x134 0x4 */ u16 stride; /* 0x138 0x2 */ /* XXX 2 bytes hole, try to pack */ u32 full_size; /* 0x13c 0x4 */ /* --- cacheline 5 boundary (320 bytes) --- */ u16 cqn; /* 0x140 0x2 */ /* XXX 2 bytes hole, try to pack */ u32 buf_size; /* 0x144 0x4 */ __be32 doorbell_qpn; /* 0x148 0x4 */ __be32 mr_key; /* 0x14c 0x4 */ void *buf; /* 0x150 0x8 */ struct mlx4_en_tx_info *tx_info; /* 0x158 0x8 */ struct mlx4_en_rx_ring *recycle_ring; /* 0x160 0x8 */ u32 (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168 0x8 */ u8 *bounce_buf; /* 0x170 0x8 */ struct mlx4_qp_context context; /* 0x178 0xf8 */ /* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */ int qpn; /* 0x270 0x4 */ enum mlx4_qp_state qp_state; /* 0x274 0x4 */ u8 queue_index; /* 0x278 0x1 */ bool bf_enabled; /* 0x279 0x1 */ bool bf_alloced; /* 0x27a 0x1 */ /* XXX 5 bytes hole, try to pack */ /* --- cacheline 10 boundary (640 bytes) --- */ struct netdev_queue *tx_queue; /* 0x280 0x8 */ int hwtstamp_tx_type; /* 0x288 0x4 */ /* size: 704, cachelines: 11, members: 36 */ /* sum members: 587, holes: 6, sum holes: 65 */ /* padding: 52 */ }; Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-24 16:03:37 -05:00
Tariq Toukan	15fca2c8eb	net/mlx4_en: Add ethtool statistics for XDP cases XDP statistics are reported in ethtool, in total and per ring, as follows: - xdp_drop: the number of packets dropped by xdp. - xdp_tx: the number of packets forwarded by xdp. - xdp_tx_full: the number of times an xdp forward failed due to a full tx xdp ring. In addition, all packets that are dropped/forwarded by XDP are no longer accounted in rx_packets/rx_bytes of the ring, so that they count traffic that is passed to the stack. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:07:11 -04:00
Tariq Toukan	67f8b1dcb9	net/mlx4_en: Refactor the XDP forwarding rings scheme Separately manage the two types of TX rings: regular ones, and XDP. Upon an XDP set, do not borrow regular TX rings and convert them into XDP ones, but allocate new ones, unless we hit the max number of rings. Which means that in systems with smaller #cores we will not consume the current TX rings for XDP, while we are still in the num TX limit. XDP TX rings counters are not shown in ethtool statistics. Instead, XDP counters will be added to the respective RX rings in a downstream patch. This has no performance implications. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:07:11 -04:00
Tariq Toukan	ccc109b8ed	net/mlx4_en: Add TX_XDP for CQ types Support XDP CQ type, and refactor the CQ type enum. Rename the is_tx field to match the change. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:07:11 -04:00
David S. Miller	b20b378d49	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/qlogic/qed/qed_dcbx.c drivers/net/phy/Kconfig All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-12 15:52:44 -07:00
Tariq Toukan	564ed9b187	net/mlx4_en: Fixes for DCBX This patch adds a capability check before enabling DCBX. In addition, it re-organizes the relevant data structures, and fixes a typo in a define. Fixes: `af7d518526` ("net/mlx4_en: Add DCB PFC support through CEE netlink commands") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-11 19:40:26 -07:00
Brenden Blanco	326fe02d1e	net/mlx4_en: protect ring->xdp_prog with rcu_read_lock Depending on the preempt mode, the bpf_prog stored in xdp_prog may be freed despite the use of call_rcu inside bpf_prog_put. The situation is possible when running in PREEMPT_RCU=y mode, for instance, since the rcu callback for destroying the bpf prog can run even during the bh handling in the mlx4 rx path. Several options were considered before this patch was settled on: Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all of the rings are updated with the new program. This approach has the disadvantage that as the number of rings increases, the speed of update will slow down significantly due to napi_synchronize's msleep(1). Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh. The action of the bpf_prog_put_bh would be to then call bpf_prog_put later. Those drivers that consume a bpf prog in a bh context (like mlx4) would then use the bpf_prog_put_bh instead when the ring is up. This has the problem of complexity, in maintaining proper refcnts and rcu lists, and would likely be harder to review. In addition, this approach to freeing must be exclusive with other frees of the bpf prog, for instance a _bh prog must not be referenced from a prog array that is consumed by a non-_bh prog. The placement of rcu_read_lock in this patch is functionally the same as putting an rcu_read_lock in napi_poll. Actually doing so could be a potentially controversial change, but would bring the implementation in line with sk_busy_loop (though of course the nature of those two paths is substantially different), and would also avoid future copy/paste problems with future supporters of XDP. Still, this patch does not take that opinionated option. Testing was done with kernels in either PREEMPT_RCU=y or CONFIG_PREEMPT_VOLUNTARY=y+PREEMPT_RCU=n modes, with neither exhibiting any drawback. 
With PREEMPT_RCU=n, the extra call to rcu_read_lock did not show up in the perf report whatsoever, and with PREEMPT_RCU=y the overhead of rcu_read_lock (according to perf) was the same before/after. In the rx path, rcu_read_lock is eventually called for every packet from netif_receive_skb_internal, so the napi poll call's rcu_read_lock is easily amortized. v2: Remove extra rcu_read_lock in mlx4_en_process_rx_cq body Annotate xdp_prog with __rcu, and convert all usages to rcu_assign or rcu_dereference[_protected] as appropriate. Add explicit mutex lock around rcu_assign instead of xchg loop. Fixes: `d576acf0a2` ("net/mlx4_en: add page recycle to prepare rx ring for tx support") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 13:39:33 -07:00
David S. Miller	de0ba9a0d8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-24 00:53:32 -04:00
Brenden Blanco	9ecc2d8617	net/mlx4_en: add xdp forwarding and data write support A user will now be able to loop packets back out of the same port using a bpf program attached to xdp hook. Updates to the packet contents from the bpf program is also supported. For the packet write feature to work, the rx buffers are now mapped as bidirectional when the page is allocated. This occurs only when the xdp hook is active. When the program returns a TX action, enqueue the packet directly to a dedicated tx ring, so as to avoid completely any locking. This requires the tx ring to be allocated 1:1 for each rx ring, as well as the tx completion running in the same softirq. Upon tx completion, this dedicated tx ring recycles pages without unmapping directly back to the original rx ring. In steady state tx/drop workload, effectively 0 page allocs/frees will occur. In order to separate out the paths between free and recycle, a free_tx_desc func pointer is introduced that is optionally updated whenever recycle_ring is activated. By default the original free function is always initialized. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 21:46:33 -07:00
Brenden Blanco	d576acf0a2	net/mlx4_en: add page recycle to prepare rx ring for tx support The mlx4 driver by default allocates order-3 pages for the ring to consume in multiple fragments. When the device has an xdp program, this behavior will prevent tx actions since the page must be re-mapped in TODEVICE mode, which cannot be done if the page is still shared. Start by making the allocator configurable based on whether xdp is running, such that order-0 pages are always used and never shared. Since this will stress the page allocator, add a simple page cache to each rx ring. Pages in the cache are left dma-mapped, and in drop-only stress tests the page allocator is eliminated from the perf report. Note that setting an xdp program will now require the rings to be reconfigured. Before: 26.91% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq 17.88% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags 6.00% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag 4.49% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist 3.21% swapper [kernel.vmlinux] [k] intel_idle 2.73% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem 2.57% swapper [mlx4_en] [k] mlx4_en_process_rx_cq After: 31.72% swapper [kernel.vmlinux] [k] intel_idle 8.79% swapper [mlx4_en] [k] mlx4_en_process_rx_cq 7.54% swapper [kernel.vmlinux] [k] poll_idle 6.36% swapper [mlx4_core] [k] mlx4_eq_int 4.21% swapper [kernel.vmlinux] [k] tasklet_action 4.03% swapper [kernel.vmlinux] [k] cpuidle_enter_state 3.43% swapper [mlx4_en] [k] mlx4_en_prepare_rx_desc 2.18% swapper [kernel.vmlinux] [k] native_irq_return_iret 1.37% swapper [kernel.vmlinux] [k] menu_select 1.09% swapper [kernel.vmlinux] [k] bpf_map_lookup_elem Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 21:46:32 -07:00
Brenden Blanco	47a38e1550	net/mlx4_en: add support for fast rx drop bpf program Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver. In tc/socket bpf programs, helpers linearize skb fragments as needed when the program touches the packet data. However, in the pursuit of speed, XDP programs will not be allowed to use these slower functions, especially if it involves allocating an skb. Therefore, disallow MTU settings that would produce a multi-fragment packet that XDP programs would fail to access. Future enhancements could be done to increase the allowable MTU. The xdp program is present as a per-ring data structure, but as of yet it is not possible to set at that granularity through any ndo. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 21:46:32 -07:00
Eugenia Emantayev	ec25bc04ed	net/mlx4_en: Add resilience in low memory systems This patch fixes the loss of the Ethernet port on low-memory systems, when the driver frees its resources and fails to allocate new resources. The issue could happen while changing the number of channels, the ring size, or the timestamp configuration. This fix is necessary because of removing vmap use in the code. When vmap was in use, the driver could allocate non-contiguous memory and make it contiguous with vmap. Now it could fail to allocate a large chunk of contiguous memory and lose the port. Current code tries to allocate new resources and then upon success frees the old resources. Fixes: `73898db043` ('net/mlx4: Avoid wrong virtual mappings') Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-19 16:44:11 -07:00
Rana Shahout	af7d518526	net/mlx4_en: Add DCB PFC support through CEE netlink commands This patch adds support for reading and updating priority flow control (PFC) attributes in the driver via netlink. Signed-off-by: Rana Shahout <ranas@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-23 15:18:50 -04:00
Alexander Duyck	a831274a13	mlx4_en: Replace ndo_add/del_vxlan_port with ndo_add/del_udp_enc_port This change replaces the network device operations for adding or removing a VXLAN port with operations that are more generically defined to be used for any UDP offload port but provide a type. As such, by just adding a line to verify that the offload type is VXLAN we can maintain the same functionality. In addition I updated the socket address family check so that instead of excluding IPv6 we instead abort if the type is not IPv4. This makes much more sense as we should only be supporting IPv4 outer addresses on this hardware. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-17 20:23:31 -07:00
Eric Dumazet	f73a6f439f	net/mlx4_en: get rid of private net_device_stats We simply can use the standard net_device stats. We do not need to clear fields that are already 0. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-25 22:15:50 -07:00
Eric Dumazet	9ed17db17f	net/mlx4_en: get rid of ret_stats mlx4 uses a private struct net_device_stats in a vain attempt to avoid races. This is buggy because multiple cpus could call mlx4_en_get_stats() at the same time, so ret_stats can not guarantee stable results. To fix this, we need to switch to ndo_get_stats64() as this method provides per-thread storage. This allows to reduce mlx4_en_priv bloat. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-25 22:15:50 -07:00
Eric Dumazet	63a664b7e9	net/mlx4_en: fix tx_dropped bug 1) mlx4_en_xmit() can increment priv->stats.tx_dropped, but this variable is overwritten in mlx4_en_DUMP_ETH_STATS(). 2) This increment was not SMP safe, as a port might have many TX queues. Add a per TX ring tx_dropped to fix these issues. This is u32 as mlx4_en_DUMP_ETH_STATS() will add a 32bit field. So let's avoid bugs with SNMP agents having to cope with partial overwraps. (One of these agents being bond_fold_stats()) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-25 22:15:49 -07:00
Haggai Abramovsky	73898db043	net/mlx4: Avoid wrong virtual mappings The dma_alloc_coherent() function returns a virtual address which can be used for coherent access to the underlying memory. On some architectures, like arm64, undefined behavior results if this memory is also accessed via virtual mappings that are not coherent. Because of their undefined nature, operations like virt_to_page() return garbage when passed virtual addresses obtained from dma_alloc_coherent(). Any subsequent mappings via vmap() of the garbage page values are unusable and result in bad things like bus errors (synchronous aborts in ARM64 speak). The mlx4 driver contains code that does the equivalent of: vmap(virt_to_page(dma_alloc_coherent)); this results in an OOPs when the device is opened. Prevent the Ethernet driver from running this problematic code by forcing it to allocate contiguous memory. As for the Infiniband driver, we first try to allocate contiguous memory, and on failure fall back to working with fragmented memory. Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reported-by: David Daney <david.daney@cavium.com> Tested-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-05 23:23:05 -04:00
Eran Ben Elisha	d21ed3a311	net/mlx4_en: Split SW RX dropped counter per RX ring Count SW packet drops per RX ring instead of a global counter. This will allow monitoring the number of rx drops per ring. In addition, SW rx_dropped counter was overwritten by HW rx_dropped counter, sum both of them instead to show the accurate value. Fixes: `a3333b35da` ('net/mlx4_en: Moderate ethtool callback to [...] ') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reported-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-21 15:02:40 -04:00
David Decotigny	3d8f7cc78d	net: mlx4: use new ETHTOOL_G/SSETTINGS API Signed-off-by: David Decotigny <decot@googlers.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-25 22:06:47 -05:00
Eric Dumazet	868fdb0606	mlx4: remove mlx4_en_low_latency_recv() Busy polling can now be handled in generic NAPI poll infrastructure. This removes complexity and fast path overhead : mlx4 used two spin_lock()/spin_unlock() pair per napi->poll() call in mlx4_en_cq_lock_napi()/mlx4_en_cq_unlock_napi() Tested: Without busy polling : lpaa23:~# echo 0 >/proc/sys/net/core/busy_read lpaa24:~# echo 0 >/proc/sys/net/core/busy_read lpaa23:~# ./netperf -H lpaa24 -t TCP_RR MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 16384 87380 1 1 10.00 47330.78 With busy polling : lpaa23:~# echo 70 >/proc/sys/net/core/busy_read lpaa24:~# echo 70 >/proc/sys/net/core/busy_read lpaa23:~# ./netperf -H lpaa24 -t TCP_RR MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 16384 87380 1 1 10.00 97643.55 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-11-18 16:17:40 -05:00
Eric Dumazet	5865316c9d	mlx4: mlx4_en_low_latency_recv() called with BH disabled mlx4_en_low_latency_recv() is called with BH disabled, as other ndo_busy_poll() methods. No need for spin_lock_bh()/spin_unlock_bh() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-11-18 16:17:38 -05:00
Maor Gottlieb	74194fb9c8	net/mlx4_en: Implement mcast loopback prevention for ETH qps Set the mcast loopback prevention bit in the QPC for ETH MLX QPs (not RSS QPs), when the firmware supports this feature. In addition, all rx ring QPs need to be updated in order not to enforce loopback checks. This prevents getting packets we sent both from the network stack and the HCA. Loopback prevention is done by comparing the counter indices of the sent and receiving QPs. If they're equal, packets aren't loopback-ed. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-10-21 23:16:47 -04:00
Hadar Hen Zion	e38af4faf0	net/mlx4_en: Add support for hardware accelerated 802.1ad vlan To enable device support in accelerated 802.1ad vlan, the port capability "packet has vlan enable" (phv_en) should be set. Firmware won't work properly, in case phv_en is not set. The user can enable "phv_en" port capability with the new ethtool private flag phv-bit. The phv-bit private flag default value is OFF, users who are interested in 802.1ad hardware acceleration should turn ON the phv-bit private flag: $ ethtool --set-priv-flags eth1 phv-bit on Once the private flag is set, the device is ready for 802.1ad vlan acceleration. The user should also change the interface device features and turn on "tx-vlan-stag-hw-insert" which is off by default: $ ethtool -K eth1 tx-vlan-stag-hw-insert on "phv-bit" private flag setting is available only for Physical Functions(PF), the Virtual Function (VF) will be able to use the feature by setting "tx-vlan-stag-hw-insert" ethtool device feature only if the feature was enabled by the Hypervisor. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-27 15:00:37 -07:00
Ido Shamay	488a9b48e3	net/mlx4_en: Wake TX queues only when there's enough room Indication of a single completed packet, marked by txbbs_skipped being bigger than zero, is not enough to wake up a stopped TX queue. The completed packet may contain a single TXBB, while the next packet to be sent (after the wake up) may have multiple TXBBs (LSO/TSO packets for example), causing overflow in the queue followed by WQE corruption and TX queue timeout. Instead, wake the stopped queue only when there's enough room for the worst case (maximum sized WQE) packet that we should need to handle after the queue is opened again. Also created a helper routine - mlx4_en_is_tx_ring_full, which checks if the current TX ring is full or not. It provides better code readability and removes code duplication. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-25 02:06:27 -07:00
Eran Ben Elisha	0eb08514fd	net/mlx4_en: Release TX QP when destroying TX ring TX ring QP wasn't released at mlx4_en_destroy_tx_ring. Instead, the code used the deprecated base_tx_qpn field. Move TX QP release to mlx4_en_destroy_tx_ring and remove the base_tx_qpn field. Fixes: `ddae0349fd` ('net/mlx4: Change QP allocation scheme') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-25 02:06:26 -07:00
Eran Ben Elisha	b42de4d012	net/mlx4_en: Show PF own statistics via ethtool Allow the user to observe the PF own statistics using ethtool with pf_ prefixed counter names. Those counters are the PF statistics out of the overall port statistics. Every PF QP is attached to a counter and the summary of those counters is the PF statistics. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-15 17:23:02 -07:00
Eran Ben Elisha	6de5f7f6a1	net/mlx4_core: Allocate default counter per port Default counter per port will be allocated at the mlx4 core driver load. Every QP opened by the Ethernet driver will be attached to the port's default counter. This is an infrastructure step to collect VF statistics from the PF. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-15 17:23:02 -07:00
Matan Barak	c66fa19c40	net/mlx4: Add EQ pool Previously, mlx4_en allocated EQs and used them exclusively. This affected RoCE performance, as applications which are events sensitive were limited to use only the legacy EQs. Change that by introducing an EQ pool. This pool is managed by mlx4_core. EQs are assigned to ports (when there are limited number of EQs, multiple ports could be assigned to the same EQs). An exception to this rule is the ASYNC EQ which handles various events. Legacy EQs are completely removed as all EQs could be shared. When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for EQ serving on a specific port. The core driver calculates which EQ should be assigned to that request. Because IRQs are shared between IB and Ethernet modules, their names only include the PCI device BDF address. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:35:34 -07:00
Ido Shamay	07841f9d94	net/mlx4_en: Schedule napi when RX buffers allocation fails When the system is out of memory, refilling of RX buffers fails while the driver continues to pass the received packets to the kernel stack. At some point, when all RX buffers deplete, the driver may fall into a sleep, and not recover when memory for new RX buffers is once again available. This is because hardware does not have valid descriptors, so no interrupt will be generated for the driver to return to work in napi context. Fix it by scheduling the napi poll function from the stats_task delayed workqueue, as long as the allocations fail. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-30 16:47:50 -04:00
Ido Shamay	51af33cfed	net/mlx4_en: Add interface identify support Add support for the interface ethtool identify feature. Make the physical port LED blink with green and yellow colors. The device handles the LED blink by itself (synchronous use of set_phys_id), by returning 0 to the ETHTOOL_ID_ACTIVE command. Signed-off-by: Eyal Grossman <eyalgr@mellanox.com> Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-02 16:25:03 -04:00
Matan Barak	0b131561a7	net/mlx4_en: Add Flow control statistics display via ethtool Flow control per priority and Global pause counters are now visible via ethtool. The counters show statistics regarding pauses in the device. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 16:36:51 -04:00
Eran Ben Elisha	3da8a36cc5	net/mlx4_en: Protect access to the statistics bitmap This will allow parallel access to the statistics bitmap. A pre-step for adding PFC counters, where the statistics bitmap can be dynamically changed when modifying the PFC setting. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 16:36:50 -04:00
Eran Ben Elisha	6fcd27354b	net/mlx4_en: Support general selective view of ethtool statistics The driver uses a bitmask to indicate which statistics should be displayed to the user in ethtool. The bitmask is u64, therefore we are limited for a selective view of up to 64 statistics. Extend the bitmap in order to show more than 64 statistics. In addition, add packet statistics to the ethtool display for PF. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 16:36:50 -04:00
Eran Ben Elisha	ffa88f37ff	net/mlx4_en: Move statistics bitmap setting to the Ethernet driver The statistics bitmap belongs to the Ethernet driver, move it there. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 16:36:50 -04:00
Eran Ben Elisha	b4b6e842fc	net/mlx4_en: Create new header file for all statistics info Add mlx4_stats.h file and move all statistics structs and macros there. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 16:36:50 -04:00
David S. Miller	0fa74a4be4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 18:51:09 -04:00
Eran Ben Elisha	a16f356570	net/mlx4_en: Fix off-by-one in ethtool statistics display NUM_PORT_STATS was 9 instead of 10, which caused off-by-one bug when displaying the statistics starting from tx_chksum_offload in ethtool. Fixes: `f8c6455bb0` ('net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 15:17:11 -04:00
Shani Michaeli	708b869bf5	net/mlx4_en: Add QCN parameters and statistics handling Implement the IEEE DCB handlers for set/get QCN parameters and statistics reading per TC. Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 21:50:02 -05:00
Moni Shoua	5da0354726	net/mlx4_en: Port aggregation configuration Capture NETDEV events generated by the bonding driver and based on that make decisions of how to configure port aggregation in the mlx4 core driver. This includes setting the V2P port table and re-creating the interested interfaces in bonded/non-bonded mode. Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-04 16:14:25 -08:00
Eugenia Emantayev	ddae0349fd	net/mlx4: Change QP allocation scheme When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset. The current Ethernet driver code reserves a Tx QP range with 256b alignment. This is wrong because if there are more than 64 Tx QPs in use, QPNs >= base + 65 will have bits 6/7 set. This problem is not specific for the Ethernet driver, any entity that tries to reserve more than 64 BF-enabled QPs should fail. Also, using ranges is not necessary here and is wasteful. The new mechanism introduced here will support reservation for "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs (when hypervisors support WC in VMs). The flow we use is: 1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation, and request "BF enabled QPs" if BF is supported for the function 2. In the ALLOC_RES FW command, change param1 to: a. param1[23:0] - number of QPs b. param1[31-24] - flags controlling QPs reservation Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits 6 and 7 unset in order to be used in Ethernet. Bits 24-30 of the flags are currently reserved. When a function tries to allocate a QP, it states the required attributes for this QP. Those attributes are considered "best-effort". If an attribute, such as Ethernet BF enabled QP, is a must-have attribute, the function has to check that attribute is supported before trying to do the allocation. In a lower layer of the code, mlx4_qp_reserve_range masks out the bits which are unsupported. If SRIOV is used, the PF validates those attributes and masks out unsupported attributes as well. In order to notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's mailbox is filled by the PF, which notifies which QP allocation attributes it supports. 
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-11 14:47:35 -05:00
Eyal Perry	947cbb0ac2	net/mlx4_en: Support for configurable RSS hash function The ConnectX HW is capable of using one of the following hash functions: Toeplitz and an XOR hash function. This patch extends the implementation of the mlx4_en driver set/get_rxfh callbacks to support getting and setting the RSS hash function used by the device. Signed-off-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-08 21:07:10 -05:00
Eric Dumazet	bd635c354d	mlx4: fix mlx4_en_set_rxfh() mlx4_en_set_rxfh() can crash if no RSS indir table is provided. While we are at it, allow RSS key to be changed with ethtool -X Tested: myhost:~# cat /proc/sys/net/core/netdev_rss_key b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e:c5:5d:4d:8c:22:67:30:ab:8a:6e:c3:6a myhost:~# ethtool -x eth0 RX flow hash indirection table for eth0 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 RSS hash key: b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e myhost:~# ethtool -X eth0 hkey \ 03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f myhost:~# ethtool -x eth0 RX flow hash indirection table for eth0 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 RSS hash key: 03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f Reported-by: Ben Hutchings <ben@decadent.org.uk> Fixes: `b9d1ab7eb4` ("mlx4: use netdev_rss_key_fill() helper") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-23 13:49:12 -05:00
Shani Michaeli	f8c6455bb0	net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE When processing received traffic, pass CHECKSUM_COMPLETE status to the stack, with calculated checksum for non TCP/UDP packets (such as GRE or ICMP). Although the stack expects checksum which doesn't include the pseudo header, the HW adds it. To address that, we are subtracting the pseudo header checksum from the checksum value provided by the HW. In the IPv6 case, we also compute/add the IP header checksum which is not added by the HW for such packets. Cc: Jerry Chu <hkchu@google.com> Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-11 13:20:02 -05:00
Ido Shamay	5f6e980080	net/mlx4_en: Remove RX buffers alignment to IP_ALIGN When IP_ALIGN has a non zero value, hardware will write to a non aligned address. The only reader from this address is when copying the header from the first frag into the linear buffer (further access to the IP address will be from the linear buffer, in which the headers are aligned). Since the penalty of non align access by the hardware is greater than the software memcpy, changing the frag_align to always be 0. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-03 12:28:13 -05:00
Saeed Mahameed	7787fa661b	net/mlx4_en: Add support for setting rxvlan offload OFF/ON Rename mlx4_en_timestamp_config to mlx4_en_reset_config and extend it to support choosing RX vlan offload configuration. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-28 17:18:01 -04:00
Saeed Mahameed	2c76267943	net/mlx4_en: Use PTYS register to query ethtool settings - If dev cap MLX4_DEV_CAP_FLAG2_ETH_PROT_CTRL is ON, query PTYS register to fill ethtool settings. else use default values. - Use autoneg port cap and dev backplane autoneg cap to report autoneg interface capabilities. - Fix typo in mlx4_en_port_state struct field (transciver to transceiver). Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-28 17:18:00 -04:00
Eric Dumazet	1556b8746e	net/mlx4_en: Use the new tx_copybreak to set inline threshold Instead of setting inline threshold using module parameter only on driver load, use set_tunable() to set it dynamically. No need to store the threshold per ring, using instead the netdev global priv->prof->inline_thold Initial value still is set using the module parameter, therefore backward compatibility is kept. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:16 -04:00
Eric Dumazet	3d03641cb4	net/mlx4_en: Avoid a cache line miss in TX completion for single frag skb's Add frag0_dma/frag0_byte_count into mlx4_en_tx_info to avoid a cache line miss in TX completion for frames having one dma element. (We avoid reading back the tx descriptor) Note this could be extended to 2/3 dma elements later, as we have free room in mlx4_en_tx_info Also, mlx4_en_free_tx_desc() no longer accesses skb_shinfo(). We use a new nr_maps fields in mlx4_en_tx_info to avoid 2 or 3 cache misses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:15 -04:00
Eric Dumazet	6a4e81211f	net/mlx4_en: Avoid calling bswap in tx fast path - doorbell_qpn is stored in the cpu_to_be32() way to avoid bswap() in fast path. - mdev->mr.key stored in ring->mr_key to also avoid bswap() and access to cold cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-06 01:04:15 -04:00


142 Commits
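
Commit ddae0349fd in the list above describes two rules: a QPN is usable with Blue-Flame only when bits 6 and 7 are unset, and the ALLOC_RES param1 word carries the QP count in bits 23:0 with bit 31 flagging "Eth BF eligible" QPs. The sketch below illustrates just that bit arithmetic. All names in it are hypothetical and chosen for illustration; they are not the driver's identifiers, and this is not the driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants, derived only from the commit message text. */
#define BF_QPN_FORBIDDEN_BITS 0xC0u        /* bits 6 and 7 of the QPN   */
#define ALLOC_COUNT_MASK      0x00FFFFFFu  /* param1[23:0]: number of QPs */
#define ALLOC_FLAG_ETH_BF     (1u << 31)   /* param1 bit 31: Eth BF QPs  */

/* A QPN may be used with Blue-Flame only if bits 6 and 7 are both clear. */
static bool qpn_bf_eligible(uint32_t qpn)
{
    return (qpn & BF_QPN_FORBIDDEN_BITS) == 0;
}

/* Pack the QP count and the optional BF request flag into one 32-bit word.
 * Bits 24-30 are left at zero, since the commit message calls them reserved. */
static uint32_t pack_alloc_param1(uint32_t count, bool want_eth_bf)
{
    uint32_t param1 = count & ALLOC_COUNT_MASK;

    if (want_eth_bf)
        param1 |= ALLOC_FLAG_ETH_BF;
    return param1;
}
```

With a 256-aligned base such as 0x100, `qpn_bf_eligible()` accepts offsets 0 through 63 and rejects offset 64, which is why a contiguous range cannot supply more than 64 BF-capable Tx QPs.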