linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-24 00:10:10 +07:00

Author	SHA1	Message	Date
Huazhong Tan	6d4fab3953	net: hns3: Reset net device with rtnl_lock Since current locking was not covering certain code where netdev was being accessed or manipulated, this patch fixes it. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 11:16:44 -07:00
Huazhong Tan	1b3725781a	net: hns3: Modify the order of initializing command queue register According to hardware's description, the head pointer register should be written before the tail pointer register while doing command queue initialization. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 11:16:44 -07:00
David S. Miller	aea06eb276	Merge branch 'TLS-offload-rx-netdev-and-mlx5' Boris Pismenny says: ==================== TLS offload rx, netdev & mlx5 The following series provides TLS RX inline crypto offload. v5->v4: - Remove the Kconfig to mutually exclude both IPsec and TLS v4->v3: - Remove the iov revert for zero copy send flow v2->v3: - Fix typo - Adjust cover letter - Fix bug in zero copy flows - Use network byte order for the record number in resync - Adjust the sequence provided in resync v1->v2: - Fix bisectability problems due to variable name changes - Fix potential uninitialized return value This series completes the generic infrastructure to offload TLS crypto to a network devices. It enables the kernel TLS socket to skip decryption and authentication operations for SKBs marked as decrypted on the receive side of the data path. Leaving those computationally expensive operations to the NIC. This infrastructure doesn't require a TCP offload engine. Instead, the NIC decrypts a packet's payload if the packet contains the expected TCP sequence number. The TLS record authentication tag remains unmodified regardless of decryption. If the packet is decrypted successfully and it contains an authentication tag, then the authentication check has passed. Otherwise, if the authentication fails, then the packet is provided unmodified and the KTLS layer is responsible for handling it. Out-Of-Order TCP packets are provided unmodified. As a result, in the slow path some of the SKBs are decrypted while others remain as ciphertext. The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs. At the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be reencrypted, only to be decrypted. The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows: 1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering. 2. Resynchronization - tls_read_size calls the device driver to resynchronize HW whenever it lost track of the TLS record framing in the TCP stream. The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC identifies packets that should be offloaded according to the 5-tuple and the TCP sequence number. If these match and the packet is decrypted and authenticated successfully, then a syndrome is provided to software. Otherwise, the packet is unmodified. Decrypted and non-decrypted packets aren't coalesced by the network stack, and the KTLS layer decrypts and authenticates partially decrypted records. The NIC provides an indication whenever a resync is required. The resync operation is triggered by the KTLS layer while parsing TLS record headers. Finally, we measure the performance obtained by running single stream iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound) and KTLS-Offload running both in Tx and Rx. The results show that the performance of offload is comparable to TCP. \| Bandwidth (Gbps) \| CPU Tx (%) \| CPU rx (%) TCP \| 28.8 \| 5 \| 12 KTLS-Offload-Tx-Rx \| 28.6 \| 7 \| 14 Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:40 -07:00
Boris Pismenny	b3ccf97813	net/mlx5e: IPsec, fix byte count in CQE This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:28 -07:00
Boris Pismenny	10e71acca2	net/mlx5: Accel, add common metadata functions This patch adds common functions to handle mellanox metadata headers. These functions are used by IPsec and TLS to process FPGA metadata. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	790af90c00	net/mlx5e: TLS, build TLS netdev from capabilities This patch enables TLS Rx based on available HW capabilities. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	afd3baaa93	net/mlx5e: TLS, add software statistics This patch adds software statistics for TLS to count important events. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	00aebab27c	net/mlx5e: TLS, add Innova TLS rx data path Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. Special metadata ethertype is used to pass information to the hardware. When hardware loses synchronization a special resync request metadata message is used to request resync. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	ca942c78f3	net/mlx5e: TLS, add innova rx support Add the mlx5 implementation of the TLS Rx routines to add/del TLS contexts, also add the tls_dev_resync_rx routine to work with the TLS inline Rx crypto offload infrastructure. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	ab412e1dd7	net/mlx5: Accel, add TLS rx offload routines In Innova TLS, TLS contexts are added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Complete the implementation for Innova TLS (FPGA-based) hardware by adding support for rx inline crypto offload. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	0aadb2fc09	net/mlx5e: TLS, refactor variable names For symmetry, we rename mlx5e_tls_offload_context to mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	4718799817	tls: Fix zerocopy_from_iter iov handling zerocopy_from_iter iterates over the message, but it doesn't revert the updates made by the iov iteration. This patch fixes it. Now, the iov can be used after calling zerocopy_from_iter. Fixes: `3c4d75591` ("tls: kernel TLS support") Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	4799ac81e5	tls: Add rx inline crypto offload This patch completes the generic infrastructure to offload TLS crypto to a network device. It enables the kernel to skip decryption and authentication of some skbs marked as decrypted by the NIC. In the fast path, all packets received are decrypted by the NIC and the performance is comparable to plain TCP. This infrastructure doesn't require a TCP offload engine. Instead, the NIC only decrypts packets that contain the expected TCP sequence number. Out-Of-Order TCP packets are provided unmodified. As a result, at the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be reencrypted, only to be decrypted. The notable differences between SW KTLS Rx and this offload are as follows: 1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering. 2. Resynchronization - tls_read_size calls the device driver to resynchronize HW after HW lost track of TLS record framing in the TCP stream. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	b190a587c6	tls: Fill software context without allocation This patch allows tls_set_sw_offload to fill the context in case it was already allocated previously. We will use it in TLS_DEVICE to fill the RX software context. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	39f56e1a78	tls: Split tls_sw_release_resources_rx This patch splits tls_sw_release_resources_rx into two functions one which releases all inner software tls structures and another that also frees the containing structure. In TLS_DEVICE we will need to release the software structures without freeeing the containing structure, which contains other information. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:11 -07:00
Boris Pismenny	dafb67f3bb	tls: Split decrypt_skb to two functions Previously, decrypt_skb also updated the TLS context. Now, decrypt_skb only decrypts the payload using the current context, while decrypt_skb_update also updates the state. Later, in the tls_device Rx flow, we will use decrypt_skb directly. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:13:10 -07:00
Boris Pismenny	d80a1b9d18	tls: Refactor tls_offload variable names For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:12:09 -07:00
Boris Pismenny	41ed9c04aa	tcp: Don't coalesce decrypted and encrypted SKBs Prevent coalescing of decrypted and encrypted SKBs in GRO and TCP layer. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:12:09 -07:00
Boris Pismenny	16e4edc297	net: Add TLS rx resync NDO Add new netdev tls op for resynchronizing HW tls context Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:12:09 -07:00
Ilya Lesokhin	14136564c8	net: Add TLS RX offload feature This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:12:09 -07:00
Boris Pismenny	784abe24c9	net: Add decrypted field to skb The decrypted bit is propogated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:12:09 -07:00
David S. Miller	cc98419a57	Merge branch 'mvpp2-add-debugfs-interface' Maxime Chevallier says: ==================== net: mvpp2: add debugfs interface The PPv2 Header Parser and Classifier are not straightforward to debug, having easy access to some of the many lookup tables configuration is helpful during development and debug. This series adds a basic debugfs interface, allowing to read data from the Header Parser and some of the Classifier tables. For now, the interface is read-only, and contains only some basic info. This was actually used during RSS development, and might be useful to troubleshoot some issues we might find. The first patch of the series converts the mvpp2 files to SPDX, which eases adding the new debugfs dedicated file. The second patch adds the interface, and exposes basic Header Parser data. The 3rd patch adds a hit counter for the Header Parser TCAM. The 4th patch exposes classifier info. The 5th patch adds some hit counters for some of the classifier engines. Changes since V1: - Rebased on the lastest net-next - Made cls_flow_get non static so that it can be used in mvpp2_debugfs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:01 -07:00
Maxime Chevallier	f9d30d5bd5	net: mvpp2: debugfs: add classifier hit counters The classification operations that are used for RSS make use of several lookup tables. Having hit counters for these tables is really helpful to determine what flows were matched by ingress traffic, and see the path of packets among all the classifier tables. This commit adds hit counters for the 3 tables used at the moment : - The decoding table (also called lookup_id table), that links flows identified by the Header Parser to the flow table. There's one entry per flow, located at : .../mvpp2/<controller>/flows/XX/dec_hits Note that there are 21 flows in the decoding table, whereas there are 52 flows in the Header Parser. That's because there are several kind of traffic that will match a given flow. Reading the hit counter from one sub-flow will clear all hit counter that have the same flow_id. This also applies to the flow_hits. - The flow table, that contains all the different lookups to be performed by the classifier for each packet of a given flow. The match is done on the first entry of the flow sequence. - The C2 engine entries, that are used to assign the default rx queue, and enable or disable RSS for a given port. There's one entry per flow, located at: .../mvpp2/<controller>/flows/XX/flow_hits There is one C2 entry per port, so the c2 hit counter is located at : .../mvpp2/<controller>/ethX/c2_hits All hit counter values are 16-bits clear-on-read values. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:01 -07:00
Maxime Chevallier	dba1d918da	net: mvpp2: debugfs: add entries for classifier flows The classifier configuration for RSS is quite complex, with several lookup tables being used. This commit adds useful info in debugfs to see how the different tables are configured : Added 2 new entries in the per-port directory : - .../eth0/default_rxq : The default rx queue on that port - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry Added the 'flows' directory : It contains one entry per sub-flow. a 'sub-flow' is a unique path from Header Parser to the flow table. Multiple sub-flows can point to the same 'flow' (each flow has an id from 8 to 29, which is its index in the Lookup Id table) : - .../flows/00/... /01/... ... /51/id : The flow id. There are 21 unique flows. There's one flow per combination of the following parameters : - L4 protocol (TCP, UDP, none) - L3 protocol (IPv4, IPv6) - L3 parameters (Fragmented or not) - L2 parameters (Vlan tag presence or not) .../type : The flow type. This is an even higher level flow, that we manipulate with ethtool. It can be : "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other". .../eth0/... .../eth1/engine : The hash generation engine used for this flow on the given port .../hash_opts : The hash generation options indicating on what data we base the hash (vlan tag, src IP, src port, etc.) Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:01 -07:00
Maxime Chevallier	1203341cc9	net: mvpp2: debugfs: add hit counter stats for Header Parser entries One helpful feature to help debug the Header Parser TCAM filter in PPv2 is to be able to see if the entries did match something when a packet comes in. This can be done by using the built-in hit counter for TCAM entries. This commit implements reading the counter, and exposing its value on debugfs for each filter entry. The counter is a 16-bits clear-on-read value, located at: .../mvpp2/<controller>/parser/XXX/hits Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:01 -07:00
Maxime Chevallier	21da57a231	net: mvpp2: add a debugfs interface for the Header Parser Marvell PPv2 Packer Header Parser has a TCAM based filter, that is not trivial to configure and debug. Being able to dump TCAM entries from userspace can be really helpful to help development of new features and debug existing ones. This commit adds a basic debugfs interface for the PPv2 driver, focusing on TCAM related features. <mnt>/mvpp2/ --- f2000000.ethernet \- f4000000.ethernet --- parser --- 000 ... \| \- 001 \| \- ... \| \- 255 --- ai \| \- header_data \| \- lookup_id \| \- sram \| \- valid \- eth1 ... \- eth2 --- mac_filter \- parser_entries \- vid_filter There's one directory per PPv2 instance, named after pdev->name to make sure names are uniques. In each of these directories, there's : - one directory per interface on the controller, each containing : - "mac_filter", which lists all filtered addresses for this port (based on TCAM, not on the kernel's uc / mc lists) - "parser_entries", which lists the indices of all valid TCAM entries that have this port in their port map - "vid_filter", which lists the vids allowed on this port, based on TCAM - one "parser" directory (the parser is common to all ports), containing : - one directory per TCAM entry (256 of them, from 0 to 255), each containing : - "ai" : Contains the 1 byte Additional Info field from TCAM, and - "header_data" : Contains the 8 bytes Header Data extracted from the packet - "lookup_id" : Contains the 4 bits LU_ID - "sram" : contains the raw SRAM data, which is the result of the TCAM lookup. This readonly at the moment. - "valid" : Indicates if the entry is valid of not. All entries are read-only, and everything is output in hex form. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:00 -07:00
Antoine Tenart	f1e37e3101	net: mvpp2: switch to SPDX identifiers Use the appropriate SPDX license identifiers and drop the license text. This patch is only cosmetic. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 00:10:00 -07:00
David S. Miller	2aa4a3378a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 18:47:44 -07:00
Daniel Borkmann	13f7432bdd	Merge branch 'bpf-tcp-listen-cb' Andrey Ignatov says: ==================== This patchset adds TCP-BPF callback for listening sockets. Patch 0001 provides more details and is the main patch in the set. Patch 0006 adds selftest for the new callback. Other patches are bug fixes and improvements in TCP-BPF selftest to make it easier to extend in 0006. ==================== Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:09:22 +02:00
Andrey Ignatov	78d8e26d46	selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB callback can be called on future state transition, and when such a transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and verify it in user space later. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
Andrey Ignatov	2044e4ef0b	selftests/bpf: Better verification in test_tcpbpf Reduce amount of copy/paste for debug info when result is verified in the test and keep that info together with values being checked so that they won't get out of sync. It also improves debug experience: instead of checking manually what doesn't match in debug output for all fields, only unexpected field is printed. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
Andrey Ignatov	c65267e5ff	selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers Switch to cgroup_helpers to simplify the code and fix cgroup cleanup: before cgroup was not cleaned up after the test. It also removes SYSTEM macro, that only printed error, but didn't terminate the test. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
Andrey Ignatov	04c1341115	selftests/bpf: Fix const'ness in cgroup_helpers Lack of const in cgroup helpers signatures forces to write ugly client code. Fix it. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
Andrey Ignatov	060a7fccd3	bpf: Sync bpf.h to tools/ Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
Andrey Ignatov	f333ee0cdb	bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-15 00:08:41 +02:00
David S. Miller	f5c64e566c	Merge branch 'mlxsw-VRRP' Ido Schimmel says: ==================== mlxsw: Add VRRP support When a router that is acting as the default gateway of a host stops functioning, the host will encounter packet loss until the router starts functioning again. To increase the reliability of the default gateway without performing reconfiguration on the host, a host can use a Virtual Router Redundancy Protocol (VRRP) Router. This virtual router is composed from several routers where only one is actually forwarding packets from the host (the master router) while the other routers act as backup routers. The election of the master router is determined by the VRRP protocol [1]. Packets addressed to the virtual router are always sent to the virtual router MAC address (IPv4: 00-00-5E-00-01-XX, IPv6: 00-00-5E-00-02-XX). Such packets can only be accepted by the master router and must be discarded by the backup routers. In Linux, VRRP is usually implemented by configuring a macvlan with the virtual router MAC on top of the router interface that is connected to the host / LAN. The macvlan on the master router is assigned the virtual IP (VIP) that the host uses as its gateway. In order to support VRRP in mlxsw, we first need to enable macvlan upper devices on top of mlxsw netdevs and their uppers. This is done by the first patch, which also takes care of sanitizing macvlan configurations that are not currently supported by the driver. The second patch directs packets with destination MAC addresses as the macvlans to the router so that they will undergo an L3 lookup. This is consistent with the kernel's behavior where the macvlan's Rx handler will re-inject such packets to the Rx path so that they will be picked up by the IPvX protocol handlers and undergo an L3 lookup. Note that the driver prevents the macvlans from being enslaved to other devices, to ensure the packets will be picked up by the protocol handler and not by another Rx handler. The third patch adds packet traps for VRRP control packets for both IPv4 and IPv6. Finally, the last patch optimizes the reception of VRRP MACs by potentially skipping one L2 lookup for them. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:23:26 -07:00
Ido Schimmel	c3a495409a	mlxsw: spectrum_router: Optimize processing of VRRP MACs Hosts using a VRRP router send their packets with a destination MAC of the VRRP router which is of the following form [1]: IPv4 - 00-00-5E-00-01-{VRID} IPv6 - 00-00-5E-00-02-{VRID} Where VRID is the ID of the virtual router. Such packets are directed to the router block in the ASIC by an FDB entry that was added in the previous patch. However, in certain cases it is possible to skip this FDB lookup and send such packets directly to the router. This is accomplished by adding these special MAC addresses to the RIF cache. If the cache is hit, the packet will skip the L2 lookup and ingress the router with the RIF specified in the cache entry. 1. https://tools.ietf.org/html/rfc5798#section-7.3 Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:23:26 -07:00
Ido Schimmel	11566d34f8	mlxsw: spectrum: Add VRRP traps Virtual Router Redundancy Protocol packets are used to communicate the state of the Master router associated with the virtual router ID (VRID). These are link-local multicast packets sent with IP protocol 112 that are trapped in the router block in the ASIC. Add a trap for these packets and mark the trapped packets to prevent them from potentially being re-flooded by the bridge driver. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:23:26 -07:00
Ido Schimmel	2db9937804	mlxsw: spectrum_router: Direct macvlans' MACs to router An IP packet received on a netdev with a macvlan upper whose MAC matches the packet's destination MAC will be re-injected to the Rx path as if it was received by the macvlan, and perform an L3 lookup. Reflect this functionality to the ASIC by programming FDB entries that will direct MACs of macvlan uppers to the router. In a similar fashion to router interfaces (RIFs) that are programmed upon the addition of the first IP address on an interface and destroyed upon the removal of the last IP address, the FDB entries for the macvlan are added and destroyed based on the addition of the first and removal of the last IP address on the macvlan. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:23:26 -07:00
Ido Schimmel	c55161852f	mlxsw: spectrum: Enable macvlan upper devices In order to allow more unicast MAC addresses (e.g., VRRP virtual MAC) to be directed to the router we need to enable macvlan uppers on top of mlxsw netdevs. Allow macvlan upper devices on top of mlxsw netdevs and sanitize configurations that can't work. For example, a macvlan can't be enslaved to a bridge as without ACLs the device doesn't take the destination MAC into account when classifying a packet to a bridge instance (i.e., a FID). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:23:25 -07:00
Yafang Shao	ff0432e5a8	tcp: remove redundant rcv_nxt update tcp_rcv_nxt_update() is already executed in tcp_data_queue(). This line is redundant. See bellow, tcp_queue_rcv tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq); tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 11:21:40 -07:00
Okash Khawaja	2d3feca8c4	bpf: btf: print map dump and lookup with btf info This patch augments the output of bpftool's map dump and map lookup commands to print data along side btf info, if the correspondin btf info is available. The outputs for each of map dump and map lookup commands are augmented in two ways: 1. when neither of -j and -p are supplied, btf-ful map data is printed whose aim is human readability. This means no commitments for json- or backward- compatibility. 2. when either -j or -p are supplied, a new json object named "formatted" is added for each key-value pair. This object contains the same data as the key-value pair, but with btf info. "formatted" object promises json- and backward- compatibility. Below is a sample output. $ bpftool map dump -p id 8 [{ "key": ["0x0f","0x00","0x00","0x00" ], "value": ["0x03", "0x00", "0x00", "0x00", ... ], "formatted": { "key": 15, "value": { "int_field": 3, ... } } } ] This patch calls btf_dumper introduced in previous patch to accomplish the above. Indeed, btf-ful info is only displayed if btf data for the given map is available. Otherwise existing output is displayed as-is. Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-14 13:00:40 +02:00
Okash Khawaja	b12d6ec097	bpf: btf: add btf print functionality This consumes functionality exported in the previous patch. It does the main job of printing with BTF data. This is used in the following patch to provide a more readable output of a map's dump. It relies on json_writer to do json printing. Below is sample output where map keys are ints and values are of type struct A: typedef int int_type; enum E { E0, E1, }; struct B { int x; int y; }; struct A { int m; unsigned long long n; char o; int p[8]; int q[4][8]; enum E r; void *s; struct B t; const int u; int_type v; unsigned int w1: 3; unsigned int w2: 3; }; $ sudo bpftool map dump id 14 [{ "key": 0, "value": { "m": 1, "n": 2, "o": "c", "p": [15,16,17,18,15,16,17,18 ], "q": [[25,26,27,28,25,26,27,28 ],[35,36,37,38,35,36,37,38 ],[45,46,47,48,45,46,47,48 ],[55,56,57,58,55,56,57,58 ] ], "r": 1, "s": 0x7ffd80531cf8, "t": { "x": 5, "y": 10 }, "u": 100, "v": 20, "w1": 0x7, "w2": 0x3 } } ] This patch uses json's {} and [] to imply struct/union and array. More explicit information can be added later. For example, a command line option can be introduced to print whether a key or value is struct or union, name of a struct etc. This will however come at the expense of duplicating info when, for example, printing an array of structs. enums are printed as ints without their names. Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-14 13:00:40 +02:00
Okash Khawaja	92b57121ca	bpf: btf: export btf types and name by offset from lib This patch introduces btf__resolve_type() function and exports two existing functions from libbpf. btf__resolve_type follows modifier types like const and typedef until it hits a type which actually takes up memory, and then returns it. This function follows similar pattern to btf__resolve_size but instead of computing size, it just returns the type. These functions will be used in the followig patch which parses information inside array of `struct btf_type *`. btf_name_by_offset is used for printing variable names. Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-14 13:00:40 +02:00
Jakub Kicinski	db42a21a1e	tools: include reallocarray feature test in FEATURE_TESTS_BASIC perf propagates its feature check results to libbpf. This means features for which perf probes must be a superset of libbpf's required features. perf depends on FEATURE_TESTS_BASIC for its list of features. commit `531b014e7a` ("tools: bpf: make use of reallocarray") added reallocarray use to libbpf, make perf also perform the reallocarray feature check. Fixes: `531b014e7a` ("tools: bpf: make use of reallocarray") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-14 10:38:48 +02:00
kbuild test robot	9cee8c4375	net: mvpp2: mvpp2_cls_flow_get() can be static Fixes: `f9358e12a0` ("net: mvpp2: split ingress traffic into multiple flows") Signed-off-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-13 20:21:56 -07:00
Linus Walleij	6eb9c9dafd	of: mdio: Support fixed links in of_phy_get_and_connect() By a simple extension of of_phy_get_and_connect() drivers that have a fixed link on e.g. RGMII can support also fixed links, so in addition to: ethernet-port { phy-mode = "rgmii"; phy-handle = <&foo>; }; This setup with a fixed-link node and no phy-handle will now also work just fine: ethernet-port { phy-mode = "rgmii"; fixed-link { speed = <1000>; full-duplex; pause; }; }; This is very helpful for connecting random ethernet ports to e.g. DSA switches that typically reside on fixed links. The phy-mode is still there as the fixes link in this case is still an RGMII link. Tested on the Cortina Gemini driver with the Vitesse DSA router chip on a fixed 1Gbit link. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-13 18:25:14 -07:00
Vlad Buslov	01683a1469	net: sched: refactor flower walk to iterate over idr Extend struct tcf_walker with additional 'cookie' field. It is intended to be used by classifier walk implementations to continue iteration directly from particular filter, instead of iterating 'skip' number of times. Change flower walk implementation to save filter handle in 'cookie'. Each time flower walk is called, it looks up filter with saved handle directly with idr, instead of iterating over filter linked list 'skip' number of times. This change improves complexity of dumping flower classifier from quadratic to linearithmic. (assuming idr lookup has logarithmic complexity) Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-13 18:24:27 -07:00
Jesper Dangaard Brouer	d23b27c02f	samples/bpf: xdp_redirect_cpu handle parsing of double VLAN tagged packets People noticed that the code match on IEEE 802.1ad (ETH_P_8021AD) ethertype, and this implies Q-in-Q or double tagged VLANs. Thus, we better parse the next VLAN header too. It is even marked as a TODO. This is relevant for real world use-cases, as XDP cpumap redirect can be used when the NIC RSS hashing is broken. E.g. the ixgbe driver HW cannot handle double tagged VLAN packets, and places everything into a single RX queue. Using cpumap redirect, users can redistribute traffic across CPUs to solve this, which is faster than the network stacks RPS solution. It is left as an exerise how to distribute the packets across CPUs. It would be convenient to use the RX hash, but that is not _yet_ exposed to XDP programs. For now, users can code their own hash, as I've demonstrated in the Suricata code (where Q-in-Q is handled correctly). Reported-by: Florian Maury <florian.maury-cv@x-cli.eu> Reported-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-14 00:52:54 +02:00
Nikolay Aleksandrov	c921c2077b	net: ipmr: add support for passing full packet on wrong vif This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass full packet and real vif id when the incoming interface is wrong. While the RP and FHR are setting up state we need to be sending the registers encapsulated with all the data inside otherwise we lose it. The RP then decapsulates it and forwards it to the interested parties. Currently with WRONGVIF we can only be sending empty register packets and will lose that data. This behaviour can be enabled by using MRT_PIM with val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from happening, it happens in addition to it, also it is controlled by the same throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently). Both messages are generated to keep backwards compatibily and avoid breaking someone who was enabling MRT_PIM with val == 4, since any positive val is accepted and treated the same. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-13 14:21:16 -07:00

1 2 3 4 5 ...

768049 Commits