linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2025-01-27 14:30:44 +07:00

Author	SHA1	Message	Date
Russell King	7397005545	sfp: add SFP module support Add support for SFP hotpluggable modules via sfp-bus and phylink. This supports both copper and optical SFP modules, which require different Serdes modes in order to properly negotiate the link. Optical SFP modules typically require the Serdes link to be talking 1000BaseX mode - this is the gigabit ethernet mode defined by the 802.3 standard. Copper SFP modules typically integrate a PHY in the module to convert from Serdes to copper, and the PHY will be configured by the vendor to either present a 1000BaseX Serdes link (for fixed 1000BaseT) or a SGMII Serdes link. However, this is vendor defined, so we instead detect the PHY, switch the link to SGMII mode, and use traditional PHY based negotiation. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	da7c1862f0	phylink: add in-band autonegotiation support for 10GBase-KR mode. Add in-band autonegotation support for 10GBase-KR mode. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	ecbd87b843	phylink: add support for MII ioctl access to Clause 45 PHYs Add support for reading and writing the clause 45 MII registers. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	770a1ad557	phylink: add module EEPROM support Add support for reading module EEPROMs through phylink. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	ce0aa27ff3	sfp: add sfp-bus to bridge between network devices and sfp cages Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	9525ae8395	phylink: add phylink infrastructure The link between the ethernet MAC and its PHY has become more complex as the interface evolves. This is especially true with serdes links, where the part of the PHY is effectively integrated into the MAC. Serdes links can be connected to a variety of devices, including SFF modules soldered down onto the board with the MAC, a SFP cage with a hotpluggable SFP module which may contain a PHY or directly modulate the serdes signals onto optical media with or without a PHY, or even a classical PHY connection. Moreover, the negotiation information on serdes links comes in two varieties - SGMII mode, where the PHY provides its speed/duplex/flow control information to the MAC, and 1000base-X mode where both ends exchange their abilities and each resolve the link capabilities. This means we need a more flexible means to support these arrangements, particularly with the hotpluggable nature of SFP, where the PHY can be attached or detached after the network device has been brought up. Ethtool information can come from multiple sources: - we may have a PHY operating in either SGMII or 1000base-X mode, in which case we take ethtool/mii data directly from the PHY. - we may have a optical SFP module without a PHY, with the MAC operating in 1000base-X mode - the ethtool/mii data needs to come from the MAC. - we may have a copper SFP module with a PHY whic can't be accessed, which means we need to take ethtool/mii data from the MAC. Phylink aims to solve this by providing an intermediary between the MAC and PHY, providing a safe way for PHYs to be hotplugged, and allowing a SFP driver to reconfigure the serdes connection. Phylink also takes over support of fixed link connections, where the speed/duplex/flow control are fixed, but link status may be controlled by a GPIO signal. By avoiding the fixed-phy implementation, phylink can provide a faster response to link events: fixed-phy has to wait for phylib to operate its state machine, which can take several seconds. In comparison, phylink takes milliseconds. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> - remove sync status - rework supported and advertisment handling - add 1000base-x speed for fixed links - use functionality exported from phy-core, reworking __phylink_ethtool_ksettings_set for it Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:29 -07:00
Russell King	453d00defb	net: phy: add I2C mdio bus Add an I2C MDIO bus bridge library, to allow phylib to access PHYs which are connected to an I2C bus instead of the more conventional MDIO bus. Such PHYs can be found in SFP adapters and SFF modules. Since PHYs appear at I2C bus address 0x40..0x5f, and 0x50/0x51 are reserved for SFP EEPROMs/diagnostics, we must not allow the MDIO bus to access these I2C addresses. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	5e5758d9d8	net: phy: export phy_start_machine() for phylink phylink will need phy_start_machine exported, so lets export it as a GPL symbol. Documentation/networking/phy.txt indicates that this should be a PHY API function. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	a81497bee7	net: phy: provide a hook for link up/link down events Sometimes, we need to do additional work between the PHY coming up and marking the carrier present - for example, we may need to wait for the PHY to MAC link to finish negotiation. This changes phylib to provide a notification function pointer which avoids the built-in netif_carrier_on() and netif_carrier_off() functions. Standard ->adjust_link functionality is provided by hooking a helper into the new ->phy_link_change method. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	1f3645bb41	net: phy: add 1000Base-X to phy settings table Add the missing 1000Base-X entry to the phy settings table. This was not included because the original code could not cope with more than 32 bits of link mode mask. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	0ccb4fc65d	net: phy: move phy_lookup_setting() and guts of phy_supported_speeds() to phy-core phy_lookup_setting() provides useful functionality in ethtool code outside phylib. Move it to phy-core and allow it to be re-used (eg, in phylink) rather than duplicated elsewhere. Note that this supports the larger linkmode space. As we move the phy settings table, we also need to move the guts of phy_supported_speeds() as well. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	da4625ac26	net: phy: split out PHY speed and duplex string generation Other code would like to make use of this, so make the speed and duplex string generation visible, and place it in a separate file. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
Russell King	c3ecbe757c	net: phy: allow settings table to support more than 32 link modes Allow the phy settings table to support more than 32 link modes by switching to the ethtool link mode bit number representation, rather than storing the mask. This will allow phylink and other ethtool code to share the settings table to look up settings. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:55:28 -07:00
David S. Miller	21e27f2d86	Merge branch 'IP-cleanup-LSRR-option-processing' Paolo Abeni says: ==================== IP: cleanup LSRR option processing The __ip_options_echo() function expect a valid dst entry in skb->dst; as result we sometimes need to preserve the dst entry for the whole IP RX path. The current usage of skb->dst looks more a relic from ancient past that a real functional constraint. This patchset tries to remove such usage, and than drops some hacks currently in place in the IP code to keep skb->dst around. __ip_options_echo() uses of skb->dst for two different purposes: retrieving the netns assicated with the skb, and modify the ingress packet LSRR address list. The first patch removes the code modifying the ingress packet, and the second one provides an explicit netns argument to __ip_options_echo(). The following patches cleanup the current code keeping arund skb->dst for __ip_options_echo's sake. Updating the __ip_options_echo() function has been previously discussed here: http://marc.info/?l=linux-netdev&m=150064533516348&w=2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:51:12 -07:00
Paolo Abeni	3bdefdf9d9	udp: no need to preserve skb->dst __ip_options_echo() does not need anymore skb->dst, so we can avoid explicitly preserving it for its own sake. This is almost a revert of commit `0ddf3fb2c4` ("udp: preserve skb->dst if required for IP options processing") plus some lifting to fit later changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:51:12 -07:00
Paolo Abeni	61a1030bad	Revert "ipv4: keep skb->dst around in presence of IP options" ip_options_echo() does not use anymore the skb->dst and don't need to keep the dst around for options's sake only. This reverts commit `34b2cef20f`. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:51:12 -07:00
Paolo Abeni	91ed1e666a	ip/options: explicitly provide net ns to __ip_options_echo() __ip_options_echo() uses the current network namespace, and currently retrives it via skb->dst->dev. This commit adds an explicit 'net' argument to __ip_options_echo() and update all the call sites to provide it, usually via a simpler sock_net(). After this change, __ip_options_echo() no more needs to access skb->dst and we can drop a couple of hack to preserve such info in the rx path. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:51:12 -07:00
Paolo Abeni	a1e155ece1	IP: do not modify ingress packet IP option in ip_options_echo() While computing the response option set for LSRR, ip_options_echo() also changes the ingress packet LSRR addresses list, setting the last one to the dst specific address for the ingress packet - via memset(start[ ... The only visible effect of such change - beyond possibly damaging shared/cloned skbs - is modifying the data carried by ICMP replies changing the header information for reported the ingress packet, which violates RFC1122 3.2.2.6. All the others call sites just ignore the ingress packet IP options after calling ip_options_echo() Note that the last element in the LSRR option address list for the reply packet will be properly set later in the ip output path via ip_options_build(). This buggy memset() predates git history and apparently was present into the initial ip_options_echo() implementation in linux 1.3.30 but still looks wrong. The removal of the fib_compute_spec_dst() call will help completely dropping the skb->dst usage by __ip_options_echo() with a later patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-06 20:51:11 -07:00
Pavel Belous	a54df682e5	aquantia: Switch to use napi_gro_receive Add support for GRO (generic receive offload) for aQuantia Atlantic driver. This results in a perfomance improvement when GRO is enabled. Signed-off-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 20:57:13 -07:00
John Fastabend	56ce097c1c	net: comment fixes against BPF devmap helper calls Update BPF comments to accurately reflect XDP usage. Fixes: `97f91a7cf0` ("bpf: add bpf_redirect_map helper routine") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:29:03 -07:00
David S. Miller	8f75222446	Merge branch 'net-sched-summer-cleanup-part-1-mainly-in-exts-area' Jiri Pirko says: ==================== net: sched: summer cleanup part 1, mainly in exts area This patchset is one of the couple cleanup patchsets I have in queue. The motivation aside the obvious need to "make things nicer" is also to prepare for shared filter blocks introduction. That requires tp->q removal, and therefore removal of all tp->q users. Patch 1 is just some small thing I spotted on the way Patch 2 removes one user of tp->q, namely tcf_em_tree_change Patches 3-8 do preparations for exts->nr_actions removal Patches 9-10 do simple renames of functions in cls* Patches 11-19 remove unnecessary calls of tcf_exts_change helper The last patch changes tcf_exts_change to don't take lock Tested by tools/testing/selftests/tc-testing v1->v2: - removed conversion of action array to list as noted by Cong - added the past patch instead - small rebases of patches 11-19 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:25 -07:00
Jiri Pirko	9b0d4446b5	net: sched: avoid atomic swap in tcf_exts_change tcf_exts_change is always called on newly created exts, which are not used on fastpath. Therefore, simple struct copy is enough. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	705c709126	net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct As the n struct was allocated right before u32_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	8c98d571bb	net: sched: cls_route: no need to call tcf_exts_change for newly allocated struct As the f struct was allocated right before route4_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	c09fc2e11e	net: sched: cls_flow: no need to call tcf_exts_change for newly allocated struct As the fnew struct just was allocated, so no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	8cc6251381	net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated struct As the new struct just was allocated, so no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	6839da326d	net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated struct As the prog struct was allocated right before cls_bpf_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	ff1f8ca080	net: sched: cls_basic: no need to call tcf_exts_change for newly allocated struct As the f struct was allocated right before basic_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	a74cb36980	net: sched: cls_matchall: no need to call tcf_exts_change for newly allocated struct As the head struct was allocated right before mall_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	94611bff6e	net: sched: cls_fw: no need to call tcf_exts_change for newly allocated struct As the f struct was allocated right before fw_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	455075292b	net: sched: cls_flower: no need to call tcf_exts_change for newly allocated struct As the f struct was allocated right before fl_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	1e5003af37	net: sched: cls_fw: rename fw_change_attrs function Since the function name is misleading since it is not changing anything, name it similarly to other cls. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	6a725c481d	net: sched: cls_bpf: rename cls_bpf_modify_existing function The name cls_bpf_modify_existing is highly misleading, as it indeed does not modify anything existing. It does not modify at all. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	978dfd8d14	net: sched: use tcf_exts_has_actions instead of exts->nr_actions For check in tcf_exts_dump use tcf_exts_has_actions helper instead of exts->nr_actions for checking if there are any actions present. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:24 -07:00
Jiri Pirko	ec1a9cca0e	net: sched: remove check for number of actions in tcf_exts_exec Leave it to tcf_action_exec to return TC_ACT_OK in case there is no action present. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	af089e701a	net: sched: fix return value of tcf_exts_exec Return the defined TC_ACT_OK instead of 0. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	6fc6d06e53	net: sched: remove redundant helpers tcf_exts_is_predicative and tcf_exts_is_available These two helpers are doing the same as tcf_exts_has_actions, so remove them and use tcf_exts_has_actions instead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	af69afc551	net: sched: use tcf_exts_has_actions in tcf_exts_exec Use the tcf_exts_has_actions helper instead or directly testing exts->nr_actions in tcf_exts_exec. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	3bcc0cec81	net: sched: change names of action number helpers to be aligned with the rest The rest of the helpers are named tcf_exts_*, so change the name of the action number helpers to be aligned. While at it, change to inline functions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	4ebc1e3cfc	net: sched: remove unneeded tcf_em_tree_change Since tcf_em_tree_validate could be always called on a newly created filter, there is no need for this change function. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Jiri Pirko	f7ebdff757	net: sched: sch_atm: use Qdisc_class_common structure Even if it is only for classid now, use this common struct a be aligned with the rest of the classful qdiscs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:21:23 -07:00
Lin Yun Sheng	967b2e2a76	net: hns: Fix for __udivdi3 compiler error This patch fixes the __udivdi3 undefined error reported by test robot. Fixes: `b8c17f7088` ("net: hns: Add self-adaptive interrupt coalesce support in hns driver") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 11:06:59 -07:00
Dan Carpenter	5987feb38a	net: phy: marvell: logical vs bitwise OR typo This was supposed to be a bitwise OR but there is a \|\| vs \| typo. Fixes: `864dc729d5` ("net: phy: marvell: Refactor m88e1121 RGMII delay configuration") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-04 10:55:54 -07:00
David S. Miller	35615994c1	Merge branch 'socket-sendmsg-zerocopy' Willem de Bruijn says: ==================== socket sendmsg MSG_ZEROCOPY Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg. Implement the feature for TCP initially, as large writes benefit most. On a send call with MSG_ZEROCOPY, the kernel pins user pages and links these directly into the skbuff frags[] array. Each send call with MSG_ZEROCOPY that transmits data will eventually queue a completion notification on the error queue: a per-socket u32 incremented on each such call. A request may have to revert to copy to succeed, for instance when a device cannot support scatter-gather IO. In that case a flag is passed along to notify that the operation succeeded without zerocopy optimization. The implementation extends the existing zerocopy infra for tuntap, vhost and xen with features needed for TCP, notably reference counting to handle cloning on retransmit and GSO. For more details, see also the netdev 2.1 paper and presentation at https://netdevconf.org/2.1/session.html?debruijn Changelog: v3 -> v4: - dropped UDP, RAW and PF_PACKET for now Without loopback support, datagrams are usually smaller than the ~8KB size threshold needed to benefit from zerocopy. - style: a few reverse chrismas tree - minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols - minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch - minor: rebased on top of net-next with kmap_atomic fix v2 -> v3: - fix rebase conflict: SO_ZEROCOPY 59 -> 60 v1 -> v2: - fix (kbuild-bot): do not remove uarg until patch 5 - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function - fix: remove unused extern in header file RFCv2 -> v1: - patch 2 - review comment: in skb_copy_ubufs, always allocate order-0 page, also when replacing compound source pages. - patch 3 - fix: always queue completion notification on MSG_ZEROCOPY, also if revert to copy. - fix: on syscall abort, correctly revert notification state - minor: skip queue notification on SOCK_DEAD - minor: replace BUG_ON with WARN_ON in recoverable error - patch 4 - new: add socket option SOCK_ZEROCOPY. only honor MSG_ZEROCOPY if set, ignore for legacy apps. - patch 5 - fix: clear zerocopy state on skb_linearize - patch 6 - fix: only coalesce if prev errqueue elem is zerocopy - minor: try coalescing with list tail instead of head - minor: merge bytelen limit patch - patch 7 - new: signal when data had to be copied - patch 8 (tcp) - optimize: avoid setting PSH bit when exceeding max frags. that limits GRO on the client. do not goto new_segment. - fix: fail on MSG_ZEROCOPY \| MSG_FASTOPEN - minor: do not wait for memory: does not work for optmem - minor: simplify alloc - patch 9 (udp) - new: add PF_INET6 - fix: attach zerocopy notification even if revert to copy - minor: simplify alloc size arithmetic - patch 10 (raw hdrinc) - new: add PF_INET6 - patch 11 (pf_packet) - minor: simplify slightly - patch 12 - new msg_zerocopy regression test: use veth pair to test all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork all relevant ethtool settings: rx off, sg off all relevant packet lengths: 0, <MAX_HEADER, max size RFC -> RFCv2: - review comment: do not loop skb with zerocopy frags onto rx: add skb_orphan_frags_rx to orphan even refcounted frags call this in __netif_receive_skb_core, deliver_skb and tun: same as commit `1080e512d4` ("net: orphan frags on receive") - fix: hold an explicit sk reference on each notification skb. previously relied on the reference (or wmem) held by the data skb that would trigger notification, but this breaks on skb_orphan. - fix: when aborting a send, do not inc the zerocopy counter this caused gaps in the notification chain - fix: in packet with SOCK_DGRAM, pull ll headers before calling zerocopy_sg_from_iter - fix: if sock_zerocopy_realloc does not allow coalescing, do not fail, just allocate a new ubuf - fix: in tcp, check return value of second allocation attempt - chg: allocate notification skbs from optmem to avoid affecting tcp write queue accounting (TSQ) - chg: limit #locked pages (ulimit) per user instead of per process - chg: grow notification ids from 16 to 32 bit - pass range [lo, hi] through 32 bit fields ee_info and ee_data - chg: rebased to davem-net-next on top of v4.10-rc7 - add: limit notification coalescing sharing ubufs limits overhead, but delays notification until the last packet is released, possibly unbounded. Add a cap. - tests: add snd_zerocopy_lo pf_packet test - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug) Limitations / Known Issues: - TCP may build slightly smaller than max TSO packets due to exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned. - All SKBTX_SHARED_FRAG may require additional __skb_linearize or skb_copy_ubufs calls in u32, skb_find_text, similar to skb_checksum_help. Notification skbuffs are allocated from optmem. For sockets that cannot effectively coalesce notifications, the optmem max may need to be increased to avoid hitting -ENOBUFS: sysctl -w net.core.optmem_max=1048576 In application load, copy avoidance shows a roughly 5% systemwide reduction in cycles when streaming large flows and a 4-8% reduction in wall clock time on early tensorflow test workloads. For the single-machine veth tests to succeed, loopback support has to be temporarily enabled by making skb_orphan_frags_rx map to skb_orphan_frags. * Performance The below table shows cycles reported by perf for a netperf process sending a single 10 Gbps TCP_STREAM. The first three columns show Mcycles spent in the netperf process context. The second three columns show time spent systemwide (-a -C A,B) on the two cpus that run the process and interrupt handler. Reported is the median of at least 3 runs. std is a standard netperf, zc uses zerocopy and % is the ratio. Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs are disabled and the kernel is booted with idle=halt. NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size perf stat -e cycles $NETPERF perf stat -C 2,3 -a -e cycles $NETPERF --process cycles-- ----cpu cycles---- std zc % std zc % 4K 27,609 11,217 41 49,217 39,175 79 16K 21,370 3,823 18 43,540 29,213 67 64K 20,557 2,312 11 42,189 26,910 64 256K 21,110 2,134 10 43,006 27,104 63 1M 20,987 1,610 8 42,759 25,931 61 Perf record indicates the main source of these differences. Process cycles only at 1M writes (perf record; perf report -n): std: Samples: 42K of event 'cycles', Event count (approx.): 21258597313 79.41% 33884 netperf [kernel.kallsyms] [k] copy_user_generic_string 3.27% 1396 netperf [kernel.kallsyms] [k] tcp_sendmsg 1.66% 694 netperf [kernel.kallsyms] [k] get_page_from_freelist 0.79% 325 netperf [kernel.kallsyms] [k] tcp_ack 0.43% 188 netperf [kernel.kallsyms] [k] __alloc_skb zc: Samples: 1K of event 'cycles', Event count (approx.): 1439509124 30.36% 584 netperf.zerocop [kernel.kallsyms] [k] gup_pte_range 14.63% 284 netperf.zerocop [kernel.kallsyms] [k] __zerocopy_sg_from_iter 8.03% 159 netperf.zerocop [kernel.kallsyms] [k] skb_zerocopy_add_frags_iter 4.84% 96 netperf.zerocop [kernel.kallsyms] [k] __alloc_skb 3.10% 60 netperf.zerocop [kernel.kallsyms] [k] kmem_cache_alloc_node * Safety The number of pages that can be pinned on behalf of a user with MSG_ZEROCOPY is bound by the locked memory ulimit. While the kernel holds process memory pinned, a process cannot safely reuse those pages for other purposes. Packets looped onto the receive stack and queued to a socket can be held indefinitely. Avoid unbounded notification latency by restricting user pages to egress paths only. skb_orphan_frags_rx() will create a private copy of pages even for refcounted packets when these are looped, as did skb_orphan_frags for the original tun zerocopy implementation. Pages are not remapped read-only. Processes can modify packet contents while packets are in flight in the kernel path. Bytes on which kernel control flow depends (headers) are copied to avoid TOCTTOU attacks. Datapath integrity does not otherwise depend on payload, with three exceptions: checksums, optional sk_filter/tc u32/.. and device + driver logic. The effect of wrong checksums is limited to the misbehaving process. TC filters that access contents may have to be excluded by adding an skb_orphan_frags_rx. Processes can also safely avoid OOM conditions by bounding the number of bytes passed with MSG_ZEROCOPY and by removing shared pages after transmission from their own memory map. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	07b65c5b31	test: add msg_zerocopy test Introduce regression test for msg_zerocopy feature. Send traffic from one process to another with and without zerocopy. Evaluate tcp, udp, raw and packet sockets, including variants - udp: corking and corking with mixed copy/zerocopy calls - raw: with and without hdrincl - packet: at both raw and dgram level Test on both ipv4 and ipv6, optionally with ethtool changes to disable scatter-gather, tx checksum or tso offload. All of these can affect zerocopy behavior. The regression test can be run on a single machine if over a veth pair. Then skb_orphan_frags_rx must be modified to be identical to skb_orphan_frags to allow forwarding zerocopy locally. The msg_zerocopy.sh script will setup the veth pair in network namespaces and run all tests. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	f214f915e7	tcp: enable MSG_ZEROCOPY Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are both supported. Only data sent to remote destinations is sent without copying. Packets looped onto a local destination have their payload copied to avoid unbounded latency. Tested: A 10x TCP_STREAM between two hosts showed a reduction in netserver process cycles by up to 70%, depending on packet size. Systemwide, savings are of course much less pronounced, at up to 20% best case. msg_zerocopy.sh 4 tcp: without zerocopy tx=121792 (7600 MB) txc=0 zc=n rx=60458 (7600 MB) with zerocopy tx=286257 (17863 MB) txc=286257 zc=y rx=140022 (17863 MB) This test opens a pair of sockets over veth, one one calls send with 64KB and optionally MSG_ZEROCOPY and on the other reads the initial bytes. The receiver truncates, so this is strictly an upper bound on what is achievable. It is more representative of sending data out of a physical NIC (when payload is not touched, either). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	a91dbff551	sock: ulimit on MSG_ZEROCOPY pages Bound the number of pages that a user may pin. Follow the lead of perf tools to maintain a per-user bound on memory locked pages commit `789f90fcf6` ("perf_counter: per user mlock gift") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	4ab6c99d99	sock: MSG_ZEROCOPY notification coalescing In the simple case, each sendmsg() call generates data and eventually a zerocopy ready notification N, where N indicates the Nth successful invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket. TCP and corked sockets can cause send() calls to append new data to an existing sk_buff and, thus, ubuf_info. In that case the notification must hold a range. odify ubuf_info to store a inclusive range [N..N+m] and add skb_zerocopy_realloc() to optionally extend an existing range. Also coalesce notifications in this common case: if a notification [1, 1] is about to be queued while [0, 0] is the queue tail, just modify the head of the queue to read [0, 1]. Coalescing is limited to a few TSO frames worth of data to bound notification latency. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	1f8b977ab3	sock: enable MSG_ZEROCOPY Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with skb_zerocopy_clone() wherever needed due to skb split, merge, resize or clone. Split skb_orphan_frags into two variants. The split, merge, .. paths support reference counted zerocopy buffers, so do not do a deep copy. Add skb_orphan_frags_rx for paths that may loop packets to receive sockets. That is not allowed, as it may cause unbounded latency. Deep copy all zerocopy copy buffers, ref-counted or not, in this path. The exact locations to modify were chosen by exhaustively searching through all code that might modify skb_frag references and/or the the SKBTX_DEV_ZEROCOPY tx_flags bit. The changes err on the safe side, in two ways. (1) legacy ubuf_info paths virtio and tap are not modified. They keep a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags still call skb_copy_ubufs and thus copy frags in this case. (2) not all copies deep in the stack are addressed yet. skb_shift, skb_split and skb_try_coalesce can be refined to avoid copying. These are not in the hot path and this patch is hairy enough as is, so that is left for future refinement. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:30 -07:00
Willem de Bruijn	76851d1212	sock: add SOCK_ZEROCOPY sockopt The send call ignores unknown flags. Legacy applications may already unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a socket opts in to zerocopy. Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing. Processes can also query this socket option to detect kernel support for the feature. Older kernels will return ENOPROTOOPT. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-03 21:37:29 -07:00

1 2 3 4 5 ...

692789 Commits