linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-08 19:46:38 +07:00

Author	SHA1	Message	Date
David Ahern	5fcd266a9f	net/ipv4: Add support for dumping addresses for a specific device If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg header, then return only the addresses for that device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-22 19:33:29 -07:00
David Ahern	1c98eca412	net/ipv4: Move loop over addresses on a device into in_dev_dump_addr Similar to IPv6 move the logic that walks over the ipv4 address list for a device into a helper. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-22 19:33:29 -07:00
David S. Miller	a19c59cc10	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-10-21 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Implement two new kind of BPF maps, that is, queue and stack map along with new peek, push and pop operations, from Mauricio. 2) Add support for MSG_PEEK flag when redirecting into an ingress psock sk_msg queue, and add a new helper bpf_msg_push_data() for insert data into the message, from John. 3) Allow for BPF programs of type BPF_PROG_TYPE_CGROUP_SKB to use direct packet access for __skb_buff, from Song. 4) Use more lightweight barriers for walking perf ring buffer for libbpf and perf tool as well. Also, various fixes and improvements from verifier side, from Daniel. 5) Add per-symbol visibility for DSO in libbpf and hide by default global symbols such as netlink related functions, from Andrey. 6) Two improvements to nfp's BPF offload to check vNIC capabilities in case prog is shared with multiple vNICs and to protect against mis-initializing atomic counters, from Jakub. 7) Fix for bpftool to use 4 context mode for the nfp disassembler, also from Jakub. 8) Fix a return value comparison in test_libbpf.sh and add several bpftool improvements in bash completion, documentation of bpf fs restrictions and batch mode summary print, from Quentin. 9) Fix a file resource leak in BPF selftest's load_kallsyms() helper, from Peng. 10) Fix an unused variable warning in map_lookup_and_delete_elem(), from Alexei. 11) Fix bpf_skb_adjust_room() signature in BPF UAPI helper doc, from Nicolas. 12) Add missing executables to .gitignore in BPF selftests, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-21 21:11:46 -07:00
David S. Miller	a4efbaf622	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree: 1) Use lockdep_is_held() in ipset_dereference_protected(), from Lance Roy. 2) Remove unused variable in cttimeout, from YueHaibing. 3) Add ttl option for nft_osf, from Fernando Fernandez Mancera. 4) Use xfrm family to deal with IPv6-in-IPv4 packets from nft_xfrm, from Florian Westphal. 5) Simplify xt_osf_match_packet(). 6) Missing ct helper alias definition in snmp_trap helper, from Taehee Yoo. 7) Remove unnecessary parameter in nf_flow_table_cleanup(), from Taehee Yoo. 8) Remove unused variable definitions in nft_{dup,fwd}, from Weongyo Jeong. 9) Remove empty net/netfilter/nfnetlink_log.h file, from Taehee Yoo. 10) Revert xt_quota updates remain option due to problems in the listing path for 32-bit arches, from Maze. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-20 12:32:44 -07:00
David S. Miller	2e2d6f0342	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net net/sched/cls_api.c has overlapping changes to a call to nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL to the 5th argument, and another (from 'net-next') added cb->extack instead of NULL to the 6th argument. net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to code which moved (to mr_table_dump)) in 'net-next'. Thanks to David Ahern for the heads up. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-19 11:03:06 -07:00
Eric Dumazet	79861919b8	tcp: fix TCP_REPAIR xmit queue setup Andrey reported the following warning triggered while running CRIU tests: tcp_clean_rtx_queue() ... last_ackt = tcp_skb_timestamp_us(skb); WARN_ON_ONCE(last_ackt == 0); This is caused by `5f6188a800` ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), as we end up having skbs in retransmit queue with a zero skb->skb_mstamp_ns field. We could fix this bug in different ways, like making sure tp->tcp_wstamp_ns is not zero at socket creation, but as Neal pointed out, we also do not want that pacing status of a repaired socket could push tp->tcp_wstamp_ns far ahead in the future. So we prefer changing tcp_write_xmit() to not call tcp_update_skb_after_send() and instead do what is requested by TCP_REPAIR logic. Fixes: `5f6188a800` ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Vagin <avagin@openvz.org> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-18 16:51:02 -07:00
Nikolay Aleksandrov	eddf016b91	net: ipmr: fix unresolved entry dumps If the skb space ends in an unresolved entry while dumping we'll miss some unresolved entries. The reason is due to zeroing the entry counter between dumping resolved and unresolved mfc entries. We should just keep counting until the whole table is dumped and zero when we move to the next as we have a separate table counter. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: `8fb472c09b` ("ipmr: improve hash scalability") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-17 22:35:42 -07:00
Neal Cardwell	cf33e25c0d	tcp_bbr: centralize code to set gains Centralize the code that sets gains used for computing cwnd and pacing rate. This simplifies the code and makes it easier to change the state machine or (in the future) dynamically change the gain values and ensure that the correct gain values are always used. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-17 22:22:53 -07:00
Neal Cardwell	a87c83d5ee	tcp_bbr: adjust TCP BBR for departure time pacing Adjust TCP BBR for the new departure time pacing model in the recent commit `ab408b6dc7` ("tcp: switch tcp and sch_fq to new earliest departure time model"). With TSQ and pacing at lower layers, there are often several skbs queued in the pacing layer, and thus there is less data "in the network" than "in flight". With departure time pacing at lower layers (e.g. fq or potential future NICs), the data in the pacing layer now has a pre-scheduled ("baked-in") departure time that cannot be changed, even if the congestion control algorithm decides to use a new pacing rate. This means that there can be a non-trivial lag between when BBR makes a pacing rate change and when the inter-skb pacing delays change. After a pacing rate change, the number of packets in the network can gradually evolve to be higher or lower, depending on whether the sending rate is higher or lower than the delivery rate. Thus ignoring this lag can cause significant overshoot, with the flow ending up with too many or too few packets in the network. This commit changes BBR to adapt its pacing rate based on the amount of data in the network that it estimates has already been "baked in" by previous departure time decisions. We estimate the number of our packets that will be in the network at the earliest departure time (EDT) for the next skb scheduled as: in_network_at_edt = inflight_at_edt - (EDT - now) * bw If we're increasing the amount of data in the network ("in_network"), then we want to know if the transmit of the EDT skb will push in_network above the target, so our answer includes bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing in_network, then we want to know if in_network will sink too low just before the EDT transmit, so our answer does not include the segments from the skb departing at EDT. Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case differently? The in_network curve is a step function: in_network goes up on transmits, and down on ACKs. To accurately predict when in_network will go beyond our target value, this will happen on different events, depending on whether we're concerned about in_network potentially going too high or too low: o if pushing in_network up (pacing_gain > 1.0), then in_network goes above target upon a transmit event o if pushing in_network down (pacing_gain < 1.0), then in_network goes below target upon an ACK event This commit changes the BBR state machine to use this estimated "packets in network" value to make its decisions. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-17 22:22:53 -07:00
John Fastabend	02c558b2d5	bpf: sockmap, support for msg_peek in sk_msg with redirect ingress This adds support for the MSG_PEEK flag when doing redirect to ingress and receiving on the sk_msg psock queue. Previously the flag was being ignored which could confuse applications if they expected the flag to work as normal. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-17 02:30:32 +02:00
John Fastabend	3f4c3127d3	bpf: sockmap, fix skmsg recvmsg handler to track size correctly When converting sockmap to new skmsg generic data structures we missed that the recvmsg handler did not correctly use sg.size and instead was using individual elements length. The result is if a sock is closed with outstanding data we omit the call to sk_mem_uncharge() and can get the warning below. [ 66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210 To fix this correct the redirect handler to xfer the size along with the scatterlist and also decrement the size from the recvmsg handler. Now when a sock is closed the remaining 'size' will be decremented with sk_mem_uncharge(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-17 02:29:15 +02:00
Daniel Borkmann	aadd435591	tcp, ulp: remove socket lock assertion on ULP cleanup Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp() where assertion sock_owned_by_me() failed. This happened through inet_csk_prepare_forced_close() first releasing the socket lock, then calling into tcp_done(newsk) which is called after the inet_csk_prepare_forced_close() and therefore without the socket lock held. The sock_owned_by_me() assertion can generally be removed as the only place where tcp_cleanup_ulp() is called from now is out of inet_csk_destroy_sock() -> sk->sk_prot->destroy() where socket is in dead state and unreachable. Therefore, add a comment why the check is not needed instead. Fixes: `8b9088f806` ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 12:38:41 -07:00
Taehee Yoo	95c97998aa	netfilter: nf_nat_snmp_basic: add missing helper alias name In order to upload helper module automatically, helper alias name is needed. so that MODULE_ALIAS_NFCT_HELPER() should be added. And unlike other nat helper modules, the nf_nat_snmp_basic can be used independently. helper name is "snmp_trap" so that alias name will be "nfct-helper-snmp_trap" by MODULE_ALIAS_NFCT_HELPER(snmp_trap) test command: %iptables -t raw -I PREROUTING -p udp -j CT --helper snmp_trap %lsmod \| grep nf_nat_snmp_basic We can see nf_nat_snmp_basic module is uploaded automatically. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-16 10:01:50 +02:00
David Ahern	e4e92fb160	net/ipv4: Bail early if user only wants prefix entries Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the flag is set in the dump request, just return. In the process of this change, move the CLONE check to use the new filter flags. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:14:08 -07:00
David Ahern	effe679266	net: Enable kernel side filtering of route dumps Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:14:07 -07:00
David Ahern	cb167893f4	net: Plumb support for filtering ipv4 and ipv6 multicast route dumps Implement kernel side filtering of routes by egress device index and table id. If the table id is given in the filter, lookup table and call mr_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:13:39 -07:00
David Ahern	e1cedae1ba	ipmr: Refactor mr_rtm_dumproute Move per-table loops from mr_rtm_dumproute to mr_table_dump and export mr_table_dump for dumps by specific table id. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:13:12 -07:00
David Ahern	18a8021a7b	net/ipv4: Plumb support for filtering route dumps Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:13:12 -07:00
David Ahern	4724676d55	net: Add struct for fib dump filter Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-16 00:13:12 -07:00
David S. Miller	e85679511e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-10-16 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable sk_msg BPF integration for the latter, from Daniel and John. 2) Enable BPF syscall side to indicate for maps that they do not support a map lookup operation as opposed to just missing key, from Prashant. 3) Add bpftool map create command which after map creation pins the map into bpf fs for further processing, from Jakub. 4) Add bpftool support for attaching programs to maps allowing sock_map and sock_hash to be used from bpftool, from John. 5) Improve syscall BPF map update/delete path for map-in-map types to wait a RCU grace period for pending references to complete, from Daniel. 6) Couple of follow-up fixes for the BPF socket lookup to get it enabled also when IPv6 is compiled as a module, from Joe. 7) Fix a generic-XDP bug to handle the case when the Ethernet header was mangled and thus update skb's protocol and data, from Jesper. 8) Add a missing BTF header length check between header copies from user space, from Wenwen. 9) Minor fixups in libbpf to use __u32 instead u32 types and include proper perf_event.h uapi header instead of perf internal one, from Yonghong. 10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS to bpftool's build, from Jiri. 11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install with_addr.sh script from flow dissector selftest, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 23:21:07 -07:00
Eric Dumazet	825e1c523d	tcp: cdg: use tcp high resolution clock cache We store in tcp socket a cache of most recent high resolution clock, there is no need to call local_clock() again, since this cache is good enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:42 -07:00
Neal Cardwell	97ec3eb33d	tcp_bbr: fix typo in bbr_pacing_margin_percent There was a typo in this parameter name. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:42 -07:00
Eric Dumazet	864e5c0907	tcp: optimize tcp internal pacing When TCP implements its own pacing (when no fq packet scheduler is used), it is arming high resolution timer after a packet is sent. But in many cases (like TCP_RR kind of workloads), this high resolution timer expires before the application attempts to write the following packet. This overhead also happens when the flow is ACK clocked and cwnd limited instead of being limited by the pacing rate. This leads to extra overhead (high number of IRQ) Now tcp_wstamp_ns is reserved for the pacing timer only (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), we can setup the timer only when a packet is about to be sent, and if tcp_wstamp_ns is in the future. This leads to a ~10% performance increase in TCP_RR workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:42 -07:00
Eric Dumazet	a7a2563064	tcp: mitigate scheduling jitter in EDT pacing model In commit `fefa569a9d` ("net_sched: sch_fq: account for schedule/timers drifts") we added a mitigation for scheduling jitter in fq packet scheduler. This patch does the same in TCP stack, now it is using EDT model. Note that this mitigation is valid for both external (fq packet scheduler) or internal TCP pacing. This uses the same strategy than the above commit, allowing a time credit of half the packet currently sent. Consider following case : An skb is sent, after an idle period of 300 usec. The air-time (skb->len/pacing_rate) is 500 usec Instead of setting the pacing timer to now+500 usec, it will use now+min(500/2, 300) -> now+250usec This is like having a token bucket with a depth of half an skb. Tested: tc qdisc replace dev eth0 root pfifo_fast Before netperf -P0 -H remote -- -q 1000000000 # 8000Mbit 540000 262144 262144 10.00 7710.43 After : netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit 540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:42 -07:00
Eric Dumazet	76a9ebe811	net: extend sk_pacing_rate to unsigned long sk_pacing_rate has beed introduced as a u32 field in 2013, effectively limiting per flow pacing to 34Gbit. We believe it is time to allow TCP to pace high speed flows on 64bit hosts, as we now can reach 100Gbit on one TCP flow. This patch adds no cost for 32bit kernels. The tcpi_pacing_rate and tcpi_max_pacing_rate were already exported as 64bit, so iproute2/ss command require no changes. Unfortunately the SO_MAX_PACING_RATE socket option will stay 32bit and we will need to add a new option to let applications control high pacing rates. State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741 timer:(on,003ms,0) ino:91863 sk:2 <-> skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0) ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448 rcvmss:536 advmss:1448 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177 segs_in:3916318 data_segs_out:177279175 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2) send 28045.5Mbps lastrcv:73333 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps busy:73333ms unacked:135 retrans:0/157 rcv_space:14480 notsent:2085120 minrtt:0.013 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:42 -07:00
Eric Dumazet	5f6188a800	tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh In EDT design, I made the mistake of using tcp_wstamp_ns to store the last tcp_clock_ns() sample and to store the pacing virtual timer. This causes major regressions at high speed flows. Introduce tcp_clock_cache to store last tcp_clock_ns(). This is needed because some arches have slow high-resolution kernel time service. tcp_wstamp_ns is only updated when a packet is sent. Note that we can remove tcp_mstamp in the future since tcp_mstamp is essentially tcp_clock_cache/1000, so the apparent socket size increase is temporary. Fixes: `9799ccb0e9` ("tcp: add tcp_wstamp_ns socket field") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 22:56:41 -07:00
Daniel Borkmann	604326b41a	bpf, sockmap: convert to generic sk_msg interface Add a generic sk_msg layer, and convert current sockmap and later kTLS over to make use of it. While sk_buff handles network packet representation from netdevice up to socket, sk_msg handles data representation from application to socket layer. This means that sk_msg framework spans across ULP users in the kernel, and enables features such as introspection or filtering of data with the help of BPF programs that operate on this data structure. Latter becomes in particular useful for kTLS where data encryption is deferred into the kernel, and as such enabling the kernel to perform L7 introspection and policy based on BPF for TLS connections where the record is being encrypted after BPF has run and came to a verdict. In order to get there, first step is to transform open coding of scatter-gather list handling into a common core framework that subsystems can use. The code itself has been split and refactored into three bigger pieces: i) the generic sk_msg API which deals with managing the scatter gather ring, providing helpers for walking and mangling, transferring application data from user space into it, and preparing it for BPF pre/post-processing, ii) the plain sock map itself where sockets can be attached to or detached from; these bits are independent of i) which can now be used also without sock map, and iii) the integration with plain TCP as one protocol to be used for processing L7 application data (later this could e.g. also be extended to other protocols like UDP). The semantics are the same with the old sock map code and therefore no change of user facing behavior or APIs. While pursuing this work it also helped finding a number of bugs in the old sockmap code that we've fixed already in earlier commits. The test_sockmap kselftest suite passes through fine as well. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-10-15 12:23:19 -07:00
Daniel Borkmann	1243a51f6c	tcp, ulp: remove ulp bits from sockmap In order to prepare sockmap logic to be used in combination with kTLS we need to detangle it from ULP, and further split it in later commits into a generic API. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-10-15 12:23:19 -07:00
Daniel Borkmann	8b9088f806	tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup Whenever the ULP data on the socket is mangled, enforce that the caller has the socket lock held as otherwise things may race with initialization and cleanup callbacks from ulp ops as both would mangle internal socket state. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-10-15 12:23:19 -07:00
David S. Miller	d864991b22	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts were easy to resolve using immediate context mostly, except the cls_u32.c one where I simply too the entire HEAD chunk. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-12 21:38:46 -07:00
David Ahern	859bd2ef1f	net: Evict neighbor entries on carrier down When a link's carrier goes down it could be a sign of the port changing networks. If the new network has overlapping addresses with the old one, then the kernel will continue trying to use neighbor entries established based on the old network until the entries finally age out - meaning a potentially long delay with communications not working. This patch evicts neighbor entries on carrier down with the exception of those marked permanent. Permanent entries are managed by userspace (either an admin or a routing daemon such as FRR). Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-12 09:47:39 -07:00
Sabrina Dubroca	28d35bcdd3	net: ipv4: don't let PMTU updates increase route MTU When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is received, we must clamp its value. However, we can receive a PMTU exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an increase in PMTU. To fix this, take the smallest of the old MTU and ip_rt_min_pmtu. Before this patch, in case of an update, the exception's MTU would always change. Now, an exception can have only its lock flag updated, but not the MTU, so we need to add a check on locking to the following "is this exception getting updated, or close to expiring?" test. Fixes: `d52e5a7e7c` ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:44:46 -07:00
Sabrina Dubroca	af7d6cce53	net: ipv4: update fnhe_pmtu when first hop's MTU changes Since commit `5aad1de5ea` ("ipv4: use separate genid for next hop exceptions"), exceptions get deprecated separately from cached routes. In particular, administrative changes don't clear PMTU anymore. As Stefano described in commit `e9fa1495d7` ("ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes"), the PMTU discovered before the local MTU change can become stale: - if the local MTU is now lower than the PMTU, that PMTU is now incorrect - if the local MTU was the lowest value in the path, and is increased, we might discover a higher PMTU Similarly to what commit `e9fa1495d7` did for IPv6, update PMTU in those cases. If the exception was locked, the discovered PMTU was smaller than the minimal accepted PMTU. In that case, if the new local MTU is smaller than the current PMTU, let PMTU discovery figure out if locking of the exception is still needed. To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU notifier. By the time the notifier is called, dev->mtu has been changed. This patch adds the old MTU as additional information in the notifier structure, and a new call_netdevice_notifiers_u32() function. Fixes: `5aad1de5ea` ("ipv4: use separate genid for next hop exceptions") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:44:46 -07:00
Yuchung Cheng	ffd177dea5	tcp: refactor DCTCP ECN ACK handling DCTCP has two parts - a new ECN signalling mechanism and the response function to it. The first part can be used by other congestion control for DCTCP-ECN deployed networks. This patch moves that part into a separate tcp_dctcp.h to be used by other congestion control module (like how Yeah uses Vegas algorithmas). For example, BBR is experimenting such ECN signal currently https://tinyurl.com/ietf-102-iccrg-bbr2 Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:26:00 -07:00
David S. Miller	9000a457a0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree: 1) Support for matching on ipsec policy already set in the route, from Florian Westphal. 2) Split set destruction into deactivate and destroy phase to make it fit better into the transaction infrastructure, also from Florian. This includes a patch to warn on imbalance when setting the new activate and deactivate interfaces. 3) Release transaction list from the workqueue to remove expensive synchronize_rcu() from configuration plane path. This speeds up configuration plane quite a bit. From Florian Westphal. 4) Add new xfrm/ipsec extension, this new extension allows you to match for ipsec tunnel keys such as source and destination address, spi and reqid. From Máté Eckl and Florian Westphal. 5) Add secmark support, this includes connsecmark too, patches from Christian Gottsche. 6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng. One follow up patch to calm a clang warning for this one, from Nathan Chancellor. 7) Flush conntrack entries based on layer 3 family, from Kristian Evensen. 8) New revision for cgroups2 to shrink the path field. 9) Get rid of obsolete need_conntrack(), as a result from recent demodularization works. 10) Use WARN_ON instead of BUG_ON, from Florian Westphal. 11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian. 12) Remove superfluous check for timeout netlink parser and dump functions in layer 4 conntrack helpers. 13) Unnecessary redundant rcu read side locks in NAT redirect, from Taehee Yoo. 14) Pass nf_hook_state structure to error handlers, patch from Florian Westphal. 15) Remove ->new() interface from layer 4 protocol trackers. Place them in the ->packet() interface. From Florian. 16) Place conntrack ->error() handling in the ->packet() interface. Patches from Florian Westphal. 17) Remove unused parameter in the pernet initialization path, also from Florian. 18) Remove additional parameter to specify layer 3 protocol when looking up for protocol tracker. From Florian. 19) Shrink array of layer 4 protocol trackers, from Florian. 20) Check for linear skb only once from the ALG NAT mangling codebase, from Taehee Yoo. 21) Use rhashtable_walk_enter() instead of deprecated rhashtable_walk_init(), also from Taehee. 22) No need to flush all conntracks when only one single address is gone, from Tan Hu. 23) Remove redundant check for NAT flags in flowtable code, from Taehee Yoo. 24) Use rhashtable_lookup() instead of rhashtable_lookup_fast() from netfilter codebase, since rcu read lock side is already assumed in this path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 21:28:55 -07:00
David Ahern	addd383f5a	net: Update netconf dump handlers for strict data checking Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and mpls_netconf_dump_devconf for strict data checking. If the flag is set, the dump request is expected to have an netconfmsg struct as the header. The struct only has the family member and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	e8ba330ac0	rtnetlink: Update fib dumps for strict data checking Add helper to check netlink message for route dumps. If the strict flag is set the dump request is expected to have an rtmsg struct as the header. All elements of the struct are expected to be 0 with the exception of rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX set. Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute, and ip6mr_rtm_dumproute to call this helper if strict data checking is enabled. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	14fc5bb29f	rtnetlink: Update ipmr_rtm_dumplink for strict data checking Update ipmr_rtm_dumplink for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	c33078e3df	net/ipv4: Update inet_dump_ifaddr for strict data checking Update inet_dump_ifaddr for strict data checking. If the flag is set, the dump request is expected to have an ifaddrmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID attribute is supported. Follow on patches can support for other fields (e.g., honor ifa_index and only return data for the given device index). Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	dac9c9790e	net: Add extack to nlmsg_parse Make sure extack is passed to nlmsg_parse where easy to do so. Most of these are dump handlers and leveraging the extack in the netlink_callback. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
Jiri Kosina	7e823644b6	udp: Unbreak modules that rely on external __skb_recv_udp() availability Commit `2276f58ac5` ("udp: use a separate rx queue for packet reception") turned static inline __skb_recv_udp() from being a trivial helper around __skb_recv_datagram() into a UDP specific implementaion, making it EXPORT_SYMBOL_GPL() at the same time. There are external modules that got broken by __skb_recv_udp() not being visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL(). Rationale (one of those) why this is actually "technically correct" thing to do: __skb_recv_udp() used to be an inline wrapper around __skb_recv_datagram(), which itself (still, and correctly so, I believe) is EXPORT_SYMBOL(). Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: `2276f58ac5` ("udp: use a separate rx queue for packet reception") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-07 20:33:10 -07:00
Willem de Bruijn	f2e9de210d	udp: gro behind static key Avoid the socket lookup cost in udp_gro_receive if no socket has a udp tunnel callback configured. udp_sk(sk)->gro_receive requires a registration with setup_udp_tunnel_sock, which enables the static key. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 11:52:38 -07:00
David Ahern	1620a33695	net: Move free of dst_metrics to helper Move the refcounting and potential free of dst metrics associated for ipv4 and ipv6 to a common helper. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:25 -07:00
David Ahern	e1255ed4b6	net: common metrics init helper for dst_entry ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set. Move the common initialization code to a helper and use for both protocols. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:19 -07:00
David Ahern	cc5f0eb216	net: Move free of fib_metrics to helper Move the refcounting and potential free of dst metrics associated with a fib entry to a helper and use it in both ipv4 and ipv6. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:10 -07:00
David Ahern	767a221753	net: common metrics init helper for FIB entries Consolidate initialization of ipv4 and ipv6 metrics when fib entries are created into a single helper, ip_fib_metrics_init, that handles the call to ip_metrics_convert. If no metrics are defined for the fib entry, then the metrics is set to dst_default_metrics. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:03 -07:00
David S. Miller	6f41617bf2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-03 21:00:17 -07:00
Eric Dumazet	64199fc0a4	ipv4: fix use-after-free in ip_cmsg_recv_dstaddr() Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy, do not do it. Fixes: `2efd4fca70` ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:32:05 -07:00
Robert Shearman	854da99173	ipv4: Allow sending multicast packets on specific i/f using VRF socket It is useful to be able to use the same socket for listening in a specific VRF, as for sending multicast packets out of a specific interface. However, the bound device on the socket currently takes precedence and results in the packets not being sent. Relax the condition on overriding the output interface to use for sending packets out of UDP, raw and ping sockets to allow multicast packets to be sent using the specified multicast interface. Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:28:17 -07:00
Eric Dumazet	8873c064d1	tcp: do not release socket ownership in tcp_close() syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk)); in tcp_close() While a socket is being closed, it is very possible other threads find it in rtnetlink dump. tcp_get_info() will acquire the socket lock for a short amount of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);), enough to trigger the warning. Fixes: `67db3e4bfb` ("tcp: no longer hold ehash lock while calling tcp_get_info()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:17:35 -07:00

1 2 3 4 5 ...

9304 Commits