Split assignment and initialization from one into two functions.
This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.
As we walk the available congestion control list during the
assignment, we are always guaranteed to have Reno present as
it's fixed compiled-in. Therefore, since we're doing the
early assignment, we don't have a real use for the Reno alias
tcp_init_congestion_ops anymore and can thus remove it.
Actual usage of the congestion control operations are being
made after the 3WHS has finished, in some cases however we
can access get_info() via diag if implemented, therefore we
need to zero out the private area for those modules.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do not wish to disturb dropwatch or perf drop profiles with an ARP
we will ignore.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2014-09-25
1) Remove useless hash_resize_mutex in xfrm_hash_resize().
This mutex is used only there, but xfrm_hash_resize()
can't be called concurrently at all. From Ying Xue.
2) Extend policy hashing to prefixed policies based on
prefix lenght thresholds. From Christophe Gouault.
3) Make the policy hash table thresholds configurable
via netlink. From Christophe Gouault.
4) Remove the maximum authentication length for AH.
This was needed to limit stack usage. We switched
already to allocate space, so no need to keep the
limit. From Herbert Xu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit f187bc6efb ("ipv4: No need to set generic neighbour pointer")
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup which follows the idea in commit e11ecddf51 (tcp: use
TCP_SKB_CB(skb)->tcp_flags in input path),
and it may reduce register pressure since skb->cb[] access is fast,
bacause skb is probably in a register.
v2: remove variable th
v3: reword the changelog
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our goal is to access no more than one cache line access per skb in
a write or receive queue when doing the various walks.
After recent TCP_SKB_CB() reorganizations, it is almost done.
Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.
This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.
This also speeds up all sack processing in general.
This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP maintains lists of skb in write queue, and in receive queues
(in order and out of order queues)
Scanning these lists both in input and output path usually requires
access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq
These fields are currently in two different cache lines, meaning we
waste lot of memory bandwidth when these queues are big and flows
have either packet drops or packet reorders.
We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
this header is not used in fast path. This allows TCP to search much faster
in the skb lists.
Even with regular flows, we save one cache line miss in fast path.
Thanks to Christoph Paasch for noticing we need to cleanup
skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Lets break this assumption, but provide a helper to not change all call points.
ip_send_unicast_reply() gets a new struct ip_options pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_dma was the only external user so this can become local to tcp.c
again.
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that tcp_dma_try_early_copy() is gone nothing ever sets
copied_early.
Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma"
since it is no longer necessary.
Cc: Ali Saidi <saidi@engin.umich.edu>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Neal Cardwell <ncardwell@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.
This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.
Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():
https://lkml.org/lkml/2014/9/3/177
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().
It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.
Introduce __skb_header_release() to clearly document this.
This gave me a 1.5 % improvement on TCP_RR workload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we try to add an already existing tunnel, we don't return
an error. Instead we continue and call ip_tunnel_update().
This means that we can change existing tunnels by adding
the same tunnel multiple times. It is even possible to change
the tunnel endpoints of the fallback device.
We fix this by returning an error if we try to add an existing
tunnel.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In udp[46]_ufo_send_check the UDP checksum initialized to the pseudo
header checksum. We can move this logic into udp[46]_ufo_fragment.
After this change udp[64]_ufo_send_check is a no-op.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
pseudo header checksum using __tcp_v[46]_send_check. We can move this
logic into new tcp[46]_gso_segment functions to be done when
ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
the common case, possibly always true when taking GSO path). After this
change tcp_v[46]_gso_send_check is no-op.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
In order to make TCP more resilient in presence of reorders, we need
to allow coalescing to happen when skbs from out of order queue are
transferred into receive queue. LRO/GRO can be completely canceled
in some pathological cases, like per packet load balancing on aggregated
links.
I had to move tcp_try_coalesce() up in the file above tcp_ofo_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.
When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.
iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.
This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask are
controlled by this limit.
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
Default: 50
Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973f7
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
this_cpu_ptr() in preemptible context is generally bad
Sep 22 05:05:55 br kernel: [ 94.608310] BUG: using smp_processor_id()
in
preemptible [00000000] code: ip/2261
Sep 22 05:05:55 br kernel: [ 94.608316] caller is
tunnel_dst_set.isra.28+0x20/0x60 [ip_tunnel]
Sep 22 05:05:55 br kernel: [ 94.608319] CPU: 3 PID: 2261 Comm: ip Not
tainted
3.17.0-rc5 #82
We can simply use raw_cpu_ptr(), as preemption is safe in these
contexts.
Should fix https://bugzilla.kernel.org/show_bug.cgi?id=84991
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Joe <joe9mail@gmail.com>
Fixes: 9a4aa9af44 ("ipv4: Use percpu Cache route in IP tunnels")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.
Better use 64bit to perform icsk_rto << icsk_backoff operations
As Joe Perches suggested, add a helper for this.
Yuchung spotted the tcp_v4_err() case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added netlink attrs to configure FOU encapsulation for GRE, netlink
handling of these flags, and properly adjust MTU for encapsulation.
ip_tunnel_encap is called from ip_tunnel_xmit to actually perform FOU
encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink handling for IP tunnel encapsulation parameters and
and adjustment of MTU for encapsulation. ip_tunnel_encap is called
from ip_tunnel_xmit to actually perform FOU encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes IP tunnel to support (secondary) encapsulation,
Foo-over-UDP. Changes include:
1) Adding tun_hlen as the tunnel header length, encap_hlen as the
encapsulation header length, and hlen becomes the grand total
of these.
2) Added common netlink define to support FOU encapsulation.
3) Routines to perform FOU encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement fou_gro_receive and fou_gro_complete, and populate these
in the correponsing udp_offloads for the socket. Added ipproto to
udp_offloads and pass this from UDP to the fou GRO routine in proto
field of napi_gro_cb structure.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides a receive path for foo-over-udp. This allows
direct encapsulation of IP protocols over UDP. The bound destination
port is used to map to an IP protocol, and the XFRM framework
(udp_encap_rcv) is used to receive encapsulated packets. Upon
reception, the encapsulation header is logically removed (pointer
to transport header is advanced) and the packet is reinjected into
the receive path with the IP protocol indicated by the mapping.
Netlink is used to configure FOU ports. The configuration information
includes the port number to bind to and the IP protocol corresponding
to that port.
This should support GRE/UDP
(http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
as will as the other IP tunneling protocols (IPIP, SIT).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Want to be able to use these in foo-over-udp offloads, etc.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we no longer rely on having tcp headers for skbs in receive queue,
tcp repair do not need to build fake ones.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a few more UDP tunnel APIs that can be shared by UDP based
tunnel protocol implementation. The main ones are highlighted below.
setup_udp_tunnel_sock() configures UDP listener socket for
receiving UDP encapsulated packets.
udp_tunnel_xmit_skb() and upd_tunnel6_xmit_skb() transmit skb
using UDP encapsulation.
udp_tunnel_sock_release() closes the UDP tunnel listener socket.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip6_udp_tunnel.c for ipv6 UDP tunnel functions to avoid ifdefs
in udp_tunnel.c
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While tracking down the MAX_AH_AUTH_LEN crash in an old kernel
I thought that this limit was rather arbitrary and we should
just get rid of it.
In fact it seems that we've already done all the work needed
to remove it apart from actually removing it. This limit was
there in order to limit stack usage. Since we've already
switched over to allocating scratch space using kmalloc, there
is no longer any need to limit the authentication length.
This patch kills all references to it, including the BUG_ONs
that led me here.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently we genarate a blackhole route route whenever we have
matching policies but can not resolve the states. Here we assume
that dst_output() is called to kill the balckholed packets.
Unfortunately this assumption is not true in all cases, so
it is possible that these packets leave the system unwanted.
We fix this by generating blackhole routes only from the
route lookup functions, here we can guarantee a call to
dst_output() afterwards.
Fixes: 2774c131b1 ("xfrm: Handle blackhole route creation via afinfo.")
Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tcp_collapse() wants to shrink skb so that the overhead is minimal.
Now we store tcp flags into TCP_SKB_CB(skb)->tcp_flags, we no longer
need to keep around full headers.
Whole available space is dedicated to the payload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can allow a segment with FIN to be aggregated,
if we take care to add tcp flags,
and if skb_try_coalesce() takes care of zero sized skbs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.
tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.
We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
This patch does so, and avoids the cache line miss in tcp_recvmsg()
Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2abb7cdc0d ("udp: Add support for doing checksum unnecessary
conversion") caused napi_gro_cb structs with the "flush" field zero to
take the "udp_gro_receive" path rather than the "set flush to 1" path
that they would previously take. As a result I saw booting from an NFS
root hang shortly after starting userspace, with "server not
responding" messages.
This change to the handling of "flush == 0" packets appears to be
incidental to the goal of adding new code in the case where
skb_gro_checksum_validate_zero_check() returns zero. Based on that and
the fact that it breaks things, I'm assuming that it is unintentional.
Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users are starting to test nf_tables with no x_tables support. Therefore,
masquerading needs to be indenpendent of it from Kconfig.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that we have masquerading support in nf_tables, the NAT chain can
be use with it, not only for SNAT/DNAT. So make this chain type
independent of it.
While at it, move it inside the scope of 'if NF_NAT_IPV*' to simplify
dependencies.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
nf-next pull request
The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:
1) use __u8 instead of u_int8_t in arptables header, from
Mike Frysinger.
2) Add support to match by skb->pkttype to the meta expression, from
Ana Rey.
3) Add support to match by cpu to the meta expression, also from
Ana Rey.
4) A smatch warning about IPSET_ATTR_MARKMASK validation, patch from
Vytas Dauksa.
5) Fix netnet and netportnet hash types the range support for IPv4,
from Sergey Popovich.
6) Fix missing-field-initializer warnings resolved, from Mark Rustad.
7) Dan Carperter reported possible integer overflows in ipset, from
Jozsef Kadlecsick.
8) Filter out accounting objects in nfacct by type, so you can
selectively reset quotas, from Alexey Perevalov.
9) Move specific NAT IPv4 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.
11) Move specific NAT IPv6 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.
13) Refactor code to add nft_delrule(), which can be reused in the
enhancement of the NFT_MSG_DELTABLE to remove a table and its
content, from Arturo Borrero.
14) Add a helper function to unregister chain hooks, from
Arturo Borrero.
15) A cleanup to rename to nft_delrule_by_chain for consistency with
the new nft_*() functions, also from Arturo.
16) Add support to match devgroup to the meta expression, from Ana Rey.
17) Reduce stack usage for IPVS socket option, from Julian Anastasov.
18) Remove unnecessary textsearch state initialization in xt_string,
from Bojan Prtvar.
19) Add several helper functions to nf_tables, more work to prepare
the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.
20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
Arturo Borrero.
21) Support NAT flags in the nat expression to indicate the flavour,
eg. random fully, from Arturo.
22) Add missing audit code to ebtables when replacing tables, from
Nicolas Dichtel.
23) Generalize the IPv4 masquerading code to allow its re-use from
nf_tables, from Arturo.
24) Generalize the IPv6 masquerading code, also from Arturo.
25) Add the new masq expression to support IPv4/IPv6 masquerading
from nf_tables, also from Arturo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inet_gro_receive and inet_gro_complete to ipip_offload to
support GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/udp_offload.c:339:5: warning: symbol 'udp4_gro_complete' was
not declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Fixes: 57c67ff4bd ("udp: additional GRO support")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove one sparse warning :
net/ipv4/ip_sockglue.c:328:22: warning: incorrect type in assignment (different address spaces)
net/ipv4/ip_sockglue.c:328:22: expected struct ip_ra_chain [noderef] <asn:4>*next
net/ipv4/ip_sockglue.c:328:22: got struct ip_ra_chain *[assigned] ra
And replace one rcu_assign_ptr() by RCU_INIT_POINTER() where applicable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.
We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.
Commit 093162553c (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.
It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.
We can add a logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() from using a NULL dst.
Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.
Tested using Alexander script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.
echo 0 >/proc/sys/net/ipv4/tcp_autocorking
for i in `seq 0 47`
do
for j in `seq 0 2`
do
netperf -H $DEST -t TCP_STREAM -l 1000 \
-c -C -T $i,$i -P 0 -- \
-m 64 -s 64K -D &
done
done
Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net.ipv4.ip_nonlocal_bind sysctl was global to all network
namespaces. This patch allows to set a different value for each
network namespace.
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nft_masq expression is intended to perform NAT in the masquerade flavour.
We decided to have the masquerade functionality in a separated expression other
than nft_nat.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Let's refactor the code so we can reach the masquerade functionality
from outside the xt context (ie. nftables).
The patch includes the addition of an atomic counter to the masquerade
notifier: the stuff to be done by the notifier is the same for xt and
nftables. Therefore, only one notification handler is needed.
This factorization only involves IPv4; a similar patch follows to
handle IPv6.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inetpeer sequence numbers are no longer incremented, so no need to
check and flush the tree. The function that increments the sequence
number was already dead code and removed in in "ipv4: remove unused
function" (068a6e18). Remove the code that checks for a change, too.
Verifying that v4_seq and v6_seq are never incremented and thus that
flush_check compares bp->flush_seq to 0 is trivial.
The second part of the change removes flush_check completely even
though bp->flush_seq is exactly !0 once, at initialization. This
change is correct because the time this branch is true is when
bp->root == peer_avl_empty_rcu, in which the branch and
inetpeer_invalidate_tree are a NOOP.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE assumes that the GRE header is at skb_network_header +
ip_hrdlen(skb). It is more general to use skb_transport_header
and this allows the possbility of inserting additional header
between IP and GRE (which is what we will done in Generic UDP
Encapsulation for GRE).
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
The TCP_SKB_CB(skb)->when field no longer exists as of recent change
7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when"). And in any case,
tcp_fragment() is called on already-transmitted packets from the
__tcp_retransmit_skb() call site, so copying timestamps of any kind
in this spot is quite sensible.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 740b0f1841 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.
TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_SKB_CB(skb)->when has different meaning in output and input paths.
In output path, it contains a timestamp.
In input path, it contains an ISN, chosen by tcp_timewait_state_process()
Lets add a different name to ease code comprehension.
Note that 'when' field will disappear in following patch,
as skb_mstamp already contains timestamp, the anonymous
union will promptly disappear as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since sock_efree and sock_demux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots. As a result we
can avoid a bit of unnecessary code or code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lets make this hash function a bit secure, as ICMP attacks are still
in the wild.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nh_exceptions is effectively used under rcu, but lacks proper
barriers. Between kzalloc() and setting of nh->nh_exceptions(),
we need a proper memory barrier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: David S. Miller <davem@davemloft.net>
As in IPv6 people might increase the igmp query robustness variable to
make sure unsolicited state change reports aren't lost on the network. Add
and document this new knob to igmp code.
RFCs allow tuning this parameter back to first IGMP RFC, so we also use
this setting for all counters, including source specific multicast.
Also take over sysctl value when upping the interface and don't reuse
the last one seen on the interface.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the exported IPv4 NAT functions that are provided by the core. This
removes duplicated code so iptables and nft use the same NAT codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the specific NAT IPv4 core functions that are called from the
hooks from iptable_nat.c to nf_nat_l3proto_ipv4.c. This prepares the
ground to allow iptables and nft to use the same NAT engine code that
comes in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.
Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).
This moves signal generation for the next packet forward, which should
be harmless.
It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.
For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).
This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable inbetween releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_checksum_try_convert and skb_gro_checksum_try_convert
after checksum is found present and validated in the GRE header
for normal and GRO paths respectively.
In GRO path, call skb_gro_checksum_try_convert
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE
conversion in UDP tunneling path.
In the normal UDP path, we call skb_checksum_try_convert after locating
the UDP socket. The check is that checksum conversion is enabled for
the socket (new flag in UDP socket) and that checksum field is
non-zero.
In the UDP GRO path, we call skb_gro_checksum_try_convert after
checksum is validated and checksum field is non-zero. Since this is
already in GRO we assume that checksum conversion is always wanted.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix places where there is space before tab, long lines, and
awkward if(){, double spacing etc. Add blank line after declaration/initialization.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow GRO path to "consume" checksums provided in CHECKSUM_UNNECESSARY
and to report new checksums verfied for use in fallback to normal
path.
Change GRO checksum path to track csum_level using a csum_cnt field
in NAPI_GRO_CB. On GRO initialization, if ip_summed is
CHECKSUM_UNNECESSARY set NAPI_GRO_CB(skb)->csum_cnt to
skb->csum_level + 1. For each checksum verified, decrement
NAPI_GRO_CB(skb)->csum_cnt while its greater than zero. If a checksum
is verfied and NAPI_GRO_CB(skb)->csum_cnt == 0, we have verified a
deeper checksum than originally indicated in skbuf so increment
csum_level (or initialize to CHECKSUM_UNNECESSARY if ip_summed is
CHECKSUM_NONE or CHECKSUM_COMPLETE).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch:
- Clarifies the specific requirements of devices returning
CHECKSUM_UNNECESSARY (comments in skbuff.h).
- Adds csum_level field to skbuff. This is used to express how
many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
This replaces the overloading of skb->encapsulation, that field is
is now only used to indicate inner headers are valid.
- Change __skb_checksum_validate_needed to "consume" each checksum
as indicated by csum_level as layers of the the packet are parsed.
- Remove skb_pop_rcv_encapsulation, no longer needed in the new
csum_level model.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace uses of get_cpu_var for address calculation through this_cpu_ptr.
Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In GRE demux if the GRE checksum pop rcv encapsulation so that any
encapsulated checksums are treated as tunnel checksums.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement GRO for UDPv6. Add UDP checksum verification in gro_receive
for both UDP4 and UDP6 calling skb_gro_checksum_validate_zero_check.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp[64]_gro_receive call skb_gro_checksum_validate to validate TCP
checksum in the gro context.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
disabled if any retransmits were still in flight. The concern was
perhaps that spurious retransmission sent in a previous recovery
episode may trigger DSACKs to falsely undo the current recovery.
However, this inadvertently misses undo opportunities (using either
TCP timestamps or DSACKs) when timeout occurs during a loss episode,
i.e. recurring timeouts or timeout during fast recovery. In these
cases some retransmissions will be in flight but we should allow
undo. Furthermore, we should only reset undo_marker and undo_retrans
upon timeout if we are starting a new recovery episode. Finally,
when we do reset our undo state, we now do so in a manner similar
to tcp_enter_recovery(), so that we require a DSACK for each of
the outstsanding retransmissions. This will achieve the original
goal by requiring that we receive the same number of DSACKs as
retransmissions.
This patch increases the undo events by 50% on Google servers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.
A simplified version of the Coccinelle semantic patch making this change
is as follows:
@change@
expression E1,E2;
@@
- jiffies - E1 < E2
+ time_before(jiffies, E1+E2)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes the replacement.
The following Coccinelle semantic patch was used:
@@
@@
(
if(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
|
while(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7a9bc9b81a ("ipv4: Elide fib_validate_source() completely when possible.")
introduced a short-circuit to avoid calling fib_validate_source when not
needed. That change took rp_filter into account, but not accept_local.
This resulted in a change of behaviour: with rp_filter and accept_local
off, incoming packets with a local address in the source field should be
dropped.
Here is how to reproduce the change pre/post 7a9bc9b81a commit:
-configure the same IPv4 address on hosts A and B.
-try to send an ARP request from B to A.
-The ARP request will be dropped before that commit, but accepted and answered
after that commit.
This adds a check for ACCEPT_LOCAL, to maintain full
fib validation in case it is 0. We also leave __fib_validate_source() earlier
when possible, based on the same check as fib_validate_source(), once the
accept_local stuff is verified.
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the NAT configs depend on iptables and ip6tables. However,
users should be capable of enabling NAT for nft without having to
switch on iptables.
Fix this by adding new specific IP_NF_NAT and IP6_NF_NAT config
switches for iptables and ip6tables NAT support. I have also moved
the original NF_NAT_IPV4 and NF_NAT_IPV6 configs out of the scope
of iptables to make them independent of it.
This patch also adds NETFILTER_XT_NAT which selects the xt_nat
combo that provides snat/dnat for iptables. We cannot use NF_NAT
anymore since nf_tables can select this.
Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix TCP FRTO logic so that it always notices when snd_una advances,
indicating that any RTO after that point will be a new and distinct
loss episode.
Previously there was a very specific sequence that could cause FRTO to
fail to notice a new loss episode had started:
(1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
(2) receiver ACKs packet 1
(3) FRTO sends 2 more packets
(4) RTO timer fires again (should start a new loss episode)
The problem was in step (3) above, where tcp_process_loss() returned
early (in the spot marked "Step 2.b"), so that it never got to the
logic to clear icsk_retransmits. Thus icsk_retransmits stayed
non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
icsk_retransmits, decide that this RTO is not a new episode, and
decide not to cut ssthresh and remember the current cwnd and ssthresh
for undo.
There were two main consequences to the bug that we have
observed. First, ssthresh was not decreased in step (4). Second, when
there was a series of such FRTO (1-4) sequences that happened to be
followed by an FRTO undo, we would restore the cwnd and ssthresh from
before the entire series started (instead of the cwnd and ssthresh
from before the most recent RTO). This could result in cwnd and
ssthresh being restored to values much bigger than the proper values.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
ordering of incoming connections and teardowns without the need to
hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
the last measured RTO. To do so, we keep the last seen timestamp in a
per-host indexed data structure and verify if the incoming timestamp
in a connection request is strictly greater than the saved one during
last connection teardown. Thus we can verify later on that no old data
packets will be accepted by the new connection.
During moving a socket to time-wait state we already verify if timestamps
where seen on a connection. Only if that was the case we let the
time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
will be used. But we don't verify this on incoming SYN packets. If a
connection teardown was less than TCP_PAWS_MSL seconds in the past we
cannot guarantee to not accept data packets from an old connection if
no timestamps are present. We should drop this SYN packet. This patch
closes this loophole.
Please note, this patch does not make tcp_tw_recycle in any way more
usable but only adds another safety check:
Sporadic drops of SYN packets because of reordering in the network or
in the socket backlog queues can happen. Users behing NAT trying to
connect to a tcp_tw_recycle enabled server can get caught in blackholes
and their connection requests may regullary get dropped because hosts
behind an address translator don't have synchronized tcp timestamp clocks.
tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
In general, use of tcp_tw_recycle is disadvised.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().
Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bytestream timestamps are correlated with a single byte in the skbuff,
recorded in skb_shinfo(skb)->tskey. When fragmenting skbuffs, ensure
that the tskey is set for the fragment in which the tskey falls
(seqno <= tskey < end_seqno).
The original implementation did not address fragmentation in
tcp_fragment or tso_fragment. Add code to inspect the sequence numbers
and move both tskey and the relevant tx_flags if necessary.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
can break out early, causing the timestamp code to be skipped. Move
the code up before the break.
Reported-by: David S. Miller <davem@davemloft.net>
Also fix a boundary condition: tp->snd_una is the next unacknowledged
byte and between tests inclusive (a <= b <= c), so generate a an ACK
timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since fib_lookup cannot return ESRCH no longer,
checking for this error code is no longer neccesary.
Signed-off-by: Niv Yehezkel <executerx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge incoming from Andrew Morton:
- Various misc things.
- arch/sh updates.
- Part of ocfs2. Review is slow.
- Slab updates.
- Most of -mm.
- printk updates.
- lib/ updates.
- checkpatch updates.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
checkpatch: update $declaration_macros, add uninitialized_var
checkpatch: warn on missing spaces in broken up quoted
checkpatch: fix false positives for --strict "space after cast" test
checkpatch: fix false positive MISSING_BREAK warnings with --file
checkpatch: add test for native c90 types in unusual order
checkpatch: add signed generic types
checkpatch: add short int to c variable types
checkpatch: add for_each tests to indentation and brace tests
checkpatch: fix brace style misuses of else and while
checkpatch: add --fix option for a couple OPEN_BRACE misuses
checkpatch: use the correct indentation for which()
checkpatch: add fix_insert_line and fix_delete_line helpers
checkpatch: add ability to insert and delete lines to patch/file
checkpatch: add an index variable for fixed lines
checkpatch: warn on break after goto or return with same tab indentation
checkpatch: emit a warning on file add/move/delete
checkpatch: add test for commit id formatting style in commit log
checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
checkpatch: improve "no space after cast" test
checkpatch: allow multiple const * types
...
All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.
The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A set of small fixes pointed out just after the merge:
- make tcp_tx_timestamp static
- make tcp_gso_tstamp static
- use before() to compare TCP seqno, instead of cast to u64
- add tstamp to tx_flags in GSO, instead of overwrite tx_flags
- record skb_shinfo(skb)->tskey for all timestamps, also HW.
- optimization in tcp_tx_timestamp:
call sock_tx_timestamp only if a tstamp option is set.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Steady transitioning of the BPF instructure to a generic spot so
all kernel subsystems can make use of it, from Alexei Starovoitov.
2) SFC driver supports busy polling, from Alexandre Rames.
3) Take advantage of hash table in UDP multicast delivery, from David
Held.
4) Lighten locking, in particular by getting rid of the LRU lists, in
inet frag handling. From Florian Westphal.
5) Add support for various RFC6458 control messages in SCTP, from
Geir Ola Vaagland.
6) Allow to filter bridge forwarding database dumps by device, from
Jamal Hadi Salim.
7) virtio-net also now supports busy polling, from Jason Wang.
8) Some low level optimization tweaks in pktgen from Jesper Dangaard
Brouer.
9) Add support for ipv6 address generation modes, so that userland
can have some input into the process. From Jiri Pirko.
10) Consolidate common TCP connection request code in ipv4 and ipv6,
from Octavian Purdila.
11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.
12) Generic resizable RCU hash table, with intial users in netlink and
nftables. From Thomas Graf.
13) Maintain a name assignment type so that userspace can see where a
network device name came from (enumerated by kernel, assigned
explicitly by userspace, etc.) From Tom Gundersen.
14) Automatic flow label generation on transmit in ipv6, from Tom
Herbert.
15) New packet timestamping facilities from Willem de Bruijn, meant to
assist in measuring latencies going into/out-of the packet
scheduler, latency from TCP data transmission to ACK, etc"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
cxgb4 : Disable recursive mailbox commands when enabling vi
net: reduce USB network driver config options.
tg3: Modify tg3_tso_bug() to handle multiple TX rings
amd-xgbe: Perform phy connect/disconnect at dev open/stop
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
net: sun4i-emac: fix memory leak on bad packet
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Revert "net: phy: Set the driver when registering an MDIO bus device"
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
team: Simplify return path of team_newlink
bridge: Update outdated comment on promiscuous mode
net-timestamp: ACK timestamp for bytestreams
net-timestamp: TCP timestamping
net-timestamp: SCHED timestamp on entering packet scheduler
net-timestamp: add key to disambiguate concurrent datagrams
net-timestamp: move timestamp flags out of sk_flags
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
cxgb4i : Move stray CPL definitions to cxgb4 driver
tcp: reduce spurious retransmits due to transient SACK reneging
qlcnic: Initialize dcbnl_ops before register_netdev
...
Pull security subsystem updates from James Morris:
"In this release:
- PKCS#7 parser for the key management subsystem from David Howells
- appoint Kees Cook as seccomp maintainer
- bugfixes and general maintenance across the subsystem"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
X.509: Need to export x509_request_asymmetric_key()
netlabel: shorter names for the NetLabel catmap funcs/structs
netlabel: fix the catmap walking functions
netlabel: fix the horribly broken catmap functions
netlabel: fix a problem when setting bits below the previously lowest bit
PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
tpm: simplify code by using %*phN specifier
tpm: Provide a generic means to override the chip returned timeouts
tpm: missing tpm_chip_put in tpm_get_random()
tpm: Properly clean sysfs entries in error path
tpm: Add missing tpm_do_selftest to ST33 I2C driver
PKCS#7: Use x509_request_asymmetric_key()
Revert "selinux: fix the default socket labeling in sock_graft()"
X.509: x509_request_asymmetric_keys() doesn't need string length arguments
PKCS#7: fix sparse non static symbol warning
KEYS: revert encrypted key change
ima: add support for measuring and appraising firmware
firmware_class: perform new LSM checks
security: introduce kernel_fw_from_file hook
PKCS#7: Missing inclusion of linux/err.h
...
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.
The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silenty
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.
Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.
The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.
The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.
When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.
Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.
To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.
A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.
We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:
(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2
In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcpm_key is an array inside struct tcp_md5sig, there is no need to check it
against NULL.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
Use kmem_cache to allocate/free inet_frag_queue objects since they're
all the same size per inet_frags user and are alloced/freed in high volumes
thus making it a perfect case for kmem_cache.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have INET_FRAG_EVICTED we might as well use it to stop
sending icmp messages in the "frag_expire" functions instead of
stripping INET_FRAG_FIRST_IN from their flags when evicting.
Also fix the comment style in ip6_expire_frag_queue().
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>