This introduce TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklest per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
an out of bounds access due to what appears to be a typo. I believe this
change should resolve the issue by replacing the access to RTAX_CWND with
TCP_METRIC_CWND.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing every writes to ipv4 metrics any longer.
PMTU is stored in rt->rt_pmtu.
Dynamic TCP metrics are stored in a special TCP metrics cache,
completely outside of the routes.
Therefore ->cow_metrics() can simply nothing more than a WARN_ON
trigger so we can catch anyone who tries to add new writes to
ipv4 route metrics.
Signed-off-by: David S. Miller <davem@davemloft.net>
Blackhole routes have a COW metrics operation that returns NULL
always, therefore this dst_copy_metrics() call did absolutely
nothing.
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed. TCP writes metrics, but now in it's own special
cache that does not dirty the route metrics. Therefore there is no
longer any reason to pre-cow metrics in this way.
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't maintain it dynamically any longer, so reporting it would
be extremely misleading. Report zero instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a local hash table of TCP dynamic metrics blobs.
Computed TCP metrics are no longer maintained in the route metrics.
The table uses RCU and an extremely simple hash so that it has low
latency and low overhead. A simple hash is legitimate because we only
make metrics blobs for fully established connections.
Some tweaking of the default hash table sizes, metric timeouts, and
the hash chain length limit certainly could use some tweaking. But
the basic design seems sound.
With help from Eric Dumazet and Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user hasn't actually installed any custom rules, or fiddled
with the default ones, don't go through the whole FIB rules layer.
It's just pure overhead.
Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check
the individual tables by hand, one by one.
Also, move fib_num_tclassid_users into the ipv4 network namespace.
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options_compile() can avoid calling fib_compute_spec_dst()
by default, and perform the call only if needed.
David suggested to add a helper to make the call only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).
Signed-off-by: David S. Miller <davem@davemloft.net>
When a dst_confirm() happens, mark the confirmation as pending in the
dst. Then on the next packet out, when we have the neigh in-hand, do
the update.
This removes the dependency in dst_confirm() of dst's having an
attached neigh.
While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL. So just fix those cases up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon routes will not have a cached neigh attached, nor will we
be able to necessarily go directly to a neigh from an arbitrary
route.
Signed-off-by: David S. Miller <davem@davemloft.net>
The spec_dst uses should be guarded by skb_rtable() being non-NULL
not just the SKB being non-null.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generalizes nf_ct_l4proto_net by splitting it into chunks and
moving the corresponding protocol part to where it really belongs to.
To clarify, note that we follow two different approaches to support per-net
depending if it's built-in or run-time loadable protocol tracker.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
This patch adds the following structure:
struct netlink_kernel_cfg {
unsigned int groups;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
};
That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.
I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.
That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.
This patch also adapts all callers to use this new interface.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If rpfilter is off (or the SKB has an IPSEC path) and there are not
tclassid users, we don't have to do anything at all when
fib_validate_source() is invoked besides setting the itag to zero.
We monitor tclassid uses with a counter (modified only under RTNL and
marked __read_mostly) and we protect the fib_validate_source() real
work with a test against this counter and whether rpfilter is to be
done.
Having a way to know whether we need no tclassid processing or not
also opens the door for future optimized rpfilter algorithms that do
not perform full FIB lookups.
Signed-off-by: David S. Miller <davem@davemloft.net>
Checking for in_dev being NULL is pointless.
In fact, all of our callers have in_dev precomputed already,
so just pass it in and remove the NULL checking.
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon feedback from Julian Anastasov.
1) Use route flags to determine multicast/broadcast, not the
packet flags.
2) Leave saddr unspecified in flow key.
3) Adjust how we invoke inet_select_addr(). Pass ip_hdr(skb)->saddr as
second arg, and if it was zeronet use link scope.
4) Use loopback as input interface in flow key.
Signed-off-by: David S. Miller <davem@davemloft.net>
The specific destination is the host we direct unicast replies to.
Usually this is the original packet source address, but if we are
responding to a multicast or broadcast packet we have to use something
different.
Specifically we must use the source address we would use if we were to
send a packet to the unicast source of the original packet.
The routing cache precomputes this value, but we want to remove that
precomputation because it creates a hard dependency on the expensive
rpfilter source address validation which we'd like to make cheaper.
There are only three places where this matters:
1) ICMP replies.
2) pktinfo CMSG
3) IP options
Now there will be no real users of rt->rt_spec_dst and we can simply
remove it altogether.
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename it to ip_send_unicast_reply() and add explicit 'saddr'
argument.
This removed one of the few users of rt->rt_spec_dst.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c074da2810.
This change has several unwanted side effects:
1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
thus never create a real cached route.
2) All TCP traffic will use DST_NOCACHE and never use the routing
cache at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
dropwatch wrongly diagnose all received UDP packets as drops.
This patch removes trace_kfree_skb() done in skb_free_datagram_locked().
Locations calling skb_free_datagram_locked() should do it on their own.
As a result, drops are accounted on the right function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also fix a needless skb tailroom check for a 4 bytes area
after after each rtnexthop block.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, no need to trim on nlmsg_put() failure, nothing has been added
yet. We also want to use nlmsg_end(), nlmsg_new() and nlmsg_free().
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
DDOS synflood attacks hit badly IP route cache.
On typical machines, this cache is allowed to hold up to 8 Millions dst
entries, 256 bytes for each, for a total of 2GB of memory.
rt_garbage_collect() triggers and tries to cleanup things.
Eventually route cache is disabled but machine is under fire and might
OOM and crash.
This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.
This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)
SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split sysctl function into smaller chucks to cleanup code and prepare
patches to reduce ifdef pollution.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
l4proto->init contain quite redundant code. We can simplify this
by adding a new parameter l3proto.
This patch prepares that code simplification.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>