Commit Graph

5697 Commits

Author SHA1 Message Date
Tomasz Bursztyka
eb31628e37 netfilter: nf_tables: Add support for IPv6 NAT
This patch generalizes the NAT expression to support both IPv4 and IPv6
using the existing IPv4/IPv6 NAT infrastructure. This also adds the
NAT chain type for IPv6.

This patch collapses the following patches that were posted to the
netfilter-devel mailing list, from Tomasz:

* nf_tables: Change NFTA_NAT_ attributes to better semantic significance
* nf_tables: Split IPv4 NAT into NAT expression and IPv4 NAT chain
* nf_tables: Add support for IPv6 NAT expression
* nf_tables: Add support for IPv6 NAT chain
* nf_tables: Fix up build issue on IPv6 NAT support

And, from Pablo Neira Ayuso:

* fix missing dependencies in nft_chain_nat

Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:58 +02:00
Pablo Neira Ayuso
0ca743a559 netfilter: nf_tables: add compatibility layer for x_tables
This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.

This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.

In order to get this compatibility layer working, I've done the
following things:

* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.

* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.

* add support for default policy to base chains, required to emulate
  x_tables.

* add NFTA_CHAIN_USE attribute to obtain the number of references to
  chains, required by x_tables emulation.

* add chain packet/byte counters using per-cpu.

* support 32-64 bits compat.

For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.

From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled

From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes

From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT

From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:04 +02:00
Pablo Neira Ayuso
9370761c56 netfilter: nf_tables: convert built-in tables/chains to chain types
This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.

After this patch, you have to specify the chain type when
creating a new chain:

 add chain ip filter output { type filter hook input priority 0; }
                              ^^^^ ------

The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:11 +02:00
Patrick McHardy
ef1f7df917 netfilter: nf_tables: expression ops overloading
Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.

This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:08 +02:00
Patrick McHardy
96518518cc netfilter: add nftables
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.

In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:

* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
  registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.

Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.

nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).

This patch includes the following components:

* the netlink API: net/netfilter/nf_tables_api.c and
  include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
  net/ipv4/netfilter/nf_tables_ipv4.c
  net/ipv6/netfilter/nf_tables_ipv6.c
  net/ipv4/netfilter/nf_tables_arp.c
  net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
  net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
  net/ipv4/netfilter/nf_table_route_ipv4.c
  net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
  include/net/netfilter/nf_tables.h
  include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
  net/netfilter/nft_expr_template.c
  and the preliminary implementation of the meta target
  net/netfilter/nft_meta_target.c

It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.

This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:

From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps

From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release

From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation

From Florian Westphal:
* nft_log: group is u16, snaplen u32

From Phil Oester:
* nf_tables: operational limit match

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:15:48 +02:00
Patrick McHardy
795aa6ef6a netfilter: pass hook ops to hookfn
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 11:29:31 +02:00
Eric Dumazet
ccdbb6e96b tcp: tcp_transmit_skb() optimizations
1) We need to take a timestamp only for skb that should be cloned.

Other skbs are not in write queue and no rtt estimation is done on them.

2) the unlikely() hint is wrong for receivers (they send pure ACK)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: MF Nowlan <fitz@cs.yale.edu>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-11 17:48:18 -04:00
Christophe Gouault
7263a5187f vti: get rid of nf mark rule in prerouting
This patch fixes and improves the use of vti interfaces (while
lightly changing the way of configuring them).

Currently:

- it is necessary to identify and mark inbound IPsec
  packets destined to each vti interface, via netfilter rules in
  the mangle table at prerouting hook.

- the vti module cannot retrieve the right tunnel in input since
  commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup
  is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them.

- the i_key is used by the outbound processing as a mark to lookup
  for the right SP and SA bundle.

This patch uses the o_key to store the vti mark (instead of i_key) and
enables:

- to avoid the need for previously marking the inbound skbuffs via a
  netfilter rule.
- to properly retrieve the right tunnel in input, only based on the IPsec
  packet outer addresses.
- to properly perform an inbound policy check (using the tunnel o_key
  as a mark).
- to properly perform an outbound SPD and SAD lookup (using the tunnel
  o_key as a mark).
- to keep the current mark of the skbuff. The skbuff mark is neither
  used nor changed by the vti interface. Only the vti interface o_key
  is used.

SAs have a wildcard mark.
SPs have a mark equal to the vti interface o_key.

The vti interface must be created as follows (i_key = 0, o_key = mark):

   ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1

The SPs attached to vti1 must be created as follows (mark = vti1 o_key):

   ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \
      proto esp mode tunnel
   ip xfrm policy add dir in  mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \
      proto esp mode tunnel

The SAs are created with the default wildcard mark. There is no
distinction between global vs. vti SAs. Just their addresses will
possibly link them to a vti interface:

   ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \
                 enc "cbc(aes)" "azertyuiopqsdfgh"

   ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \
                 enc "cbc(aes)" "sqbdhgqsdjqjsdfh"

To avoid matching "global" (not vti) SPs in vti interfaces, global SPs
should no use the default wildcard mark, but explicitly match mark 0.

To avoid a double SPD lookup in input and output (in global and vti SPDs),
the NOPOLICY and NOXFRM options should be set on the vti interfaces:

   echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy
   echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm

The outgoing traffic is steered to vti1 by a route via the vti interface:

   ip route add 192.168.0.0/16 dev vti1

The incoming IPsec traffic is steered to vti1 because its outer addresses
match the vti1 tunnel configuration.

Signed-off-by: Christophe Gouault <christophe.gouault@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-11 14:52:17 -04:00
Eric Dumazet
b44084c2c8 inet: rename ir_loc_port to ir_num
In commit 634fb979e8 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :

skc_dport is __be16 (network order), but skc_num is __u16 (host order)

So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16

Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 14:37:35 -04:00
Eric Dumazet
ba537427d7 tcp: use ACCESS_ONCE() in tcp_update_pacing_rate()
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Eric Dumazet
634fb979e8 inet: includes a sock_common in request_sock
TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port   -> ir_loc_port
loc_addr   -> ir_loc_addr
rmt_addr   -> ir_rmt_addr
rmt_port   -> ir_rmt_port
iif        -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
baker.zhang
4c60f1d67f fib_trie: only calc for the un-first node
This is a enhancement.

for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
David S. Miller
f606385068 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
1) We used the wrong netlink attribute to verify the
   lenght of the replay window on async events. Fix this by
   using the right netlink attribute.

2) Policy lookups can not match the output interface on forwarding.
   Add the needed informations to the flow informations.

3) We update the pmtu when we receive a ICMPV6_DEST_UNREACH message
   on IPsec with ipv6. This is wrong and leads to strange fragmented
   packets, only ICMPV6_PKT_TOOBIG messages should update the pmtu.
   Fix this by removing the ICMPV6_DEST_UNREACH check from the IPsec
   protocol error handlers.

4) The legacy IPsec anti replay mechanism supports anti replay
   windows up to 32 packets. If a user requests for a bigger
   anti replay window, we use 32 packets but pretend that we use
   the requested window size. Fix from Fan Du.

5) If asynchronous events are enabled and replay_maxdiff is set to
   zero, we generate an async event for every received packet instead
   of checking whether a timeout occurred. Fix from Thomas Egerer.

6) Policies need a refcount when the state resolution timer is armed.
   Otherwise the timer can fire after the policy is deleted.

7) We might dreference a NULL pointer if the hold_queue is empty,
   add a check to avoid this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 13:41:45 -04:00
Eric Dumazet
c2bb06db59 net: fix build errors if ipv6 is disabled
CONFIG_IPV6=n is still a valid choice ;)

It appears we can remove dead code.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 13:04:03 -04:00
Eric Dumazet
f69b923a75 udp: fix a typo in __udp4_lib_mcast_demux_lookup
At this point sk might contain garbage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 01:51:57 -04:00
Eric Dumazet
efe4208f47 ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 00:01:25 -04:00
Eric Dumazet
05dbc7b594 tcp/dccp: remove twchain
TCP listener refactoring, part 3 :

Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.

Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.

As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.

If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.

[ INET_TW_MATCH() is no longer needed ]

I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()

This way, SYN_RECV pseudo sockets will be supported the same.

A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].

Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()

Before patch :

dmesg | grep "TCP established"

TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

After patch :

TCP established hash table entries: 524288 (order: 10, 4194304 bytes)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:19:24 -04:00
David S. Miller
53af53ae83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.

Two parallel line additions in net/core/sock.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:07:53 -04:00
Shawn Bohrer
fbf8866d65 net: ipv4 only populate IP_PKTINFO when needed
The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received.  This has introduced a performance regression for some
UDP workloads.

This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.

Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After  90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After  12.48us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 16:27:33 -04:00
Shawn Bohrer
421b3885bf udp: ipv4: Add udp early demux
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host.  Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups.  Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After  89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After  12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 16:27:33 -04:00
Shawn Bohrer
005ec97433 udp: Only allow busy read/poll on connected sockets
UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues.  Since packets packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets.  This makes busy read/poll only work for connected UDP sockets.

This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 16:27:33 -04:00
Jiri Benc
0a7e226090 ipv4: fix ineffective source address selection
When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db8 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07 15:26:46 -04:00
Eric Dumazet
5e8a402f83 tcp: do not forget FIN in tcp_shifted_skb()
Yuchung found following problem :

 There are bugs in the SACK processing code, merging part in
 tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
 skbs FIN flag. When a receiver first SACK the FIN sequence, and later
 throw away ofo queue (e.g., sack-reneging), the sender will stop
 retransmitting the FIN flag, and hangs forever.

Following packetdrill test can be used to reproduce the bug.

$ cat sack-merge-bug.pkt
`sysctl -q net.ipv4.tcp_fack=0`

// Establish a connection and send 10 MSS.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0

+.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.001 < . 1:1(0) ack 1 win 1024
+.000 accept(3, ..., ...) = 4

+.100 write(4, ..., 12000) = 12000
+.000 shutdown(4, SHUT_WR) = 0
+.000 > . 1:10001(10000) ack 1
+.050 < . 1:1(0) ack 2001 win 257
+.000 > FP. 10001:12001(2000) ack 1
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
// SACK reneg
+.050 < . 1:1(0) ack 12001 win 257
+0 %{ print "unacked: ",tcpi_unacked }%
+5 %{ print "" }%

First, a typo inverted left/right of one OR operation, then
code forgot to advance end_seq if the merged skb carried FIN.

Bug was added in 2.6.29 by commit 832d11c5cd
("tcp: Try to restore large SKBs while SACK processing")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 14:16:36 -04:00
Eric Dumazet
96f817fede tcp: shrink tcp6_timewait_sock by one cache line
While working on tcp listener refactoring, I found that it
would really make things easier if sock_common could include
the IPv6 addresses needed in the lookups, instead of doing
very complex games to get their values (depending on sock
being SYN_RECV, ESTABLISHED, TIME_WAIT)

For this to happen, I need to be sure that tcp6_timewait_sock
and tcp_timewait_sock consume same number of cache lines.

This is possible if we only use 32bits for tw_ttd, as we remove
one 32bit hole in inet_timewait_sock

inet_tw_time_stamp() is defined and used, even if its current
implementation looks like tcp_time_stamp : We might need finer
resolution for tcp_time_stamp in the future.

Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8

After patch : sizeof(struct tcp6_timewait_sock) = 0xc0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 17:43:39 -04:00
Peter Senna Tschudin
34a6eda163 net: ipv4: Change variable type to bool
The variable fully_acked is only assigned the values true and false.
Change its type to bool.

The simplified semantic patch that find this problem is as
follows (http://coccinelle.lip6.fr/):

@exists@
type T;
identifier b;
@@
- T
+ bool
  b = ...;
  ... when any
  b = \(true\|false\)

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:40:34 -04:00
Eric Dumazet
5080546682 inet: consolidate INET_TW_MATCH
TCP listener refactoring, part 2 :

We can use a generic lookup, sockets being in whatever state, if
we are sure all relevant fields are at the same place in all socket
types (ESTABLISH, TIME_WAIT, SYN_RECV)

This patch removes these macros :

 inet_addrpair, inet_addrpair, tw_addrpair, tw_portpair

And adds :

 sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr

Then, INET_TW_MATCH() is really the same than INET_MATCH()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:33:35 -04:00
Eric Dumazet
80ad1d61e7 net: do not call sock_put() on TIMEWAIT sockets
commit 3ab5aee7fe ("net: Convert TCP & DCCP hash tables to use RCU /
hlist_nulls") incorrectly used sock_put() on TIMEWAIT sockets.

We should instead use inet_twsk_put()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-02 17:05:54 -04:00
Eric Dumazet
6ae705323b tcp: sndbuf autotuning improvements
tcp_fixup_sndbuf() is underestimating initial send buffer requirements.

It was not noticed because big GSO packets were escaping the limitation,
but with smaller TSO packets (or TSO/GSO/SG off), application hits
sk_sndbuf before having a chance to fill enough packets in socket write
queue.

- initial cwnd can be bigger than 10 for specific routes

- SKB_TRUESIZE() is a bit under real needs in some cases,
  because of power-of-two rounding in kmalloc()

- Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor

- Extra cushion (application might react slowly to POLLOUT)

tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
calling tcp_init_buffer_space()

Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
instead of duplicating this stuff.

Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
descriptive.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-02 16:45:17 -04:00
baker.zhang
bbe34cf8a1 fib_trie: avoid a redundant bit judgement in inflate
Because 'node' is the i'st child of 'oldnode',
thus, here 'i' equals
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)

we just get 1 more bit,
and need not care the detail value of this bits.

I apologize for the mistake.

I generated the patch on a branch version,
and did not notice the put_child has been changed.

I have redone the test on HEAD version with my patch.

two cases are used.
case 1. inflate a node which has a leaf child node.
case 2: inflate a node which has a an child node with skipped bits

test env:
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
here, we just focus on route table(MAIN),
so I use a "192.168.11.1/32" address to simplify the test case.

call trace:
+ fib_insert_node
+ + trie_rebalance
+ + + resize
+ + + + inflate

Test case 1:  inflate a node which has a leaf child node.

===========================================================
step 1. prepare a fib trie
------------------------------------------
  ip r a 192.168.0.0/24 via 192.168.11.1
  ip r a 192.168.1.0/24 via 192.168.11.1

we get a fib trie.
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/23 1 0 0
   |-- 192.168.0.0
    /24 universe UNICAST
   |-- 192.168.1.0
    /24 universe UNICAST
Local:
.....

step 2. Add the third route
------------------------------------------
root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1

A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.

For function 'inflate':
'inflate' is called with following trie.
  +-- 192.168.0.0/22 1 1 0 <=== tn node
    +-- 192.168.0.0/23 1 0 0    <== node a
        |-- 192.168.0.0
          /24 universe UNICAST
        |-- 192.168.1.0
          /24 universe UNICAST
      |-- 192.168.2.0          <== leaf(node b)

When process node b, which is a leaf. here:
i is 1,
node key "192.168.2.0"
oldnode is (pos:22, bits:1)

unpatch source:
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,2,0", 22 + 1, 1)

thus got 0, and call put_child(tn, 2*i, node); <== 2*i=2.

patched source:
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
tkey_extract_bits("192.168,2,0", 22, 1 + 1)  <== get 2.

Test case 2:  inflate a node which has a an child node with skipped bits
==========================================================================
step 1. prepare a fib trie.
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
  ip r a 192.168.128.0/24 via 192.168.11.1
  ip r a 192.168.0.0/24  via 192.168.11.1
  ip r a 192.168.16.0/24   via 192.168.11.1
  ip r a 192.168.32.0/24  via 192.168.11.1
  ip r a 192.168.48.0/24  via 192.168.11.1
  ip r a 192.168.144.0/24   via 192.168.11.1
  ip r a 192.168.160.0/24   via 192.168.11.1
  ip r a 192.168.176.0/24   via 192.168.11.1

check:
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/16 1 0 0
     +-- 192.168.0.0/18 2 0 0
        |-- 192.168.0.0
           /24 universe UNICAST
        |-- 192.168.16.0
           /24 universe UNICAST
        |-- 192.168.32.0
           /24 universe UNICAST
        |-- 192.168.48.0
           /24 universe UNICAST
     +-- 192.168.128.0/18 2 0 0
        |-- 192.168.128.0
           /24 universe UNICAST
        |-- 192.168.144.0
           /24 universe UNICAST
        |-- 192.168.160.0
           /24 universe UNICAST
        |-- 192.168.176.0
           /24 universe UNICAST
Local:
  ...

step 2. add a route to trigger inflate.
  ip r a 192.168.96.0/24   via 192.168.11.1

This command will call serveral times inflate.
In the first time, the fib_trie is:
________________________
+-- 192.168.128.0/(16, 1) <== tn node
 +-- 192.168.0.0/(17, 1)  <== node a
  +-- 192.168.0.0/(18, 2)
   |-- 192.168.0.0
   |-- 192.168.16.0
   |-- 192.168.32.0
   |-- 192.168.48.0
  |-- 192.168.96.0
 +-- 192.168.128.0/(18, 2) <== node b.
  |-- 192.168.128.0
  |-- 192.168.144.0
  |-- 192.168.160.0
  |-- 192.168.176.0

NOTE: node b is a interal node with skipped bits.
here,
i:1,
node->key "192.168.128.0",
oldnode:(pos:16, bits:1)
so
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16 + 1, 1) <=== 0

tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16, 1+1) <=== 2

2*i + 0 == 2, so the result is same.

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-02 16:37:15 -04:00
Andi Kleen
5843ef4213 tcp: Always set options to 0 before calling tcp_established_options
tcp_established_options assumes opts->options is 0 before calling,
as it read modify writes it.

For the tcp_current_mss() case the opts structure is not zeroed,
so this can be done with uninitialized values.

This is ok, because ->options is not read in this path.
But it's still better to avoid the operation on the uninitialized
field. This shuts up a static code analyzer, and presumably
may help the optimizer.

Cc: netdev@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-02 16:32:43 -04:00
Mathias Krause
278f2b3e2a netfilter: ipt_ULOG: fix info leaks
The ulog messages leak heap bytes by the means of padding bytes and
incompletely filled string arrays. Fix those by memset(0)'ing the
whole struct before filling it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-02 17:28:36 +02:00
David S. Miller
4fbef95af4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/usb/qmi_wwan.c
	drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
	include/net/netfilter/nf_conntrack_synproxy.h
	include/net/secure_seq.h

The conflicts are of two varieties:

1) Conflicts with Joe Perches's 'extern' removal from header file
   function declarations.  Usually it's an argument signature change
   or a function being added/removed.  The resolutions are trivial.

2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
   a new value, another changes an existing value.  That sort of
   thing.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 17:06:14 -04:00
Steffen Klassert
cfe4a53692 ip_tunnel: Remove double unregister of the fallback device
When queueing the netdevices for removal, we queue the
fallback device twice in ip_tunnel_destroy(). The first
time when we queue all netdevices in the namespace and
then again explicitly. Fix this by removing the explicit
queueing of the fallback device.

Bug was introduced when network namespace support was added
with commit 6c742e714d ("ipip: add x-netns support").

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:42:16 -04:00
Steffen Klassert
78a3694d44 ip_tunnel_core: Change __skb_push back to skb_push
Git commit 0e6fbc5b ("ip_tunnels: extend iptunnel_xmit()")
moved the IP header installation to iptunnel_xmit() and
changed skb_push() to __skb_push(). This makes possible
bugs hard to track down, so change it back to skb_push().

Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:42:16 -04:00
Steffen Klassert
6701328262 ip_tunnel: Add fallback tunnels to the hash lists
Currently we can not update the tunnel parameters of
the fallback tunnels because we don't find them in the
hash lists. Fix this by adding them on initialization.

Bug was introduced with commit c544193214
("GRE: Refactor GRE tunneling code.")

Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:42:16 -04:00
Steffen Klassert
3e08f4a72f ip_tunnel: Fix a memory corruption in ip_tunnel_xmit
We might extend the used aera of a skb beyond the total
headroom when we install the ipip header. Fix this by
calling skb_cow_head() unconditionally.

Bug was introduced with commit c544193214
("GRE: Refactor GRE tunneling code.")

Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:42:16 -04:00
David S. Miller
e024bdc051 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter/IPVS fixes for your net
tree, they are:

* Fix BUG_ON splat due to malformed TCP packets seen by synproxy, from
  Patrick McHardy.

* Fix possible weight overflow in lblc and lblcr schedulers due to
  32-bits arithmetics, from Simon Kirby.

* Fix possible memory access race in the lblc and lblcr schedulers,
  introduced when it was converted to use RCU, two patches from
  Julian Anastasov.

* Fix hard dependency on CPU 0 when reading per-cpu stats in the
  rate estimator, from Julian Anastasov.

* Fix race that may lead to object use after release, when invoking
  ipvsadm -C && ipvsadm -R, introduced when adding RCU, from Julian
  Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:39:35 -04:00
Salam Noureddine
e2401654dd ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put
It is possible for the timer handlers to run after the call to
ip_mc_down so use in_dev_put instead of __in_dev_put in the handler
function in order to do proper cleanup when the refcnt reaches 0.
Otherwise, the refcnt can reach zero without the in_device being
destroyed and we end up leaking a reference to the net_device and
see messages like the following,

unregister_netdevice: waiting for eth0 to become free. Usage count = 1

Tested on linux-3.4.43.

Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 22:28:56 -07:00
Eric W. Biederman
0bbf87d852 net ipv4: Convert ipv4.ip_local_port_range to be per netns v3
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
  of the callers.
- Move the initialization of sysctl_local_ports into
   sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c

v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next

v3:
- Compile fixes of strange callers of inet_get_local_port_range.
  This patch now successfully passes an allmodconfig build.
  Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 21:59:38 -07:00
Eric Dumazet
c9eeec26e3 tcp: TSQ can use a dynamic limit
When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.

Problem is this limit is either too big for low rates, or too small
for high rates.

Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.

New limit is two packets or at least 1 to 2 ms worth of packets.

Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.

High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]

Example for a single flow on 10Gbp link controlled by FQ/pacing

14 packets in flight instead of 2

$ tc -s -d qd
qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
buckets 1024 quantum 3028 initial_quantum 15140
 Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
requeues 6822476)
 rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
  2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
  2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit

Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.

Additional change : skb->destructor should be set to tcp_wfree().

A future patch (for linux 3.13+) might remove tcp_limit_output_bytes

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 20:41:57 -07:00
David S. Miller
7b77d161ce Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Conflicts:
	include/net/xfrm.h

Simple conflict between Joe Perches "extern" removal for function
declarations in header files and the changes in Steffen's tree.

Steffen Klassert says:

====================
Two patches that are left from the last development cycle.
Manual merging of include/net/xfrm.h is needed. The conflict
can be solved as it is currently done in linux-next.

1) We announce the creation of temporary acquire state via an asyc event,
   so the deletion should be annunced too. From Nicolas Dichtel.

2) The VTI tunnels do not real tunning, they just provide a routable
   IPsec tunnel interface. So introduce and use xfrm_tunnel_notifier
   instead of xfrm_tunnel for xfrm tunnel mode callback. From Fan Du.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 15:24:57 -04:00
Pravin B Shelar
d4a71b155c ip_tunnel: Do not use stale inner_iph pointer.
While sending packet skb_cow_head() can change skb header which
invalidates inner_iph pointer to skb header. Following patch
avoid using it. Found by code inspection.

This bug was introduced by commit 0e6fbc5b6c (ip_tunnels: extend
iptunnel_xmit()).

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 15:05:07 -04:00
Patrick McHardy
f4a87e7bd2 netfilter: synproxy: fix BUG_ON triggered by corrupt TCP packets
TCP packets hitting the SYN proxy through the SYNPROXY target are not
validated by TCP conntrack. When th->doff is below 5, an underflow happens
when calculating the options length, causing skb_header_pointer() to
return NULL and triggering the BUG_ON().

Handle this case gracefully by checking for NULL instead of using BUG_ON().

Reported-by: Martin Topholm <mph@one.com>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-09-30 12:44:38 +02:00
Eric Dumazet
62748f32d5 net: introduce SO_MAX_PACING_RATE
As mentioned in commit afe4fd0624 ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.

SO_MAX_PACING_RATE offers the application the ability to cap the
rate computed by transport layer. Value is in bytes per second.

u32 val = 1000000;
setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));

To be effectively paced, a flow must use FQ packet scheduler.

Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits
if any.

I chose to make this pacing rate a SOL_SOCKET option instead of a
TCP one because this can be used by other protocols.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:35:41 -07:00
Francesco Fusco
aa66158145 ipv4: processing ancillary IP_TOS or IP_TTL
If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:21:52 -07:00
Francesco Fusco
f02db315b8 ipv4: IP_TOS and IP_TTL can be specified as ancillary data
This patch enables the IP_TTL and IP_TOS values passed from userspace to
be stored in the ipcm_cookie struct. Three fields are added to the struct:

- the TTL, expressed as __u8.
  The allowed values are in the [1-255].
  A value of 0 means that the TTL is not specified.

- the TOS, expressed as __s16.
  The allowed values are in the range [0,255].
  A value of -1 means that the TOS is not specified.

- the priority, expressed as a char and computed when
  handling the ancillary data.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:21:51 -07:00
Eric Dumazet
9a3bab6b05 net: net_secret should not depend on TCP
A host might need net_secret[] and never open a single socket.

Problem added in commit aebda156a5
("net: defer net_secret[] initialization")

Based on prior patch from Hannes Frederic Sowa.

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@strressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:19:40 -07:00
Eric Dumazet
b0983d3c9b tcp: fix dynamic right sizing
Dynamic Right Sizing (DRS) is supposed to open TCP receive window
automatically, but suffers from two bugs, presented by order
of importance.

1) tcp_rcv_space_adjust() fix :

Using twice the last received amount is very pessimistic,
because it doesn't allow fast recovery or proper slow start
ramp up, if sender wants to increase cwin by 100% every RTT.

copied = bytes received in previous RTT

2*copied = bytes we expect to receive in next RTT

4*copied = bytes we need to advertise in rwin at end of next RTT

DRS is one RTT late, it needs a 4x factor.

If sender is not using ABC, and increases cwin by 50% every rtt,
then we needed 1.5*1.5 = 2.25 factor.
This is probably why this bug was not really noticed.

2) There is no window adjustment after first RTT. DRS triggers only
  after the second RTT.
  DRS needs two RTT to initialize, so tcp_fixup_rcvbuf() should setup
  sk_rcvbuf to allow proper window grow for first two RTT.

This patch increases TCP efficiency particularly for large RTT flows
when autotuning is used at the receiver, and more particularly
in presence of packet losses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-24 11:07:32 -04:00
Florian Westphal
086293542b tcp: syncookies: reduce mss table to four values
Halve mss table size to make blind cookie guessing more difficult.
This is sad since the tables were already small, but there
is little alternative except perhaps adding more precise mss information
in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.

Guessing all possible cookie values still has 8-in 2**32 chance.

Reported-by: Jakob Lell <jakob@jakoblell.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-24 10:39:58 -04:00
Florian Westphal
8c27bd75f0 tcp: syncookies: reduce cookie lifetime to 128 seconds
We currently accept cookies that were created less than 4 minutes ago
(ie, cookies with counter delta 0-3).  Combined with the 8 mss table
values, this yields 32 possible values (out of 2**32) that will be valid.

Reducing the lifetime to < 2 minutes halves the guessing chance while
still providing a large enough period.

While at it, get rid of jiffies value -- they overflow too quickly on
32 bit platforms.

getnstimeofday is used to create a counter that increments every 64s.
perf shows getnstimeofday cost is negible compared to sha_transform;
normal tcp initial sequence number generation uses getnstimeofday, too.

Reported-by: Jakob Lell <jakob@jakoblell.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-24 10:39:58 -04:00
Duan Jiong
8d65b1190d net: raw: do not report ICMP redirects to user space
Redirect isn't an error condition, it should leave
the error handler without touching the socket.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-24 10:15:49 -04:00
Duan Jiong
1a462d1892 net: udp: do not report ICMP redirects to user space
Redirect isn't an error condition, it should leave
the error handler without touching the socket.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-24 10:15:49 -04:00
Linus Torvalds
b75ff5e84b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) If the local_df boolean is set on an SKB we have to allocate a
    unique ID even if IP_DF is set in the ipv4 headers, from Ansis
    Atteka.

 2) Some fixups for the new chipset support that went into the sfc
    driver, from Ben Hutchings.

 3) Because SCTP bypasses a good chunk of, and actually duplicates, the
    logic of the ipv6 output path, some IPSEC things don't get done
    properly.  Integrate SCTP better into the ipv6 output path so that
    these problems are fixed and such issues don't get missed in the
    future either.  From Daniel Borkmann.

 4) Fix skge regressions added by the DMA mapping error return checking
    added in v3.10, from Mikulas Patocka.

 5) Kill some more IRQF_DISABLED references, from Michael Opdenacker.

 6) Fix races and deadlocks in the bridging code, from Hong Zhiguo.

 7) Fix error handling in tun_set_iff(), in particular don't leak
    resources.  From Jason Wang.

 8) Prevent format-string injection into xen-netback driver, from Kees
    Cook.

 9) Fix regression added to netpoll ARP packet handling, in particular
    check for the right ETH_P_ARP protocol code.  From Sonic Zhang.

10) Try to deal with AMD IOMMU errors when using r8169 chips, from
    Francois Romieu.

11) Cure freezes due to recent changes in the rt2x00 wireless driver,
    from Stanislaw Gruszka.

12) Don't do SPI transfers (which can sleep) in interrupt context in
    cw1200 driver, from Solomon Peachy.

13) Fix LEDs handling bug in 5720 tg3 chips already handled for 5719.
    From Nithin Sujir.

14) Make xen_netbk_count_skb_slots() count the actual number of slots
    that will be used, taking into consideration packing and other
    issues that the transmit path will run into.  From David Vrabel.

15) Use the correct maximum age when calculating the bridge
    message_age_timer, from Chris Healy.

16) Get rid of memory leaks in mcs7780 IRDA driver, from Alexey
    Khoroshilov.

17) Netfilter conntrack extensions were converted to RCU but are not
    always freed properly using kfree_rcu().  Fix from Michal Kubecek.

18) VF reset recovery not being done correctly in qlcnic driver, from
    Manish Chopra.

19) Fix inverted test in ATM nicstar driver, from Andy Shevchenko.

20) Missing workqueue destroy in cxgb4 error handling, from Wei Yang.

21) Internal switch not initialized properly in bgmac driver, from Rafał
    Miłecki.

22) Netlink messages report wrong local and remote addresses in IPv6
    tunneling, from Ding Zhi.

23) ICMP redirects should not generate socket errors in DCCP and SCTP.
    We're still working out how this should be handled for RAW and UDP
    sockets.  From Daniel Borkmann and Duan Jiong.

24) We've had several bugs wherein the network namespace's loopback
    device gets accessed after it is free'd, NULL it out so that we can
    catch these problems more readily.  From Eric W Biederman.

25) Fix regression in TCP RTO calculations, from Neal Cardwell.

26) Fix too early free of xen-netback network device when VIFs still
    exist.  From Paul Durrant.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  netconsole: fix a deadlock with rtnl and netconsole's mutex
  netpoll: fix NULL pointer dereference in netpoll_cleanup
  skge: fix broken driver
  ip: generate unique IP identificator if local fragmentation is allowed
  ip: use ip_hdr() in __ip_make_skb() to retrieve IP header
  xen-netback: Don't destroy the netdev until the vif is shut down
  net:dccp: do not report ICMP redirects to user space
  cnic: Fix crash in cnic_bnx2x_service_kcq()
  bnx2x, cnic, bnx2i, bnx2fc: Fix bnx2i and bnx2fc regressions.
  vxlan: Avoid creating fdb entry with NULL destination
  tcp: fix RTO calculated from cached RTT
  drivers: net: phy: cicada.c: clears warning Use #include <linux/io.h> instead of <asm/io.h>
  net loopback: Set loopback_dev to NULL when freed
  batman-adv: set the TAG flag for the vid passed to BLA
  netfilter: nfnetlink_queue: use network skb for sequence adjustment
  net: sctp: rfc4443: do not report ICMP redirects to user space
  net: usb: cdc_ether: use usb.h macros whenever possible
  net: usb: cdc_ether: fix checkpatch errors and warnings
  net: usb: cdc_ether: Use wwan interface for Telit modules
  ip6_tunnels: raddr and laddr are inverted in nl msg
  ...
2013-09-19 13:57:28 -05:00
Ansis Atteka
703133de33 ip: generate unique IP identificator if local fragmentation is allowed
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.

For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-19 14:11:15 -04:00
Ansis Atteka
749154aa56 ip: use ip_hdr() in __ip_make_skb() to retrieve IP header
skb->data already points to IP header, but for the sake of
consistency we can also use ip_hdr() to retrieve it.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-19 14:11:15 -04:00
Neal Cardwell
269aa759b4 tcp: fix RTO calculated from cached RTT
Commit 1b7fdd2ab5 ("tcp: do not use cached RTT for RTT estimation")
did not correctly account for the fact that crtt is the RTT shifted
left 3 bits. Fix the calculation to consistently reflect this fact.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-17 19:08:08 -04:00
Steffen Klassert
bafd4bd4dc xfrm: Decode sessions with output interface.
The output interface matching does not work on forward
policy lookups, the output interface of the flowi is
always 0. Fix this by setting the output interface when
we decode the session.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-09-16 09:39:43 +02:00
Sha Zhengju
6de5a8bfca memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Eric Dumazet
4e4f1fc226 tcp: properly increase rcv_ssthresh for ofo packets
TCP receive window handling is multi staged.

A socket has a memory budget, static or dynamic, in sk_rcvbuf.

Because we do not really know how this memory budget translates to
a TCP window (payload), TCP announces a small initial window
(about 20 MSS).

When a packet is received, we increase TCP rcv_win depending
on the payload/truesize ratio of this packet. Good citizen
packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2

This heuristic takes place in tcp_grow_window()

Problem is : We currently call tcp_grow_window() only for in-order
packets.

This means that reorders or packet losses stop proper grow of
rcv_win, and senders are unable to benefit from fast recovery,
or proper reordering level detection.

Really, a packet being stored in OFO queue is not a bad citizen.
It should be part of the game as in-order packets.

In our traces, we very often see sender is limited by linux small
receive windows, even if linux hosts use autotuning (DRS) and should
allow rcv_win to grow to ~3MB.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-06 14:43:49 -04:00
Yuchung Cheng
16edfe7ee0 tcp: fix no cwnd growth after timeout
In commit 0f7cc9a3 "tcp: increase throughput when reordering is high",
it only allows cwnd to increase in Open state. This mistakenly disables
slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
state moves from Disorder to Open later in tcp_fastretrans_alert().

Therefore the correct logic should be to allow cwnd to grow as long
as the data is received in order in Open, Loss, or even Disorder state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-06 14:43:49 -04:00
Linus Torvalds
cc998ff881 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
 "Noteworthy changes this time around:

   1) Multicast rejoin support for team driver, from Jiri Pirko.

   2) Centralize and simplify TCP RTT measurement handling in order to
      reduce the impact of bad RTO seeding from SYN/ACKs.  Also, when
      both timestamps and local RTT measurements are available prefer
      the later because there are broken middleware devices which
      scramble the timestamp.

      From Yuchung Cheng.

   3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
      memory consumed to queue up unsend user data.  From Eric Dumazet.

   4) Add a "physical port ID" abstraction for network devices, from
      Jiri Pirko.

   5) Add a "suppress" operation to influence fib_rules lookups, from
      Stefan Tomanek.

   6) Add a networking development FAQ, from Paul Gortmaker.

   7) Extend the information provided by tcp_probe and add ipv6 support,
      from Daniel Borkmann.

   8) Use RCU locking more extensively in openvswitch data paths, from
      Pravin B Shelar.

   9) Add SCTP support to openvswitch, from Joe Stringer.

  10) Add EF10 chip support to SFC driver, from Ben Hutchings.

  11) Add new SYNPROXY netfilter target, from Patrick McHardy.

  12) Compute a rate approximation for sending in TCP sockets, and use
      this to more intelligently coalesce TSO frames.  Furthermore, add
      a new packet scheduler which takes advantage of this estimate when
      available.  From Eric Dumazet.

  13) Allow AF_PACKET fanouts with random selection, from Daniel
      Borkmann.

  14) Add ipv6 support to vxlan driver, from Cong Wang"

Resolved conflicts as per discussion.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
  openvswitch: Fix alignment of struct sw_flow_key.
  netfilter: Fix build errors with xt_socket.c
  tcp: Add missing braces to do_tcp_setsockopt
  caif: Add missing braces to multiline if in cfctrl_linkup_request
  bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
  vxlan: Fix kernel panic on device delete.
  net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
  net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
  icplus: Use netif_running to determine device state
  ethernet/arc/arc_emac: Fix huge delays in large file copies
  tuntap: orphan frags before trying to set tx timestamp
  tuntap: purge socket error queue on detach
  qlcnic: use standard NAPI weights
  ipv6:introduce function to find route for redirect
  bnx2x: VF RSS support - VF side
  bnx2x: VF RSS support - PF side
  vxlan: Notify drivers for listening UDP port changes
  net: usbnet: update addr_assign_type if appropriate
  driver/net: enic: update enic maintainers and driver
  driver/net: enic: Exposing symbols for Cisco's low latency driver
  ...
2013-09-05 14:54:29 -07:00
David S. Miller
06c54055be Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
	net/bridge/br_multicast.c
	net/ipv6/sit.c

The conflicts were minor:

1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

2) br_multicast.c had an overlap between computing max_delay using
   msecs_to_jiffies and turning MLDV2_MRC() into an inline function
   with a name using lowercase instead of uppercase letters.

3) stmmac had two overlapping changes, one which conditionally allocated
   and hooked up a dma_cfg based upon the presence of the pbl OF property,
   and another one handling store-and-forward DMA made.  The latter of
   which should not go into the new of_find_property() basic block.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 14:58:52 -04:00
Dave Jones
e2e5c4c07c tcp: Add missing braces to do_tcp_setsockopt
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 14:31:02 -04:00
Linus Torvalds
27703bb4a6 PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
 
 This has been sitting in linux-next for a whole cycle.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
 hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
 97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
 NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
 XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
 +0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
 L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
 e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
 cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
 SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
 GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
 BiHX6YFWtDDqRlVv8Q0F
 =LeaW
 -----END PGP SIGNATURE-----

Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull PTR_RET() removal patches from Rusty Russell:
 "PTR_RET() is a weird name, and led to some confusing usage.  We ended
  up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.

  This has been sitting in linux-next for a whole cycle"

[ There are still some PTR_RET users scattered about, with some of them
  possibly being new, but most of them existing in Rusty's tree too.  We
  have that

      #define PTR_RET(p) PTR_ERR_OR_ZERO(p)

  thing in <linux/err.h>, so they continue to work for now  - Linus ]

* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
  Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
  drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
  sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
  dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
  drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
  mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
  staging/zcache: don't use PTR_RET().
  remoteproc: don't use PTR_RET().
  pinctrl: don't use PTR_RET().
  acpi: Replace weird use of PTR_RET.
  s390: Replace weird use of PTR_RET.
  PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
  PTR_RET is now PTR_ERR_OR_ZERO
2013-09-04 17:31:11 -07:00
Yuchung Cheng
52f20e655d tcp: better comments for RTO initiallization
Commit 1b7fdd2ab585("tcp: do not use cached RTT for RTT estimation")
removes important comments on how RTO is initialized and updated.
Hopefully this patch puts those information back.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 14:41:55 -04:00
David S. Miller
48f8e0af86 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
The following batch contains:

* Three fixes for the new synproxy target available in your
  net-next tree, from Jesper D. Brouer and Patrick McHardy.

* One fix for TCPMSS to correctly handling the fragmentation
  case, from Phil Oester. I'll pass this one to -stable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 12:28:02 -04:00
Jesper Dangaard Brouer
7cc9eb6ef7 netfilter: SYNPROXY: let unrelated packets continue
Packets reaching SYNPROXY were default dropped, as they were most
likely invalid (given the recommended state matching).  This
patch, changes SYNPROXY target to let packets, not consumed,
continue being processed by the stack.

This will be more in line other target modules. As it will allow
more flexible configurations of handling, logging or matching on
packets in INVALID states.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-09-04 11:44:23 +02:00
Jesper Dangaard Brouer
775ada6d9f netfilter: more strict TCP flag matching in SYNPROXY
Its seems Patrick missed to incoorporate some of my requested changes
during review v2 of SYNPROXY netfilter module.

Which were, to avoid SYN+ACK packets to enter the path, meant for the
ACK packet from the client (from the 3WHS).

Further there were a bug in ip6t_SYNPROXY.c, for matching SYN packets
that didn't exclude the ACK flag.

Go a step further with SYN packet/flag matching by excluding flags
ACK+FIN+RST, in both IPv4 and IPv6 modules.

The intented usage of SYNPROXY is as follows:
(gracefully describing usage in commit)

 iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
 iptables -A INPUT -i eth0 -p tcp --dport 80 -m state UNTRACKED,INVALID \
         -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

 echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

This does filter SYN flags early, for packets in the UNTRACKED state,
but packets in the INVALID state with other TCP flags could still
reach the module, thus this stricter flag matching is still needed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-09-04 11:43:11 +02:00
Vijay Subramanian
c995ae2259 tcp: Change return value of tcp_rcv_established()
tcp_rcv_established() returns only one value namely 0. We change the return
value to void (as suggested by David Miller).

After commit 0c24604b (tcp: implement RFC 5961 4.2), we no longer send RSTs in
response to SYNs. We can remove the check and processing on the return value of
tcp_rcv_established().

We also fix jtcp_rcv_established() in tcp_probe.c to match that of
tcp_rcv_established().

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:28 -04:00
Daniel Borkmann
cc8c6c1b21 net: tcp_probe: adapt tbuf size for recent changes
With recent changes in tcp_probe module (e.g. f925d0a62d ("net: tcp_probe:
add IPv6 support")) we also need to take into account that tbuf needs to
be updated as format string will be further expanded. tbuf sits on the stack
in tcpprobe_read() function that is invoked when user space reads procfs
file /proc/net/tcpprobe, hence not fast path as in jtcp_rcv_established().
Having a size similarly as in sctp_probe module of 256 bytes is fully
sufficient for that, we need theoretical maximum of 252 bytes otherwise we
could get truncated.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:28 -04:00
Nicolas Dichtel
ea23192e8e tunnels: harmonize cleanup done on skb on rx path
The goal of this patch is to harmonize cleanup done on a skbuff on rx path.
Before this patch, behaviors were different depending of the tunnel type.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:26 -04:00
Nicolas Dichtel
963a88b31d tunnels: harmonize cleanup done on skb on xmit path
The goal of this patch is to harmonize cleanup done on a skbuff on xmit path.
Before this patch, behaviors were different depending of the tunnel type.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Nicolas Dichtel
8b27f27797 skb: allow skb_scrub_packet() to be used by tunnels
This function was only used when a packet was sent to another netns. Now, it can
also be used after tunnel encapsulation or decapsulation.

Only skb_orphan() should not be done when a packet is not crossing netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Nicolas Dichtel
8b7ed2d91d iptunnels: remove net arg from iptunnel_xmit()
This argument is not used, let's remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Tim Gardner
3e25c65ed0 net: neighbour: Remove CONFIG_ARPD
This config option is superfluous in that it only guards a call
to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
change in behavior. There will now be call to __neigh_notify()
for each ARP resolution, which has no impact unless there is a
user space daemon waiting to receive the notification, i.e.,
the case for which CONFIG_ARPD was designed anyways.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Joe Perches <joe@perches.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 21:41:43 -04:00
Linus Torvalds
32dad03d16 Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot of activities on the cgroup front.  Most changes aren't visible
  to userland at all at this point and are laying foundation for the
  planned unified hierarchy.

   - The biggest change is decoupling the lifetime management of css
     (cgroup_subsys_state) from that of cgroup's.  Because controllers
     (cpu, memory, block and so on) will need to be dynamically enabled
     and disabled, css which is the association point between a cgroup
     and a controller may come and go dynamically across the lifetime of
     a cgroup.  Till now, css's were created when the associated cgroup
     was created and stayed till the cgroup got destroyed.

     Assumptions around this tight coupling permeated through cgroup
     core and controllers.  These assumptions are gradually removed,
     which consists bulk of patches, and css destruction path is
     completely decoupled from cgroup destruction path.  Note that
     decoupling of creation path is relatively easy on top of these
     changes and the patchset is pending for the next window.

   - cgroup has its own event mechanism cgroup.event_control, which is
     only used by memcg.  It is overly complex trying to achieve high
     flexibility whose benefits seem dubious at best.  Going forward,
     new events will simply generate file modified event and the
     existing mechanism is being made specific to memcg.  This pull
     request contains prepatory patches for such change.

   - Various fixes and cleanups"

Fixed up conflict in kernel/cgroup.c as per Tejun.

* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
  cgroup: fix cgroup_css() invocation in css_from_id()
  cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
  cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
  cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
  cgroup: fix cgroup_write_event_control()
  cgroup: fix subsystem file accesses on the root cgroup
  cgroup: change cgroup_from_id() to css_from_id()
  cgroup: use css_get() in cgroup_create() to check CSS_ROOT
  cpuset: remove an unncessary forward declaration
  cgroup: RCU protect each cgroup_subsys_state release
  cgroup: move subsys file removal to kill_css()
  cgroup: factor out kill_css()
  cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
  cgroup: replace cgroup->css_kill_cnt with ->nr_css
  cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
  cgroup: move cgroup->subsys[] assignment to online_css()
  cgroup: reorganize css init / exit paths
  cgroup: add __rcu modifier to cgroup->subsys[]
  ...
2013-09-03 18:25:03 -07:00
Cong Wang
5a17a390de net: make snmp_mib_free static inline
Fengguang reported:

   net/built-in.o: In function `in6_dev_finish_destroy':
   (.text+0x4ca7d): undefined reference to `snmp_mib_free'

this is due to snmp_mib_free() is defined when CONFIG_INET is enabled,
but in6_dev_finish_destroy() is now moved to core kernel.

I think snmp_mib_free() is small enough to be inlined, so just make it
static inline.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-02 21:00:50 -07:00
Cong Wang
eb3c0d83cc net: unify skb_udp_tunnel_segment() and skb_udp6_tunnel_segment()
As suggested by Pravin, we can unify the code in case of duplicated
code.

Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 22:30:01 -04:00
Li Hongjun
737e828bdb ipv4 tunnels: fix an oops when using ipip/sit with IPsec
Since commit 3d7b46cd20 (ip_tunnel: push generic protocol handling to
ip_tunnel module.), an Oops is triggered when an xfrm policy is configured on
an IPv4 over IPv4 tunnel.

xfrm4_policy_check() calls __xfrm_policy_check2(), which uses skb_dst(skb). But
this field is NULL because iptunnel_pull_header() calls skb_dst_drop(skb).

Signed-off-by: Li Hongjun <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30 17:13:28 -04:00
Phil Oester
eb8895debe tcp: tcp_make_synack() should use sock_wmalloc
In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed
the call to sock_wmalloc in tcp_make_synack to alloc_skb.  In doing so,
the netfilter owner match lost its ability to block the SYNACK packet on
outbound listening sockets.  Revert the change, restoring the owner match
functionality.

This closes netfilter bugzilla #847.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30 16:02:04 -04:00
Yuchung Cheng
1b7fdd2ab5 tcp: do not use cached RTT for RTT estimation
RTT cached in the TCP metrics are valuable for the initial timeout
because SYN RTT usually does not account for serialization delays
on low BW path.

However using it to seed the RTT estimator maybe disruptive because
other components (e.g., pacing) require the smooth RTT to be obtained
from actual connection.

The solution is to use the higher cached RTT to set the first RTO
conservatively like tcp_rtt_estimator(), but avoid seeding the other
RTT estimator variables such as srtt.  It is also a good idea to
keep RTO conservative to obtain the first RTT sample, and the
performance is insured by TCP loss probe if SYN RTT is available.

To keep the seeding formula consistent across SYN RTT and cached RTT,
the rttvar is twice the cached RTT instead of cached RTTVAR value. The
reason is because cached variation may be too small (near min RTO)
which defeats the purpose of being conservative on first RTO. However
the metrics still keep the RTT variations as they might be useful for
user applications (through ip).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30 15:14:38 -04:00
David S. Miller
79f9ab7e0a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
This pull request fixes some issues that arise when 6in4 or 4in6 tunnels
are used in combination with IPsec, all from Hannes Frederic Sowa and a
null pointer dereference when queueing packets to the policy hold queue.

1) We might access the local error handler of the wrong address family if
   6in4 or 4in6 tunnel is protected by ipsec. Fix this by addind a pointer
   to the correct local_error to xfrm_state_afinet.

2) Add a helper function to always refer to the correct interpretation
   of skb->sk.

3) Call skb_reset_inner_headers to record the position of the inner headers
   when adding a new one in various ipv6 tunnels. This is needed to identify
   the addresses where to send back errors in the xfrm layer.

4) Dereference inner ipv6 header if encapsulated to always call the
   right error handler.

5) Choose protocol family by skb protocol to not call the wrong
   xfrm{4,6}_local_error handler in case an ipv6 sockets is used
   in ipv4 mode.

6) Partly revert "xfrm: introduce helper for safe determination of mtu"
   because this introduced pmtu discovery problems.

7) Set skb->protocol on tcp, raw and ip6_append_data genereated skbs.
   We need this to get the correct mtu informations in xfrm.

8) Fix null pointer dereference in xdst_queue_output.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:05:30 -04:00
Chris Clark
c27c9322d0 ipv4: sendto/hdrincl: don't use destination address found in header
ipv4: raw_sendmsg: don't use header's destination address

A sendto() regression was bisected and found to start with commit
f8126f1d51 (ipv4: Adjust semantics of rt->rt_gateway.)

The problem is that it tries to ARP-lookup the constructed packet's
destination address rather than the explicitly provided address.

Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.

cf. commit 2ad5b9e4bd

Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:57:52 -04:00
Eric Dumazet
95bd09eb27 tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt

v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:50:06 -04:00
Andrew Vagin
e3e1202831 tcp: don't apply tsoffset if rcv_tsecr is zero
The zero value means that tsecr is not valid, so it's a special case.

tsoffset is used to customize tcp_time_stamp for one socket.
tsoffset is usually zero, it's used when a socket was moved from one
host to another host.

Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
TCP_RTO_MAX.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:11:12 -04:00
Andrew Vagin
c7781a6e3c tcp: initialize rcv_tstamp for restored sockets
u32 rcv_tstamp;     /* timestamp of last received ACK */

Its value used in tcp_retransmit_timer, which closes socket
if the last ack was received more then TCP_RTO_MAX ago.

Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer
is called before receiving a first ack, the connection is closed.

This patch initializes rcv_tstamp to a timestamp, when a socket was
restored.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:11:11 -04:00
Fan Du
aba8269588 {ipv4,xfrm}: Introduce xfrm_tunnel_notifier for xfrm tunnel mode callback
Some thoughts on IPv4 VTI implementation:

The connection between VTI receiving part and xfrm tunnel mode input process
is hardly a "xfrm_tunnel", xfrm_tunnel is used in places where, e.g ipip/sit
and xfrm4_tunnel, acts like a true "tunnel" device.

In addition, IMHO, VTI doesn't need vti_err to do something meaningful, as all
VTI needs is just a notifier to be called whenever xfrm_input ingress a packet
to update statistics.

A IPsec protected packet is first handled by protocol handlers, e.g AH/ESP,
to check packet authentication or encryption rightness. PMTU update is taken
care of in this stage by protocol error handler.

Then the packet is rearranged properly depending on whether it's transport
mode or tunnel mode packed by mode "input" handler. The VTI handler code
takes effects in this stage in tunnel mode only. So it neither need propagate
PMTU, as it has already been done if necessary, nor the VTI handler is
qualified as a xfrm_tunnel.

So this patch introduces xfrm_tunnel_notifier and meanwhile wipe out vti_err
code.

Signed-off-by: Fan Du <fan.du@windriver.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-28 09:22:17 +02:00
Patrick McHardy
48b1de4c11 netfilter: add SYNPROXY core/target
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:27:54 +02:00
Patrick McHardy
0198230b77 net: syncookies: export cookie_v4_init_sequence/cookie_v4_check
Extract the local TCP stack independant parts of tcp_v4_init_sequence()
and cookie_v4_check() and export them for use by the upcoming SYNPROXY
target.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:27:44 +02:00
Patrick McHardy
41d73ec053 netfilter: nf_conntrack: make sequence number adjustments usuable without NAT
Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:26:48 +02:00
Phil Oester
affe759dba netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged
As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
with the tcp-reset option sends out reset packets with the src MAC address
of the local bridge interface, instead of the MAC address of the intended
destination.  This causes some routers/firewalls to drop the reset packet
as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
setting the MAC of the sender in the tcp reset packet.

This closes netfilter bugzilla #531.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:13:12 +02:00
Daniel Borkmann
b1dcdc68b1 net: tcp_probe: allow more advanced ingress filtering by mark
Currently, the tcp_probe snooper can either filter packets by a given
port (handed to the module via module parameter e.g. port=80) or lets
all TCP traffic pass (port=0, default). When a port is specified, the
port number is tested against the sk's source/destination port. Thus,
if one of them matches, the information will be further processed for
the log.

As this is quite limited, allow for more advanced filtering possibilities
which can facilitate debugging/analysis with the help of the tcp_probe
snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e
("pkt_sched: ingress socket filter by mark"), add the possibility to
use skb->mark as a filter.

If the mark is not being used otherwise, this allows ingress filtering
by flow (e.g. in order to track updates from only a single flow, or a
subset of all flows for a given port) and other things such as dynamic
logging and reconfiguration without removing/re-inserting the tcp_probe
module, etc. Simple example:

  insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
  ...
  iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
           --sport 60952 -j MARK --set-mark 8888
  [... sampling interval ...]
  iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
           --sport 60952 -j MARK --set-mark 8888

The current option to filter by a given port is still being preserved. A
similar approach could be done for the sctp_probe module as a follow-up.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 15:53:34 -04:00
David S. Miller
b05930f5d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/iwlwifi/pcie/trans.c
	include/linux/inetdevice.h

The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.

The iwlwifi conflict is a context overlap.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-26 16:37:08 -04:00
Hannes Frederic Sowa
5a25cf1e31 xfrm: revert ipv4 mtu determination to dst_mtu
In commit 0ea9d5e3e0 ("xfrm: introduce
helper for safe determination of mtu") I switched the determination of
ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in
case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is
never correct for ipv4 ipsec.

This patch partly reverts 0ea9d5e3e0
("xfrm: introduce helper for safe determination of mtu").

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-26 12:40:53 +02:00
Dan Carpenter
b4de77ade3 ipip: potential race in ip_tunnel_init_net()
Eric Dumazet says that my previous fix for an ERR_PTR dereference
(ea857f28ab 'ipip: dereferencing an ERR_PTR in ip_tunnel_init_net()')
could be racy and suggests the following fix instead.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-25 18:39:59 -04:00
Daniel Borkmann
f925d0a62d net: tcp_probe: add IPv6 support
The tcp_probe currently only supports analysis of IPv4 connections.
Therefore, it would be nice to have IPv6 supported as well. Since we
have the recently added %pISpc specifier that is IPv4/IPv6 generic,
build related sockaddress structures from the flow information and
pass this to our format string. Tested with SSH and HTTP sessions
on IPv4 and IPv6.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22 16:19:50 -07:00
Daniel Borkmann
d8cdeda6dd net: tcp_probe: kprobes: adapt jtcp_rcv_established signature
This patches fixes a rather unproblematic function signature mismatch
as the const specifier was missing for the th variable; and next to
that it adds a build-time assertion so that future function signature
mismatches for kprobes will not end badly, similarly as commit 22222997
("net: sctp: add build check for sctp_sf_eat_sack_6_2/jsctp_sf_eat_sack")
did it for SCTP.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22 16:19:50 -07:00
Daniel Borkmann
b4c1c1d038 net: tcp_probe: also include rcv_wnd next to snd_wnd
It is helpful to sometimes know the TCP window sizes of an established
socket e.g. to confirm that window scaling is working or to tweak the
window size to improve high-latency connections, etc etc. Currently the
TCP snooper only exports the send window size, but not the receive window
size. Therefore, also add the receive window size to the end of the
output line.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22 16:19:50 -07:00
Yuchung Cheng
0f7cc9a3c2 tcp: increase throughput when reordering is high
The stack currently detects reordering and avoid spurious
retransmission very well. However the throughput is sub-optimal under
high reordering because cwnd is increased only if the data is deliverd
in order. I.e., FLAG_DATA_ACKED check in tcp_ack().  The more packet
are reordered the worse the throughput is.

Therefore when reordering is proven high, cwnd should advance whenever
the data is delivered regardless of its ordering. If reordering is low,
conservatively advance cwnd only on ordered deliveries in Open state,
and retain cwnd in Disordered state (RFC5681).

Using netperf on a qdisc setup of 20Mbps BW and random RTT from 45ms
to 55ms (for reordering effect). This change increases TCP throughput
by 20 - 25% to near bottleneck BW.

A special case is the stretched ACK with new SACK and/or ECE mark.
For example, a receiver may receive an out of order or ECN packet with
unacked data buffered because of LRO or delayed ACK. The principle on
such an ACK is to advance cwnd on the cummulative acked part first,
then reduce cwnd in tcp_fastretrans_alert().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22 14:39:46 -07:00
Dan Carpenter
ea857f28ab ipip: dereferencing an ERR_PTR in ip_tunnel_init_net()
We need to move the derefernce after the IS_ERR() check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:12:15 -07:00
Eric Dumazet
734d2725db ipv4: raise IP_MAX_MTU to theoretical limit
As discussed last year [1], there is no compelling reason
to limit IPv4 MTU to 0xFFF0, while real limit is 0xFFFF

[1] : http://marc.info/?l=linux-netdev&m=135607247609434&w=2

Willem raised this issue again because some of our internal
regression tests broke after lo mtu being set to 65536.

IP_MTU reports 0xFFF0, and the test attempts to send a RAW datagram of
mtu + 1 bytes, expecting the send() to fail, but it does not.

Alexey raised interesting points about TCP MSS, that should be addressed
in follow-up patches in TCP stack if needed, as someone could also set
an odd mtu anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:04 -07:00
Christoph Paasch
397b417463 tcp: trivial: Remove nocache argument from tcp_v4_send_synack
The nocache-argument was used in tcp_v4_send_synack as an argument to
inet_csk_route_req. However, since ba3f7f04ef (ipv4: Kill
FLOWI_FLAG_RT_NOCACHE and associated code.) this is no more used.

This patch removes the unsued argument from tcp_v4_send_synack.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:04 -07:00
David S. Miller
89d5e23210 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Conflicts:
	net/netfilter/nf_conntrack_proto_tcp.c

The conflict had to do with overlapping changes dealing with
fixing the use of an "s32" to hold the value returned by
NAT_OFFSET().

Pablo Neira Ayuso says:

====================
The following batch contains Netfilter/IPVS updates for your net-next tree.
More specifically, they are:

* Trivial typo fix in xt_addrtype, from Phil Oester.

* Remove net_ratelimit in the conntrack logging for consistency with other
  logging subsystem, from Patrick McHardy.

* Remove unneeded includes from the recently added xt_connlabel support, from
  Florian Westphal.

* Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for
  this, from Florian Westphal.

* Remove tproxy core, now that we have socket early demux, from Florian
  Westphal.

* A couple of patches to refactor conntrack event reporting to save a good
  bunch of lines, from Florian Westphal.

* Fix missing locking in NAT sequence adjustment, it did not manifested in
  any known bug so far, from Patrick McHardy.

* Change sequence number adjustment variable to 32 bits, to delay the
  possible early overflow in long standing connections, also from Patrick.

* Comestic cleanups for IPVS, from Dragos Foianu.

* Fix possible null dereference in IPVS in the SH scheduler, from Daniel
  Borkmann.

* Allow to attach conntrack expectations via nfqueue. Before this patch, you
  had to use ctnetlink instead, thus, we save the conntrack lookup.

* Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:30:54 -07:00
Andrey Vagin
7ed5c5ae96 tcp: set timestamps for restored skb-s
When the repair mode is turned off, the write queue seqs are
updated so that the whole queue is considered to be 'already sent.

The "when" field must be set for such skb. It's used in tcp_rearm_rto
for example. If the "when" field isn't set, the retransmit timeout can
be calculated incorrectly and a tcp connected can stop for two minutes
(TCP_RTO_MAX).

Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:07:15 -07:00
David S. Miller
2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Francesco Fusco
d14c5ab6be net: proc_fs: trivial: print UIDs as unsigned int
UIDs are printed in the proc_fs as signed int, whereas
they are unsigned int.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 14:37:46 -07:00
Nicolas Dichtel
6c742e714d ipip: add x-netns support
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Nicolas Dichtel
fc8f999daa ipv4 tunnels: use net_eq() helper to check netns
It's better to use available helpers for these tests.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Nicolas Dichtel
64261f230a dev: move skb_scrub_packet() after eth_type_trans()
skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans()
set pkt_type.

In fact, we should force pkt_type to PACKET_HOST, so move the call after
eth_type_trans().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Hannes Frederic Sowa
0ea9d5e3e0 xfrm: introduce helper for safe determination of mtu
skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we
always have to make sure we a referring to the correct interpretation
of skb->sk.

We only depend on header defines to query the mtu, so we don't introduce
a new dependency to ipv6 by this change.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-14 13:09:07 +02:00
Hannes Frederic Sowa
628e341f31 xfrm: make local error reporting more robust
In xfrm4 and xfrm6 we need to take care about sockets of the other
address family. This could happen because a 6in4 or 4in6 tunnel could
get protected by ipsec.

Because we don't want to have a run-time dependency on ipv6 when only
using ipv4 xfrm we have to embed a pointer to the correct local_error
function in xfrm_state_afinet and look it up when returning an error
depending on the socket address family.

Thanks to vi0ss for the great bug report:
<https://bugzilla.kernel.org/show_bug.cgi?id=58691>

v2:
a) fix two more unsafe interpretations of skb->sk as ipv6 socket
   (xfrm6_local_dontfrag and __xfrm6_output)
v3:
a) add an EXPORT_SYMBOL_GPL(xfrm_local_error) to fix a link error when
   building ipv6 as a module (thanks to Steffen Klassert)

Reported-by: <vi0oss@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-14 13:07:12 +02:00
Pravin B Shelar
4221f40513 ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id.
Using inner-id for tunnel id is not safe in some rare cases.
E.g. packets coming from multiple sources entering same tunnel
can have same id. Therefore on tunnel packet receive we
could have packets from two different stream but with same
source and dst IP with same ip-id which could confuse ip packet
reassembly.

Following patch reverts optimization from commit
490ab08127 (IP_GRE: Fix IP-Identification.)

CC: Jarno Rajahalme <jrajahalme@nicira.com>
CC: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13 16:52:50 -07:00
Yuchung Cheng
74c181d528 tcp: reset reordering est. selectively on timeout
On timeout the TCP sender unconditionally resets the estimated degree
of network reordering (tp->reordering). The idea behind this is that
the estimate is too large to trigger fast recovery (e.g., due to a IP
path change).

But for example if the sender only had 2 packets outstanding, then a
timeout doesn't tell much about reordering. A sender that learns about
reordering on big writes and loses packets on small writes will end up
falsely retransmitting again and again, especially when reordering is
more likely on big writes.

Therefore the sender should only suspect that tp->reordering is too
high if it could have gone into fast recovery with the (lower) default
estimate.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13 16:08:33 -07:00
Yuchung Cheng
149479d019 tcp: add server ip to encrypt cookie in fast open
Encrypt the cookie with both server and client IPv4 addresses,
such that multi-homed server will grant different cookies
based on both the source and destination IPs. No client change
is needed since cookie is opaque to the client.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10 00:35:33 -07:00
Eliezer Tamir
288a937637 net: rename busy poll MIB counter
Rename mib counter from "low latency" to "busy poll"

v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer)
Eric Dumazet suggested that the current location is better.

So v2 just renames the counter to fit the new naming convention.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:39:08 -07:00
William Manley
2690048c01 net: igmp: Allow user-space configuration of igmp unsolicited report interval
Adds the new procfs knobs:

    /proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval
    /proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval

Which will allow userspace configuration of the IGMP unsolicited report
interval (see below) in milliseconds.  The defaults are 10000ms for IGMPv2
and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376.

Background:

If an IGMP join packet is lost you will not receive data sent to the
multicast group so if no data arrives from that multicast group in a
period of time after the IGMP join a second IGMP join will be sent.  The
delay between joins is the "IGMP Unsolicited Report Interval".

Prior to this patch this value was hard coded in the kernel to 10s for
IGMPv2 and 1s for IGMPv3.  10s is unsuitable for some use-cases, such as
IPTV as it can cause channel change to be slow in the presence of packet
loss.

This patch allows the value to be overridden from userspace for both
IGMPv2 and IGMPv3 such that it can be tuned accoding to the network.

Tested with Wireshark and a simple program to join a (non-existent)
multicast group.  The distribution of timings for the second join differ
based upon setting the procfs knobs.

igmpvX_unsolicited_report_interval is intended to follow the pattern
established by force_igmp_version, and while a procfs entry has been added
a corresponding sysctl knob has not as it is my understanding that sysctl
is deprecated[1].

[1]: http://lwn.net/Articles/247243/

Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:27:46 -07:00
William Manley
5c6fe01c1f net: igmp: Don't flush routing cache when force_igmp_version is modified
The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the
IGMP protocol version to use to be explicitly set.  As a side effect this
caused the routing cache to be flushed as it was declared as a
DEVINET_SYSCTL_FLUSHING_ENTRY.  Flushing is unnecessary and this patch
makes it so flushing does not occur.

Requested by Hannes Frederic Sowa as he was reviewing other patches
adding procfs entries.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:27:46 -07:00
William Manley
cab70040df net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3
If an IGMP join packet is lost you will not receive data sent to the
multicast group so if no data arrives from that multicast group in a
period of time after the IGMP join a second IGMP join will be sent.  The
delay between joins is the "IGMP Unsolicited Report Interval".

Previously this value was hard coded to be chosen randomly between 0-10s.
This can be too long for some use-cases, such as IPTV as it can cause
channel change to be slow in the presence of packet loss.

The value 10s has come from IGMPv2 RFC2236, which was reduced to 1s in
IGMPv3 RFC3376.  This patch makes the kernel use the 1s value from the
later RFC if we are operating in IGMPv3 mode.  IGMPv2 behaviour is
unaffected.

Tested with Wireshark and a simple program to join a (non-existent)
multicast group.  The distribution of timings for the second join differ
based upon setting /proc/sys/net/ipv4/conf/eth0/force_igmp_version.

Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:27:46 -07:00
Timo Teräs
77a482bdb2 ip_gre: fix ipgre_header to return correct offset
Fix ipgre_header() (header_ops->create) to return the correct
amount of bytes pushed. Most callers of dev_hard_header() seem
to care only if it was success, but af_packet.c uses it as
offset to the skb to copy from userspace only once. In practice
this fixes packet socket sendto()/sendmsg() to gre tunnels.

Regression introduced in c544193214
("GRE: Refactor GRE tunneling code.")

Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:05:24 -07:00
Florian Westphal
c655bc6896 netfilter: nf_conntrack: don't send destroy events from iterator
Let nf_ct_delete handle delivery of the DESTROY event.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-09 12:03:33 +02:00
Eric Dumazet
1f07d03e20 net: add SNMP counters tracking incoming ECN bits
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
counters do not count the number of frames, but number of aggregated
segments.

Its probably too late to change this now.

This patch adds four new counters, tracking number of frames, regardless
of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

Ip[6]NoECTPkts : Number of packets received with NOECT
Ip[6]ECT1Pkts  : Number of packets received with ECT(1)
Ip[6]ECT0Pkts  : Number of packets received with ECT(0)
Ip[6]CEPkts    : Number of packets received with Congestion Experienced

lph37:~# nstat | egrep "Pkts|InReceive"
IpInReceives                    1634137            0.0
Ip6InReceives                   3714107            0.0
Ip6InNoECTPkts                  19205              0.0
Ip6InECT0Pkts                   52651828           0.0
IpExtInNoECTPkts                33630              0.0
IpExtInECT0Pkts                 15581379           0.0
IpExtInCEPkts                   6                  0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08 22:24:59 -07:00
Tejun Heo
182446d087 cgroup: pass around cgroup_subsys_state instead of cgroup in file methods
cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.

This patch converts all cftype file operations to take @css instead of
@cgroup.  cftypes for the cgroup core files don't have their subsytem
pointer set.  These will automatically use the dummy_css added by the
previous patch and can be converted the same way.

Most subsystem conversions are straight forwards but there are some
interesting ones.

* freezer: update_if_frozen() is also converted to take @css instead
  of @cgroup for consistency.  This will make the code look simpler
  too once iterators are converted to use css.

* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
  vmpressure while mem_cgroup_from_cont() can be made static.
  Updated accordingly.

* cpu: cgroup_tg() doesn't have any user left.  Removed.

* cpuacct: cgroup_ca() doesn't have any user left.  Removed.

* hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
  Removed.

* net_cls: cgrp_cls_state() doesn't have any user left.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-08 20:11:24 -04:00
stephen hemminger
6261d983f2 ip_tunnel: embed hash list head
The IP tunnel hash heads can be embedded in the per-net structure
since it is a fixed size. Reduce the size so that the total structure
fits in a page size. The original size was overly large, even NETDEV_HASHBITS
is only 8 bits!

Also, add some white space for readability.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Pravin B Shelar <pshelar@nicira.com>.
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:47:52 -07:00
Eric Dumazet
cd6b423afd tcp: cubic: fix bug in bictcp_acked()
While investigating about strange increase of retransmit rates
on hosts ~24 days after boot, Van found hystart was disabled
if ca->epoch_start was 0, as following condition is true
when tcp_time_stamp high order bit is set.

(s32)(tcp_time_stamp - ca->epoch_start) < HZ

Quoting Van :

 At initialization & after every loss ca->epoch_start is set to zero so
 I believe that the above line will turn off hystart as soon as the 2^31
 bit is set in tcp_time_stamp & hystart will stay off for 24 days.
 I think we've observed that cubic's restart is too aggressive without
 hystart so this might account for the higher drop rate we observe.

Diagnosed-by: Van Jacobson <vanj@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 10:35:08 -07:00
Eric Dumazet
2ed0edf909 tcp: cubic: fix overflow error in bictcp_update()
commit 17a6e9f1aa ("tcp_cubic: fix clock dependency") added an
overflow error in bictcp_update() in following code :

/* change the unit from HZ to bictcp_HZ */
t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
      ca->epoch_start) << BICTCP_HZ) / HZ;

Because msecs_to_jiffies() being unsigned long, compiler does
implicit type promotion.

We really want to constrain (tcp_time_stamp - ca->epoch_start)
to a signed 32bit value, or else 't' has unexpected high values.

This bugs triggers an increase of retransmit rates ~24 days after
boot [1], as the high order bit of tcp_time_stamp flips.

[1] for hosts with HZ=1000

Big thanks to Van Jacobson for spotting this problem.

Diagnosed-by: Van Jacobson <vanj@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 10:35:05 -07:00
Eric Dumazet
aab515d7c3 fib_trie: remove potential out of bound access
AddressSanitizer [1] dynamic checker pointed a potential
out of bound access in leaf_walk_rcu()

We could allocate one more slot in tnode_new() to leave the prefetch()
in-place but it looks not worth the pain.

Bug added in commit 82cfbb0085 ("[IPV4] fib_trie: iterator recode")

[1] :
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05 15:26:11 -07:00
Daniel Borkmann
7921895a5e net: esp{4,6}: fix potential MTU calculation overflows
Commit 91657eafb ("xfrm: take net hdr len into account for esp payload
size calculation") introduced a possible interger overflow in
esp{4,6}_get_mtu() handlers in case of x->props.mode equals
XFRM_MODE_TUNNEL. Thus, the following expression will overflow

  unsigned int net_adj;
  ...
  <case ipv{4,6} XFRM_MODE_TUNNEL>
         net_adj = 0;
  ...
  return ((mtu - x->props.header_len - crypto_aead_authsize(esp->aead) -
           net_adj) & ~(align - 1)) + (net_adj - 2);

where (net_adj - 2) would be evaluated as <foo> + (0 - 2) in an unsigned
context. Fix it by simply removing brackets as those operations here
do not need to have special precedence.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Benjamin Poirier <bpoirier@suse.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05 12:26:50 -07:00
David S. Miller
0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Stefan Tomanek
73f5698e77 fib_rules: fix suppressor names and default values
This change brings the suppressor attribute names into line; it also changes
the data types to provide a more consistent interface.

While -1 indicates that the suppressor is not enabled, values >= 0 for
suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
constraint.

This changes the previously presented behaviour of suppress_prefixlen, where a
prefix length _less_ than the attribute value was rejected. After this change,
a prefix length less than *or* equal to the value is considered a violation of
the rule constraint.

It also changes the default values for default and newly added rules (disabling
any suppression for those).

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 10:40:23 -07:00
Stefan Tomanek
6ef94cfafb fib_rules: add route suppression based on ifgroup
This change adds the ability to suppress a routing decision based upon the
interface group the selected interface belongs to. This allows it to
exclude specific devices from a routing decision.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:24:22 -07:00
Daniel Borkmann
446266b0c7 net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing fails
Commit 5c766d642 ("ipv4: introduce address lifetime") leaves the ifa
resource that was allocated via inet_alloc_ifa() unfreed when returning
the function with -EINVAL. Thus, free it first via inet_free_ifa().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 14:56:06 -07:00
Stefan Tomanek
7764a45a8f fib_rules: add .suppress operation
This change adds a new operation to the fib_rules_ops struct; it allows the
suppression of routing decisions if certain criteria are not met by its
results.

The first implemented constraint is a minimum prefix length added to the
structures of routing rules. If a rule is added with a minimum prefix length
>0, only routes meeting this threshold will be considered. Any other (more
general) routing table entries will be ignored.

When configuring a system with multiple network uplinks and default routes, it
is often convinient to reference the main routing table multiple times - but
omitting the default route. Using this patch and a modified "ip" utility, this
can be achieved by using the following command sequence:

  $ ip route add table secuplink default via 10.42.23.1

  $ ip rule add pref 100            table main prefixlength 1
  $ ip rule add pref 150 fwmark 0xA table secuplink

With this setup, packets marked 0xA will be processed by the additional routing
table "secuplink", but only if no suitable route in the main routing table can
be found. By using a minimal prefixlength of 1, the default route (/0) of the
table "main" is hidden to packets processed by rule 100; packets traveling to
destinations with more specific routing entries are processed as usual.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:27:17 -07:00
fan.du
ca4c3fc24e net: split rt_genid for ipv4 and ipv6
Current net name space has only one genid for both IPv4 and IPv6, it has below
drawbacks:

- Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
- Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
  entries even when the policy is only applied for one address family.

Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
separately in a fine granularity.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:56:36 -07:00
Dmitry Popov
c0155b2da4 tcp: Remove unused tcpct declarations and comments
Remove declaration, 4 defines and confusing comment that are no longer used
since 1a2c6181c4 ("tcp: Remove TCPCT").

Signed-off-by: Dmitry Popov <dp@highloadlab.com>
Acked-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 12:16:45 -07:00
Dan Carpenter
4299c8a94f net: remove an unneeded check
"ifa->ifa_label" is an array inside the in_ifaddr struct.  It can never
be NULL so we can remove this check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 19:24:16 -07:00
Hannes Frederic Sowa
5ad37d5dee tcp: add tcp_syncookies mode to allow unconditionally generation of syncookies
| If you want to test which effects syncookies have to your
| network connections you can set this knob to 2 to enable
| unconditionally generation of syncookies.

Original idea and first implementation by Eric Dumazet.

Cc: Florian Westphal <fw@strlen.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:15:18 -07:00
Hannes Frederic Sowa
9d4a031464 ipv4, ipv6: send igmpv3/mld packets with TC_PRIO_CONTROL
v2:
a) Also send ipv4 igmp messages with TC_PRIO_CONTROL

Cc: William Manley <william.manley@youview.com>
Cc: Lukas Tribus <luky-37@hotmail.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-28 11:13:55 -07:00
Thomas Graf
c26bf4a513 pktgen: Add UDPCSUM flag to support UDP checksums
UDP checksums are optional, hence pktgen has been omitting them in
favour of performance. The optional flag UDPCSUM enables UDP
checksumming. If the output device supports hardware checksumming
the skb is prepared and marked CHECKSUM_PARTIAL, otherwise the
checksum is generated in software.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 22:16:36 -07:00
Eric Dumazet
c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Eric Dumazet
64dc61306c net: add sk_stream_is_writeable() helper
Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Jerry Snitselaar
f585a991e1 fib_trie: potential out of bounds access in trie_show_stats()
With the <= max condition in the for loop, it will be always go 1
element further than needed. If the condition for the while loop is
never met, then max is MAX_STAT_DEPTH, and for loop will walk off the
end of nodesizes[].

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 16:05:14 -07:00
Amerigo Wang
b9959fd3b0 vti: switch to new ip tunnel code
GRE tunnel and IPIP tunnel already switched to the new
ip tunnel code, VTI tunnel can use it too.

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Saurabh Mohan <saurabh.mohan@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 17:27:32 -07:00
Rami Rosen
c4854ec8c4 ipmr: change the prototype of ip_mr_forward().
This patch changes the prototpye of the ip_mr_forward() method to return void
instead of int.

The ip_mr_forward() method always returns 0; moreover, the return value of this
method is not checked anywhere.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 17:01:05 -07:00
Jiri Pirko
4aa5dee4d9 net: convert resend IGMP to notifier event
Until now, bond_resend_igmp_join_requests() looks for vlans attached to
bonding device, bridge where bonding act as port manually. It does not
care of other scenarios, like stacked bonds or team device above. Make
this more generic and use netdev notifier to propagate the event to
upper devices and to actually call ip_mc_rejoin_groups().

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 16:52:47 -07:00
Yuchung Cheng
ed08495c31 tcp: use RTT from SACK for RTO
If RTT is not available because Karn's check has failed or no
new packet is acked, use the RTT measured from SACK to estimate
the RTO. The sender can continue to estimate the RTO during loss
recovery or reordering event upon receiving non-partial ACKs.

This also changes when the RTO is re-armed. Previously it is
only re-armed when some data is cummulatively acknowledged (i.e.,
SND.UNA advances), but now it is re-armed whenever RTT estimator
is updated. This feature is particularly useful to reduce spurious
timeout for buffer bloat including cellular carriers [1], and
RTT estimation on reordering events.

[1] "An In-depth Study of LTE: Effect of Network Protocol and
 Application Behavior on Performance", In Proc. of SIGCOMM 2013

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng
59c9af4234 tcp: measure RTT from new SACK
Take RTT sample if an ACK selectively acks some sequences that
have never been retransmitted. The Karn's algorithm does not apply
even if that ACK (s)acks other retransmitted sequences, because it
must been generated by an original but perhaps out-of-order packet.
There is no ambiguity. In case when multiple blocks are newly
sacked because of ACK losses the earliest block is used to
measure RTT, similar to cummulative ACKs.

Such RTT samples allow the sender to estimate the RTO during loss
recovery and packet reordering events. It is still useful even with
TCP timestamps. That's because during these events the SND.UNA may
not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED
check before calling tcp_ack_update_rtt()).  Therefore this new
RTT source is complementary to existing ACK and TS RTT mechanisms.

This patch does not update the RTO. It is done in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng
5b08e47caf tcp: prefer packet timing to TS-ECR for RTT
Prefer packet timings to TS-ecr for RTT measurements when both
sources are available. That's because broken middle-boxes and remote
peer can return packets with corrupted TS ECR fields. Similarly most
congestion controls that require RTT signals favor timing-based
sources as well. Also check for bad TS ECR values to avoid RTT
blow-ups. It has happened on production Web servers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng
375fe02c91 tcp: consolidate SYNACK RTT sampling
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Michal Tesar
651e92716a sysctl net: Keep tcp_syn_retries inside the boundary
Limit the min/max value passed to the
/proc/sys/net/ipv4/tcp_syn_retries.

Signed-off-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19 17:36:03 -07:00
Eric Dumazet
21d1196a35 ipv4: set transport header earlier
commit 45f00f99d6 ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
performance regression for non GRO traffic, basically disabling
IP early demux.

IPv6 stack resets transport header in ip6_rcv() before calling
IP early demux in ip6_rcv_finish(), while IPv4 does this only in
ip_local_deliver_finish(), _after_ IP early demux.

GRO traffic happened to enable IP early demux because transport header
is also set in inet_gro_receive()

Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
same : transport_header should be set in ip_rcv() instead of
ip_local_deliver_finish()

ip_local_deliver_finish() can also use skb_network_header_len() which is
faster than ip_hdrlen()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16 12:59:28 -07:00