This patch adds support for conntrack marking from user space.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an oops triggered from userspace. If we don't pass information
about the private protocol info, the reference to attr will be NULL. This is
likely to happen in update messages.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
nfattr_parse (and thus nfattr_parse_nested) always returns success. So we
can make them 'void' and remove all the checking at the caller side.
Based on original patch by Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packet counter variable of conntrack was changed to 32bits from 64bits.
This follows that change.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ip_queue_xmit calls ip_select_ident_more for IP identity selection
it gives it the wrong packet count for TSO packets. The ip_select_*
functions expect one less than the number of packets, so we need to
subtract one for TSO packets.
This bug was diagnosed and fixed by Tom Young.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Jesper Juhl <jesper.juhl@gmail.com>
This is the net/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in net/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
There was a fix in 2.6.13 that changed the behaviour of
ip_vs_conn_expire_now function not to put reference to connection,
its callers should hold write lock or connection refcnt. But we
forgot to convert one caller, when the real server for connection
is unavailable caller should put the connection reference. It
happens only when sysctl var expire_nodest_conn is set to 1 and
such connections never expire. Thanks to Roberto Nibali who found
the problem and tested a 2.4.32-rc2 patch, which is equal to this
2.6 version. Patch for 2.4 is already sent to Marcelo.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Roberto Nibali <ratz@drugphish.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
There's a missing dependency from the CONNMARK target to ip_conntrack.
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
It's not necessary to free skb if netlink_unicast() failed.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The unknown protocol is used as a fallback when a protocol isn't known.
Hence we cannot handle it failing, so don't set ".me". It's OK, since we
only grab a reference from within the same module (iptable_nat.ko), so we
never take the module refcount from 0 to 1.
Also, remove the "protocol is NULL" test: it's never NULL.
Signed-off-by: Rusty Rusty <rusty@rustcorp.com.au>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This endianness bug slipped through while changing the 'gre.key' field in the
conntrack tuple from 32bit to 16bit.
None of my tests caught the problem, since the linux pptp client always has
'0' as call id / gre key. Only windows clients actually trigger the bug.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch fixes compilation of the PPTP conntrack helper when NAT is
configured off.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The max growth of BIC TCP is too large. Original code was based on
BIC 1.0 and the default there was 32. Later code (2.6.13) included
compensation for delayed acks, and should have reduced the default
value to 16; since normally TCP gets one ack for every two packets sent.
The current value of 32 makes BIC too aggressive and unfair to other
flows.
Submitted-by: Injong Rhee <rhee@eos.ncsu.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
And filter mode is exclude.
Further explanation by David Stevens:
Multicast source filters aren't widely used yet, and that's really the only
feature that's affected if an application actually exercises this bug, as far
as I can tell. An ordinary filter-less multicast join should still work, and
only forwarded multicast traffic making use of filters and doing empty-source
filters with the MSFILTER ioctl would be at risk of not getting multicast
traffic forwarded to them because the reports generated would not be based on
the correct counts.
Signed-off-by: Yan Zheng <yanzheng@21cn.com
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Like ip_tables already has it for some time, this adds support for
having multiple revisions for each match/target. We steal one byte from
the name in order to accomodate a 8 bit version number.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Typo fix: dots appearing after a newline in printk strings.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fib_del_ifaddr() dereferences ifa->ifa_dev, so the code already assumes that
ifa->ifa_dev is non-NULL, the check is unnecessary.
Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Attached is kernel patch for UDP Fragmentation Offload (UFO) feature.
1. This patch incorporate the review comments by Jeff Garzik.
2. Renamed USO as UFO (UDP Fragmentation Offload)
3. udp sendfile support with UFO
This patches uses scatter-gather feature of skb to generate large UDP
datagram. Below is a "how-to" on changes required in network device
driver to use the UFO interface.
UDP Fragmentation Offload (UFO) Interface:
-------------------------------------------
UFO is a feature wherein the Linux kernel network stack will offload the
IP fragmentation functionality of large UDP datagram to hardware. This
will reduce the overhead of stack in fragmenting the large UDP datagram to
MTU sized packets
1) Drivers indicate their capability of UFO using
dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG
NETIF_F_HW_CSUM is required for UFO over ipv6.
2) UFO packet will be submitted for transmission using driver xmit routine.
UFO packet will have a non-zero value for
"skb_shinfo(skb)->ufo_size"
skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
fragment going out of the adapter after IP fragmentation by hardware.
skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
indicating that hardware has to do checksum calculation. Hardware should
compute the UDP checksum of complete datagram and also ip header checksum of
each fragmented IP packet.
For IPV6 the UFO provides the fragment identification-id in
skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating
IPv6 fragments.
Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This bug is responsible for causing the infamous "Treason uncloaked"
messages that's been popping up everywhere since the printk was added.
It has usually been blamed on foreign operating systems. However,
some of those reports implicate Linux as both systems are running
Linux or the TCP connection is going across the loopback interface.
In fact, there really is a bug in the Linux TCP header prediction code
that's been there since at least 2.1.8. This bug was tracked down with
help from Dale Blount.
The effect of this bug ranges from harmless "Treason uncloaked"
messages to hung/aborted TCP connections. The details of the bug
and fix is as follows.
When snd_wnd is updated, we only update pred_flags if
tcp_fast_path_check succeeds. When it fails (for example,
when our rcvbuf is used up), we will leave pred_flags with
an out-of-date snd_wnd value.
When the out-of-date pred_flags happens to match the next incoming
packet we will again hit the fast path and use the current snd_wnd
which will be wrong.
In the case of the treason messages, it just happens that the snd_wnd
cached in pred_flags is zero while tp->snd_wnd is non-zero. Therefore
when a zero-window packet comes in we incorrectly conclude that the
window is non-zero.
In fact if the peer continues to send us zero-window pure ACKs we
will continue making the same mistake. It's only when the peer
transmits a zero-window packet with data attached that we get a
chance to snap out of it. This is what triggers the treason
message at the next retransmit timeout.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Fix setting of the broadcast address when the netmask is set via
SIOCSIFNETMASK in Linux 2.6. The code wanted the old value of
ifa->ifa_mask but used it after it had already been overwritten with
the new value.
Signed-off-by: David Engel <gigem@comcast.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
skb_prev is assigned from skb, which cannot be NULL. This patch removes the
unnecessary NULL check.
Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c.
As this dereference directly follows a list_for_each_entry_rcu line, we
have already taken a read barrier with respect to getting an entry from
the list.
This read barrier guarantees that all values read out of fa are valid.
In particular, the contents of structure pointed to by fa->fa_info is
initialised before fa->fa_info is actually set (see fn_trie_insert);
the setting of fa->fa_info itself is further separated with a write
barrier from the insertion of fa into the list.
Therefore by taking a read barrier after obtaining fa from the list
(which is given by list_for_each_entry_rcu), we can be sure that
fa->fa_info contains a valid pointer, as well as the fact that the
data pointed to by fa->fa_info is itself valid.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
It's fairly simple to resize the hash table, but currently you need to
remove and reinsert the module. That's bad (we lose connection
state). Harald has even offered to write a daemon which sets this
based on load.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
In 'net' change the explicit use of for-loops and NR_CPUS into the
general for_each_cpu() or for_each_online_cpu() constructs, as
appropriate. This widens the scope of potential future optimizations
of the general constructs, as well as takes advantage of the existing
optimizations of first_cpu() and next_cpu(), which is advantageous
when the true CPU count is much smaller than NR_CPUS.
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
IPVS used flag NFC_IPVS_PROPERTY in nfcache but as now nfcache was removed the
new flag 'ipvs_property' still needs to be copied. This patch should be
included in 2.6.14.
Further comments from Harald Welte:
Sorry, seems like the bug was introduced by me.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
It is legitimate to call tcp_fragment with len == skb->len since
that is done for FIN packets and the FIN flag counts as one byte.
So we should only check for the len > skb->len case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Original patch by Harald Welte, with feedback from Herbert Xu
and testing by Sbastien Bernard.
EBTABLES, ARP tables, and IP/IP6 tables all assume that cpus
are numbered linearly. That is not necessarily true.
This patch fixes that up by calculating the largest possible
cpu number, and allocating enough per-cpu structure space given
that.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second report of this bug. Unfortunately the first
reporter hasn't been able to reproduce it since to provide more
debugging info.
So let's apply this patch for 2.6.14 to
1) Make this non-fatal.
2) Provide the info we need to track it down.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to avoid unloading a module that has active timewait
sockets, such as DCCP.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support to change the state of the private protocol
information via conntrack_netlink.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the ability of changing the state a TCP connection. I know
that this must be used with care but it's required to provide a complete
conntrack creation via conntrack_netlink. So I'll document this aspect on
the upcoming docs.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initially we used 64bit counters for conntrack-based accounting, since we
had no event mechanism to tell userspace that our counters are about to
overflow. With nfnetlink_conntrack, we now have such a event mechanism and
thus can save 16bytes per connection.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following bugs in ESP:
* Fix transport mode MTU overestimate. This means that the inner MTU
is smaller than it needs be. Worse yet, given an input MTU which
is a multiple of 4 it will always produce an estimate which is not
a multiple of 4.
For example, given a standard ESP/3DES/MD5 transform and an MTU of
1500, the resulting MTU for transport mode is 1462 when it should
be 1464.
The reason for this is because IP header lengths are always a multiple
of 4 for IPv4 and 8 for IPv6.
* Ensure that the block size is at least 4. This is required by RFC2406
and corresponds to what the esp_output function does. At the moment
this only affects crypto_null as its block size is 1.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses the macro ALIGN in all the applicable spots for ESP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To keep consistency, the TCP private protocol information is nested
attributes under CTA_PROTOINFO_TCP. This way the sequence of attributes to
access the TCP state information looks like here below:
CTA_PROTOINFO
CTA_PROTOINFO_TCP
CTA_PROTOINFO_TCP_STATE
instead of:
CTA_PROTOINFO
CTA_PROTOINFO_TCP_STATE
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ID is only required by ICMP type 8 (echo), so it's not
mandatory for all sort of ICMP connections. This patch makes
mandatory only the type and the code for ICMP netlink messages.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we send "status" from userspace, we forget to convert the endianness.
This patch adds the reqired conversion. Thanks to Pablo Neira for
discovering this.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to nfnetlink_queue and ip_queue, we mark ipt_ULOG as obsolete.
This should have been part of the original nfnetlink_log merge, but
I somehow missed it.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
PPTP should not be selectable without conntrack enabled
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Missing parenthesis in causes BIC to be slow in increasing congestion
window.
Spotted by Injong Rhee.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Randy Dunlap <rdunlap@xenotime.net>
Fix implicit nocast warnings in ip_vs code:
net/ipv4/ipvs/ip_vs_app.c:631:54: warning: implicit cast to nocast type
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch below introduces special thresholds to keep root node in the trie
large. This gives a flatter tree at the cost of a modest memory increase.
Overall it seems to be gain and this was also proposed by one the authors
of the paper in recent a seminar.
Main table after loading 123 k routes.
Aver depth: 3.30
Max depth: 9
Root-node size 12 bits
Total size: 4044 kB
With the patch:
Aver depth: 2.78
Max depth: 8
Root-node size 15 bits
Total size: 4150 kB
An increase of 8-10% was seen in forwading performance for an rDoS attack.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
introduces __in_dev_get_rcu() to cover the second case.
1) RCU with refcnt should use in_dev_get().
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().
There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition. I've marked it as such so that we remember to fix it.
This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Meelis Roos <mroos@linux.ee> wrote:
> RK> My firewall setup relies on proxyarp working. However, with 2.6.14-rc3,
> RK> it appears to be completely broken. The firewall is 212.18.232.186,
>
> Same here with some kernel between 14-rc2 and 14-rc3 - no reposnse to
> ARP on a proxyarp gateway. Sorry, no exact revison and no more debugging
> yet since it'a a production gateway.
The breakage is caused by the change to use the CB area for flagging
whether a packet has been queued due to proxy_delay. This area gets
cleared every time arp_rcv gets called. Unfortunately packets delayed
due to proxy_delay also go through arp_rcv when they are reprocessed.
In fact, I can't think of a reason why delayed proxy packets should go
through netfilter again at all. So the easiest solution is to bypass
that and go straight to arp_process.
This is essentially what would've happened before netfilter support
was added to ARP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)
(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)
1) First some performance data :
--------------------------------
tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()
The most time critical code is :
sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}
The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.
As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.
This can be problematic if some chains are very long.
2) The goal
-----------
The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.
3) Description of the patch
---------------------------
Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.
struct sock_common {
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
+ unsigned int skc_hash;
struct proto *skc_prot;
};
Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.
Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)
File include/net/inet_hashtables.h
64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash))
((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \
((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash)) && \
(inet_sk(__sk)->daddr == (__saddr)) && \
(inet_sk(__sk)->rcv_saddr == (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
- Adds a prefetch(head->chain.first) in
__inet_lookup_established()/__tcp_v4_check_established() and
__inet6_lookup_established()/__tcp_v6_check_established() and
__dccp_v4_check_established() to bring into cache the first element of the
list, before the {read|write}_lock(&head->lock);
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've found the problem in general. It affects any 64-bit
architecture. The problem occurs when you change the system time.
Suppose that when you boot your system clock is forward by a day.
This gets recorded down in skb_tv_base. You then wind the clock back
by a day. From that point onwards the offset will be negative which
essentially overflows the 32-bit variables they're stored in.
In fact, why don't we just store the real time stamp in those 32-bit
variables? After all, we're not going to overflow for quite a while
yet.
When we do overflow, we'll need a better solution of course.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Handle better the case where the sender sends full sized
frames initially, then moves to a mode where it trickles
out small amounts of data at a time.
This known problem is even mentioned in the comments
above tcp_grow_window() in tcp_input.c, specifically:
...
* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spagetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.
...
When the sender gives full sized frames, the "struct sk_buff" overhead
from each packet is small. So we'll advertize a larger window.
If the sender moves to a mode where small segments are sent, this
ratio becomes tilted to the other extreme and we start overrunning
the socket buffer space.
tcp_clamp_window() tries to address this, but it's clamping of
tp->window_clamp is a wee bit too aggressive for this particular case.
Fix confirmed by Ion Badulescu.
Signed-off-by: David S. Miller <davem@davemloft.net>
But retain the comment fix.
Alexey Kuznetsov has explained the situation as follows:
--------------------
I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is
not continuous: f.e. for mss=1095 it needs initial window 1095*4, but
for mss=1096 it is 1096*3. We do not know exactly what mss sender used
for calculations. If we advertised 1096 (and calculate initial window
3*1096), the sender could limit it to some value < 1096 and then it
will need window his_mss*4 > 3*1096 to send initial burst.
See?
So, the honest function for inital rcv_wnd derived from
tcp_init_cwnd() is:
init_rcv_wnd(mss)=
min { init_cwnd(mss1)*mss1 for mss1 <= mss }
It is something sort of:
if (mss < 1096)
return mss*4;
if (mss < 1096*2)
return 1096*4;
return mss*2;
(I just scrablled a graph of piece of paper, it is difficult to see or
to explain without this)
I selected it differently giving more window than it is strictly
required. Initial receive window must be large enough to allow sender
following to the rfc (or just setting initial cwnd to 2) to send
initial burst. But besides that it is arbitrary, so I decided to give
slack space of one segment.
Actually, the logic was:
If mss is low/normal (<=ethernet), set window to receive more than
initial burst allowed by rfc under the worst conditions
i.e. mss*4. This gives slack space of 1 segment for ethernet frames.
For msses slighlty more than ethernet frame, take 3. Try to give slack
space of 1 frame again.
If mss is huge, force 2*mss. No slack space.
Value 1460*3 is really confusing. Minimal one is 1096*2, but besides
that it is an arbitrary value. It was meant to be ~4096. 1460*3 is
just the magic number from RFC, 1460*3 = 1095*4 is the magic :-), so
that I guess hands typed this themselves.
--------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
When you've enabled conntrack and NAT as a module (standard case in all
distributions), and you've also enabled the new conntrack netlink
interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
This causes a huge performance penalty, since for every packet you iterate
the nat code, even if you don't want it.
This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
iptables frontend (iptable_nat.ko). Threfore, ip_conntrack_netlink.ko will
only pull ip_nat.ko, but not the frontend. ip_nat.ko will "only" allocate
some resources, but not affect runtime performance.
This separation is also a nice step in anticipation of new packet filters
(nf-hipac, ipset, pkttables) being able to use the NAT core.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE, SCTP and TCP protocol helpers did not call
ip_conntrack_event_cache() when updating ct->status. This patch adds
the respective calls.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to introduce a separate Kconfig menu entry for the NFQUEUE targets.
They cannot "just" depend on nfnetlink_queue, since nfnetlink_queue could
be linked into the kernel, whereas iptables can be a module.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a number of bugs. It cannot be reasonably split up in
multiple fixes, since all bugs interact with each other and affect the same
function:
Bug #1:
The event cache code cannot be called while a lock is held. Therefore, the
call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
moved outside of the locked section. This fixes a number of 2.6.14-rcX
oops and deadlock reports.
Bug #2:
We used to call ct_add_counters() for unconfirmed connections without
holding a lock. Since the add operations are not atomic, we could race
with another CPU.
Bug #3:
ip_ct_refresh_acct() lost REFRESH events in some cases where refresh
(and the corresponding event) are desired, but no accounting shall be
performed. Both, evenst and accounting implicitly depended on the skb
parameter bein non-null. We now re-introduce a non-accounting
"ip_ct_refresh()" variant to explicitly state the desired behaviour.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noted by Alexey Dobriyan, the DEBUGP statement prints the wrong
callID.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the introduction of TSO pcount a year ago, it has been possible
for tcp_fragment() to cause packets_out to decrease. Prior to that,
tcp_retrans_try_collapse() was the only way for that to happen on the
retransmission path.
When this happens with Reno, it is possible for sasked_out to become
invalid because it is only an estimate and not tied to any particular
packet on the retransmission queue.
Therefore we need to adjust sacked_out as well as left_out in the Reno
case. The following patch does exactly that.
This bug is pretty difficult to trigger in practice though since you
need a SACKless peer with a retransmission that occurs just as the
cached MTU value expires.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch from Joel Sing to fix the default congestion control algorithm
for incoming connections. If a new congestion control handler is added
(via module), it should become the default for new
connections. Instead, the incoming connections use reno. The cause is
incorrect initialisation causes the tcp_init_congestion_control()
function to return after the initial if test fails.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup the printk's in fib_trie:
* Convert a couple of places in the dump code to BUG_ON
* Put log level's on each message
The version message really needed the message since it leaks out
on the pretty Fedora bootup.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>,
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix unchecked __get_user that could be tricked into generating a
memory read on an arbitrary address. The result of the read is not
returned directly but you may be able to divine some information about
it, or use the read to cause a crash on some architectures by reading
hardware state. CAN-2004-2492.
Fix from Al Viro, ack from Dave Miller.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem is that we're now calling tcp_fragment() in a context
where the packets might be marked as SACKED_ACKED or SACKED_RETRANS.
This was not possible before as you never retransmitted packets that
are so marked.
Because of this, we need to adjust sacked_out and retrans_out in
tcp_fragment(). This is exactly what the following patch does.
We also need to preserve the SACKED_ACKED/SACKED_RETRANS marking
if they exist.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Those exports are needed by the PPTP helper following in the next
couple of changes.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both __ip_conntrack_expect_find and ip_conntrack_expect_find_get take
a reference to the expectation, the difference is that callers of
__ip_conntrack_expect_find must hold ip_conntrack_lock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new "version 3" PPTP conntrack/nat helper is finally ready for
mainline inclusion. Special thanks to lots of last-minute bugfixing
by Patric McHardy.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* This patch is from Paul McKenney's RCU reviewing.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Prints the route tnode and set the stats level deepth as before.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_ct_refresh_acct() can be called without a valid "skb" pointer.
This used to work, since ct_add_counters() deals with that fact.
However, the recently-added event cache doesn't handle this at all.
This patch is a quick fix that is supposed to be replaced soon by a cleaner
solution during the pending redesign of the event cache.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of maintaining an array containing a list of nodes this instance
is responsible for let's use a simple bitmap. This provides the
following features:
* clusterip_responsible() and the add_node()/delete_node() operations
become very simple and don't need locking
* the config structure is much smaller
In spite of the completely different internal data representation the
user-space interface remains almost unchanged; the only difference is
that the proc file does not list nodes in the order they were added.
(The target info structure remains the same.)
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CLUSTERIP target creates a procfs entry for all different cluster
IPs. Although more than one rules can refer to a single cluster IP (and
thus a single config structure), removal of the procfs entry is done
unconditionally in destroy(). In more complicated situations involving
deferred dereferencing of the config structure by procfs and creating a
new rule with the same cluster IP it's also possible that no entry will
be created for the new rule.
This patch fixes the problem by counting the number of entries
referencing a given config structure and moving the config list
manipulation and procfs entry deletion parts to the
clusterip_config_entry_put() function.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_vs_ftp when loaded can create NAT connections with unknown client
port for passive FTP. For such expectations we lookup with cport=0 on
incoming packet but it matches the format of the persistence templates
causing packets to other persistent virtual servers to be forwarded to
real server without creating connection. Later the reply packets are
treated as foreign and not SNAT-ed.
This patch changes the connection lookup for packets from clients:
* introduce IP_VS_CONN_F_TEMPLATE connection flag to mark the
connection as template
* create new connection lookup function just for templates -
ip_vs_ct_in_get
* make sure ip_vs_conn_in_get hits only connections with
IP_VS_CONN_F_NO_CPORT flag set when s_port is 0. By this way
we avoid returning template when looking for cport=0 (ftp)
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Agostino di Salle noticed that persistent templates are not
invalidated due to buggy optimization.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes line dupes at /ipv4/igmp.c and /ipv6/mcast.c in the
2.6 kernel, where MCAST_EXCLUDE is mistakenly used instead of
MCAST_INCLUDE.
Signed-off-by: Denis Lukianov <denis@voxelsoft.com>
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that the SACK fragmenting code may incorrectly call
tcp_fragment() with a length larger than the skb->len. This happens
when the skb on the transmit queue completely falls to the LHS of the
SACK.
And add a BUG() check to tcp_fragment() so we can spot this kind of
error more quickly in the future.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 2.6.13-rcX the MASQUERADE target was changed not to exclude local
packets for better source address consistency. This breaks DHCP clients
using UDP sockets when the DHCP requests are caught by a MASQUERADE rule
because the MASQUERADE target drops packets when no address is configured
on the outgoing interface. This patch makes it ignore packets with a
source address of 0.
Thanks to Rusty for this suggestion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't parse the packet, the data is already available in the conntrack
structure.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With large port numbers the helper_names buffer can overflow.
Noticed by Samir Bellabes <sbellabes@mandriva.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size. Also use
human-time conversion functions instead of hard-coded division to avoid
rounding issues.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an extra left_out/lost_out adjustment in tcp_fragment which
means that the lost_out accounting is always wrong. This patch removes
that chunk of code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up timer initialization by introducing DEFINE_TIMER a'la
DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been
been in the -RT tree for some time.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With the use of RCU in files structure, the look-up of files using fds can now
be lock-free. The lookup is protected by rcu_read_lock()/rcu_read_unlock().
This patch changes the readers to use lock-free lookup.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create one iterator for walking over FIB trie, and use it
for all the /proc functions. Add a /proc/net/route
output for backwards compatibility with old applications.
Make initialization of fib_trie same as fib_hash so no #ifdef
is needed in af_inet.c
Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=5209
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One such place that can damage the dst refcnts is route.c with
CONFIG_IP_ROUTE_MULTIPATH_CACHED enabled, i don't see the user's
.config. In this new code i see that rt_intern_hash is called before
dst->refcnt is set to 1, dst is the 2nd arg to rt_intern_hash.
Arg 2 of rt_intern_hash must come with refcnt 1 as it is added to
table or dropped depending on error/add/update. One such example is
ip_mkroute_input where __mkroute_input return rth with refcnt 0 which
is provided to rt_intern_hash. ip_mkroute_output looks like a 2nd such
place. Appending untested patch for comments and review. The idea is
to put previous reference as we are going to return next result/error.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
A UDP packet may contain extra data that needs to be trimmed off.
But when doing so, UDP forgets to fixup the skb checksum if CHECKSUM_HW
is being used.
I think this explains the case of a NFS receive using skge driver
causing 'udp hw checksum failures' when interacting with a crufty
settop box.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was found by inspection while looking for checksum problems
with the skge driver that sets CHECKSUM_HW. It did not fix the
problem, but it looks like it is needed.
If IP reassembly is trimming an overlapping fragment, it
should reset (or adjust) the hardware checksum flag on the skb.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch kills __ip_ct_expect_unlink_destroy and export
unlink_expect as ip_ct_unlink_expect. As it was discussed [1], the function
__ip_ct_expect_unlink_destroy is a bit confusing so better do the following
sequence: ip_ct_destroy_expect and ip_conntrack_expect_put.
[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-August/020794.html
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the NAT module is loaded when connections are already confirmed
it must not change their tuples anymore. This is especially important
with CONFIG_NETFILTER_DEBUG, the netfilter listhelp functions will
refuse to remove an entry from a list when it can not be found on
the list, so when a changed tuple hashes to a new bucket the entry
is kept in the list until and after the conntrack is freed.
Allocate the exact conntrack tuple for NAT for already confirmed
connections or drop them if that fails.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Connection mark tracking support is one of the feature in connection
tracking, so IP_NF_CONNTRACK_MARK depends on IP_NF_CONNTRACK.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A permanent expectation exists until timeing out and can expect
multiple related connections.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP_OFF assignment at the bottom of that if block can indeed set
TCP_OFF without setting TCP_PAGE. Since there is not much to be
gained from avoiding this situation, we might as well just zap the
offset. The following patch should fix it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every file should #include the header files containing the prototypes of
it's global functions.
nfs_fs.h contains the prototype of root_nfs_parse_addr().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
All we need to do is resegment the queue so that
we record SACK information accurately. The edges
of the SACK blocks guide our resegmenting decisions.
With help from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
I've finally found a potential cause of the sk_forward_alloc underflows
that people have been reporting sporadically.
When tcp_sendmsg tacks on extra bits to an existing TCP_PAGE we don't
check sk_forward_alloc even though a large amount of time may have
elapsed since we allocated the page. In the mean time someone could've
come along and liberated packets and reclaimed sk_forward_alloc memory.
This patch makes tcp_sendmsg check sk_forward_alloc every time as we
do in do_tcp_sendpages.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces sk_stream_wmem_schedule as a short-hand for
the sk_forward_alloc checking on egress.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the patch to add a NULL short-circuit to crypto_free_tfm() went in,
there's no longer any need for callers of that function to check for NULL.
This patch removes the redundant NULL checks and also a few similar checks
for NULL before calls to kfree() that I ran into while doing the
crypto_free_tfm bits.
I've succesfuly compile tested this patch, and a kernel with the patch
applied boots and runs just fine.
When I posted the patch to LKML (and other lists/people on Cc) it drew the
following comments :
J. Bruce Fields commented
"I've no problem with the auth_gss or nfsv4 bits.--b."
Sridhar Samudrala said
"sctp change looks fine."
Herbert Xu signed off on the patch.
So, I guess this is ready to be dropped into -mm and eventually mainline.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a trivial typo in clusterip_config_init().
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new iptables target allows manipulation of the TTL of an IPv4 packet.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.
On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Removes RW-lock
* Proteced read functions uses
rcu_dereference proteced with rcu_read_lock()
* writing of procted pointer w. rcu_assigen_pointer
* Insert/Replace atomic list_replace_rcu
* A BUG_ON condition removed.in trie_rebalance
With help from Paul E. McKenney.
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* RCU versions of hlist_***_rcu
* fib_alias partial rcu port just whats needed now.
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With ip_rcv nowhere outside the IP stack being used anymore it's
EXPORT_SYMBOL is not needed any longer either.
Signed-off-by: Ralf Baechle DL5RB <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a redo of earlier cleanup stuff:
* replace DBG() macro with pr_debug()
* get rid of duplicate extern's that are already in fib_lookup.h
* use BUG_ON and WARN_ON
* don't use BUG checks for null pointers where next statement would
get a fault anyway
* remove debug printout when rebalance causes deep tree
* remove trailing blanks
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Originally written by Henrik Nordstrom <hno@marasystems.com>, taken
from netfilter patch-o-matic and added ip6_tables support.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols that make extensive use of SKB cloning,
for example TCP, eat at least 2 allocations per
packet sent as a result.
To cut the kmalloc() count in half, we implement
a pre-allocation scheme wherein we allocate
2 sk_buff objects in advance, then use a simple
reference count to free up the memory at the
correct time.
Based upon an initial patch by Thomas Graf and
suggestions from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
This variant is needed to satisfy sparse __user annotations.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Of this type, mostly:
CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rip out cmd/sid/pid matching since its unfixable broken and stands in the
way of locking changes to tasklist_lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gary Wayne Smith <gary.w.smith@primeexalia.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increases consistency in source-address selection.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ads a new "connbytes" match that utilizes the CONFIG_NF_CT_ACCT
per-connection byte and packet counters. Using it you can do things like
packet classification on average packet size within a connection.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this the previous setup is back, i.e. tcp_diag can be built as a module,
as dccp_diag and both share the infrastructure available in inet_diag.
If one selects CONFIG_INET_DIAG as module CONFIG_INET_TCP_DIAG will also be
built as a module, as will CONFIG_INET_DCCP_DIAG, if CONFIG_IP_DCCP was
selected static or as a module, if CONFIG_INET_DIAG is y, being statically
linked CONFIG_INET_TCP_DIAG will follow suit and CONFIG_INET_DCCP_DIAG will be
built in the same manner as CONFIG_IP_DCCP.
Now to aim at UDP, converting it to use inet_hashinfo, so that we can use
iproute2 for UDP sockets as well.
Ah, just to show an example of this new infrastructure working for DCCP :-)
[root@qemu ~]# ./ss -dane
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 0 *:5001 *:* ino:942 sk:cfd503a0
ESTAB 0 0 127.0.0.1:5001 127.0.0.1:32770 ino:943 sk:cfd50a60
ESTAB 0 0 127.0.0.1:32770 127.0.0.1:5001 ino:947 sk:cfd50700
TIME-WAIT 0 0 127.0.0.1:32769 127.0.0.1:5001 timer:(timewait,3.430ms,0) ino:0 sk:cf209620
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next changeset will introduce net/ipv4/tcp_diag.c, moving the code that was put
transitioanlly in inet_diag.c.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next changeset will rename tcp_diag.[ch] to inet_diag.[ch].
I'm taking this longer route so as to easy review, making clear the changes
made all along the way.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next changeset will rename tcp_diag to inet_diag and move the tcp_diag code out
of it and into a new tcp_diag.c, similar to the net/dccp/diag.c introduced in
this changeset, completing the transition to a generic inet_diag
infrastructure.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Doing this we allow tcp_diag to support IPV6 even if tcp_diag is compiled
statically and IPV6 is compiled as a module, removing the previous restriction
while not building any IPV6 code if it is not selected.
Now to work on the tcpdiag_register infrastructure and then to rename the whole
thing to inetdiag, reflecting its by then completely generic nature.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the same way as was done with the v4 counterparts, this will be moved
to inet6_hashtables.c.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With ugly ifdefs, etc, but this actually:
1. keeps the existing ABI, i.e. no need to recompile the iproute2
utilities if not interested in DCCP.
2. Provides all the tcp_diag functionality in DCCP, with just a
small patch that makes iproute2 support DCCP.
Of course I'll get this cleaned-up in time, but for now I think its
OK to be this way to quickly get this functionality.
iproute2-ss050808 patch at:
http://vger.kernel.org/~acme/iproute2-ss050808.dccp.patch
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(),
minimal renaming/moving done in this changeset to ease review.
Most of it is just changes of struct tcp_sock * to struct sock * parameters.
With this we move to a state closer to two interesting goals:
1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used
for any INET transport protocol that has struct inet_hashinfo and are
derived from struct inet_connection_sock. Keeps the userspace API, that will
just not display DCCP sockets, while newer versions of tools can support
DCCP.
2. INET generic transport pluggable Congestion Avoidance infrastructure, using
the current TCP CA infrastructure with DCCP.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also export the ones that will be used in the next changeset, when
DCCP uses this infrastructure.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That groups all of the tables and variables associated to the TCP timewait
schedulling/recycling/killing code, that now can be isolated from the TCP
specific code and used by other transport protocols, such as DCCP.
Next changeset will move this code to net/ipv4/inet_timewait_sock.c
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using this new iptables DCCP protocol header match, it is possible to
create simplistic stateless packet filtering rules for DCCP. It
permits matching of port numbers, packet type and options.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use const where possible and get rid of EXTRACT() macro
that was never used.
Signed-off-by: Stephen Hemmigner <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Below is a patch that cleans up some of this, supposedly without
changing any behaviour:
* Whitespace cleanups
* Introduce DBG()
* BUG_ON() instead of if () { BUG(); }
* Remove some of the deep nesting to make the code flow more
comprehensible
* Some mask operations were simplified
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the bug which doesn't return ERR_PTR(-ENOMEM) if it
failed to allocate memory space from slab cache. This bug leads to
erroneously not dropped packets under stress, and wrong statistic
counters ('invalid' is incremented instead of 'drop'). It was
introduced during the ctnetlink merge in the net-2.6.14 tree, so no
stable or mainline releases affected.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a /proc/net/netfilter/nf_queue file, similar to the
recently-added /proc/net/netfilter/nf_log. It indicates which queue
handler is registered to which protocol family. This is useful since
there are now multiple queue handlers in the treee (ip[6]_queue,
nfnetlink_queue).
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since nfnetlink_queue can override ip{6}_queue as queue handlers, we
can no longer blindly unregister whoever is registered for PF_INET[6],
but only unregister ourselves.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to DaveM, it is preferrable to have large data structures be
allocated dynamically from the module init() function rather than
putting them as static global variables into BSS.
This patch moves the conntrack helper packet buffers into dynamically
allocated memory.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syntax is net-pf-PROTOCOL_FAMILY-PROTOCOL-SOCK_TYPE and if this
fails net-pf-PROTOCOL_FAMILY-PROTOCOL.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also improves reqsk_queue_prune and renames it to
inet_csk_reqsk_queue_prune, as it deals with both inet_connection_sock
and inet_request_sock objects, not just with request_sock ones thus
belonging to inet_request_sock.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this we're very close to getting all of the current TCP
refactorings in my dccp-2.6 tree merged, next changeset will export
some functions needed by the current DCCP code and then dccp-2.6.git
will be born!
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also moved inet_iif from tcp to inet_hashtables.h, as it is
needed by the inet_lookup callers, perhaps this needs a bit of
polishing, but for now seems fine.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Completing the previous changeset, this also generalises tcp_v4_synq_add,
renaming it to inet_csk_reqsk_queue_hash_add, already geing used in the
DCCP tree, which I plan to merge RSN.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This creates struct inet_connection_sock, moving members out of struct
tcp_sock that are shareable with other INET connection oriented
protocols, such as DCCP, that in my private tree already uses most of
these members.
The functions that operate on these members were renamed, using a
inet_csk_ prefix while not being moved yet to a new file, so as to
ease the review of these changes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Out of tcp_create_openreq_child, will be used in
dccp_create_openreq_child, and is a nice sock function anyway.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the parts of tcp_time_wait that are not TCP specific, tcp_time_wait uses
it and so will dccp_time_wait.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
And also some TIME_WAIT functions.
[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 282955 13122 9312 305389 4a8ed net/ipv4/built-in.o
/tmp/after.size: 281566 13122 9312 304000 4a380 net/ipv4/built-in.o
[acme@toy net-2.6.14]$
I kept them still inlined, will uninline at some point to see what
would be the performance difference.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This paves the way to generalise the rest of the sock ID lookup
routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
kernels (where IPv6 is always built as a module):
[root@qemu ~]# grep tw_sock /proc/slabinfo
tw_sock_TCPv6 0 0 128 31 1
tw_sock_TCP 0 0 96 41 1
[root@qemu ~]#
Now if a protocol wants to use the TIME_WAIT generic infrastructure it
only has to set the sk_prot->twsk_obj_size field with the size of its
inet_timewait_sock derived sock and proto_register will create
sk_prot->twsk_slab, for now its only for INET sockets, but we can
introduce timewait_sock later if some non INET transport protocolo
wants to use this stuff.
Next changesets will take advantage of this new infrastructure to
generalise even more TCP code.
[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 188646 11764 5068 205478 322a6 net/ipv4/built-in.o
/tmp/after.size: 188144 11764 5068 204976 320b0 net/ipv4/built-in.o
[acme@toy net-2.6.14]$
Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
(qemu host)).
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
[acme@toy net-2.6.14]$ grep built-in /tmp/before /tmp/after
/tmp/before: 282560 13122 9312 304994 4a762 net/ipv4/built-in.o
/tmp/after: 282560 13122 9312 304994 4a762 net/ipv4/built-in.o
Will be used in DCCP, not exporting it right now not to get in Adrian
Bunk's exported-but-not-used-on-modules radar 8)
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It really just makes the existing code be a helper function that
tcp_v4_hash and tcp_unhash uses, specifying the right inet_hashinfo,
tcp_hashinfo.
One thing I'll investigate at some point is to have the inet_hashinfo
pointer in sk_prot, so that we get all the hashtable information from
the sk pointer, this can lead to some extra indirections that may well
hurt performance/code size, we'll see. Ultimate idea would be that
sk_prot would provide _all_ the information about a protocol
implementation.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lots of places just needs the states, not even linux/tcp.h, where this
enum was, needs it.
This speeds up development of the refactorings as less sources are
rebuilt when things get moved from net/tcp.h.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also expose all of the tcp_hashinfo members, i.e. killing those
tcp_ehash, etc macros, this will more clearly expose already generic
functions and some that need just a bit of work to become generic, as
we'll see in the upcoming changesets.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This required moving tcp_bucket_cachep to inet_hashinfo.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently conntracks are inserted after the head. That means that
conntracks are sorted from the biggest to the smallest id. This happens
because we use list_prepend (list_add) instead list_add_tail. This can
result in problems during the list iteration.
list_for_each(i, &ip_conntrack_hash[cb->args[0]]) {
h = (struct ip_conntrack_tuple_hash *) i;
if (DIRECTION(h) != IP_CT_DIR_ORIGINAL)
continue;
ct = tuplehash_to_ctrack(h);
if (ct->id <= *id)
continue;
In that case just the first conntrack in the bucket will be dumped. To
fix this, we iterate the list from the tail to the head via
list_for_each_prev. Same thing for the list of expectations.
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the size of the ctnl_exp_cb array that is IPCTNL_MSG_EXP_MAX
instead of IPCTNL_MSG_MAX. Simple typo.
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In unlink_expect(), the expectation is removed from the list so the
refcount must be dropped as well.
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following sequence is displayed during events dumping of an ICMP
connection: [NEW] [DESTROY] [UPDATE]
This happens because the event IPCT_DESTROY is delivered in
death_by_timeout(), that is called from the icmp protocol helper
(ct->timeout.function) once we see the reply.
To fix this, we move this event to destroy_conntrack().
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We used to use nested nfattr structures for ip_conntrack_expect. This is
bogus, since ip_conntrack and ip_conntrack_expect are communicated in
different netlink message types. both should be encoded at the top level
attributes, no extra nesting required. This patch addresses the issue.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch, every nfnetlink subsystem had to specify it's
attribute count. However, in reality the attribute count depends on
the message type within the subsystem, not the subsystem itself. This
patch moves 'attr_count' from 'struct nfnetlink_subsys' into
nfnl_callback to fix this.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There was a stupid copy+paste mistake where we parse the MASK nfattr into
the "tuple" variable instead of the "mask" variable. This patch fixes it.
Thanks to Pablo Neira.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codepath allowed for ip_conntrack_lock to be unlock'ed twice.
Signed-off-by: Pablo Neira <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
nfattr_parse_nested() calls nfattr_parse() which in turn does a memset
on the 'tb' array. All callers therefore don't need to memset before
calling it.
Signed-off-by: Pablo Neira <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcnt underflow: the reference count is decremented when a conntrack
entry is removed from the hash but it is not incremented when entering
new entries.
missing protection of process context against softirq context: all
cache operations need to locally disable softirqs to avoid races.
Additionally the event cache can't be initialized when a packet
enteres the conntrack code but needs to be initialized whenever we
cache an event and the stored conntrack entry doesn't match the
current one.
incorrect flushing of the event cache in ip_ct_iterate_cleanup:
without real locking we can't flush the cache for different CPUs
without incurring races. The cache for different CPUs can only be
flushed when no packets are going through the
code. ip_ct_iterate_cleanup doesn't need to drop all references, so
flushing is moved to the cleanup path.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should really be in a inet_connection_sock, but I'm leaving it
for a later optimization, when some more fields common to INET
transport protocols now in tcp_sk or inet_sk will be chunked out into
inet_connection_sock, for now its better to concentrate on getting the
changes in the core merged to leave the DCCP tree with only DCCP
specific code.
Next changesets will take advantage of this move to generalise things
like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the later
receive a inet_hashinfo parameter, and even __tcp_tw_hashdance, etc in
the future, when tcp_tw_bucket gets transformed into the struct
timewait_sock hierarchy.
tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets
moved to sk_prot.
A cascade of incremental changes will ultimately make the tcp_lookup
functions be fully generic.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to break down the complexity of the series of patches,
making it very clear that this one just does:
1. renames tcp_ prefixed hashtable functions and data structures that
were already mostly generic to inet_ to share it with DCCP and
other INET transport protocols.
2. Removes not used functions (__tb_head & tb_head)
3. Removes some leftover prototypes in the headers (tcp_bucket_unlock &
tcp_v4_build_header)
Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can
make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port,
__tcp_put_port, generic and get others like tcp_destroy_sock closer to generic
(tcp_orphan_count will go to sk->sk_prot to allow this).
Eventually most of these functions will be used passing the transport protocol
inet_hashinfo structure.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be shared with DCCP (and others), this is the start of a series of patches
that will expose the already generic TCP hash table routines.
The few changes noticed when calling gcc -S before/after on a pentium4 were of
this type:
movl 40(%esp), %edx
cmpl %esi, 472(%edx)
je .L168
- pushl $291
+ pushl $272
pushl $.LC0
pushl $.LC1
pushl $.LC2
[acme@toy net-2.6.14]$ size net/ipv4/tcp_ipv4.before.o net/ipv4/tcp_ipv4.after.o
text data bss dec hex filename
17804 516 140 18460 481c net/ipv4/tcp_ipv4.before.o
17804 516 140 18460 481c net/ipv4/tcp_ipv4.after.o
Holler if some weird architecture has issues with things like this 8)
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is in preparation to nfnetlink_log:
- loggers now have to register struct nf_logger instead of nf_logfn
- nf_log_unregister() replaced by nf_log_unregister_pf() and
nf_log_unregister_logger()
- add comment to ip[6]t_LOG.h to assure nobody redefines flags
- add /proc/net/netfilter/nf_log to tell user which logger is currently
registered for which address family
- if user has configured logging, but no logging backend (logger) is
available, always spit a message to syslog, not just the first time.
- split ip[6]t_LOG.c into two parts:
Backend: Always try to register as logger for the respective address family
Frontend: Always log via nf_log_packet() API
- modify all users of nf_log_packet() to accomodate additional argument
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From tcp_v4_rebuild_header, that already was pretty generic, I only
needed to use sk->sk_protocol instead of the hardcoded IPPROTO_TCP and
establish the requirement that INET transport layer protocols that
want to use this function map TCP_SYN_SENT to its equivalent state.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From tcp_v4_setup_caps, that always is preceded by a call to
__sk_dst_set, so coalesce this sequence into sk_setup_caps, removing
one call to a TCP function in the IP layer.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This operation was already generic and DCCP will use it.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add new nfnetlink_queue module
- Add new ipt_NFQUEUE and ip6t_NFQUEUE modules to access queue numbers 1-65535
- Mark ip_queue and ip6_queue Kconfig options as OBSOLETE
- Update feature-removal-schedule to remove ip[6]_queue in December
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- split netfiler verdict in 16bit verdict and 16bit queue number
- add 'queuenum' argument to nf_queue_outfn_t and its users ip[6]_queue
- move NFNL_SUBSYS_ definitions from enum to #define
- introduce autoloading for nfnetlink subsystem modules
- add MODULE_ALIAS_NFNL_SUBSYS macro
- add nf_unregister_queue_handlers() to register all handlers for a given
nf_queue_outfn_t
- add more verbose DEBUGP macro definition to nfnetlink.c
- make nfnetlink_subsys_register fail if subsys already exists
- add some more comments and debug statements to nfnetlink.c
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rerouting functionality is required by the core, therefore it has
to be implemented by the core and not in individual queue handlers.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Remove bogus code for compiling netlink as module
- Add module refcounting support for modules implementing a netlink
protocol
- Add support for autoloading modules that implement a netlink protocol
as soon as someone opens a socket for that protocol
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter cleanup
- Move ipv4 code from net/core/netfilter.c to net/ipv4/netfilter.c
- Move ipv6 netfilter code from net/ipv6/ip6_output.c to net/ipv6/netfilter.c
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing IPv4-specific in it. In fact, it was already used by
IPv6, too... Upcoming nfnetlink_queue code will use it for any kind
of packet.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 the following unused global function:
- xfrm4_state.c: xfrm4_state_fini
- remove the following unneeded EXPORT_SYMBOL's:
- ip_output.c: ip_finish_output
- ip_output.c: sysctl_ip_default_ttl
- fib_frontend.c: ip_dev_find
- inetpeer.c: inet_peer_idlock
- ip_options.c: ip_options_compile
- ip_options.c: ip_options_undo
- net/core/request_sock.c: sysctl_max_syn_backlog
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding just wants the device before the skb_bond()
decapsulation occurs, so simply pass that original
device into packet_type->func() as an argument.
It remains to be seen whether we can use this same
exact thing to get rid of skb->input_dev as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ctnetlink subsystem for userspace-access to ip_conntrack table.
This allows reading and updating of existing entries, as well as
creating new ones (and new expect's) via nfnetlink.
Please note the 'strange' byte order: nfattr (tag+length) are in host
byte order, while the payload is always guaranteed to be in network
byte order. This allows a simple userspace process to encapsulate netlink
messages into arch-independent udp packets by just processing/swapping the
headers and not knowing anything about the actual payload.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a notifier chain based event mechanism for ip_conntrack state
changes. As opposed to the previous implementations in patch-o-matic, we
do no longer need a field in the skb to achieve this.
Thanks to the valuable input from Patrick McHardy and Rusty on the idea
of a per_cpu implementation.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the "list" member of struct sk_buff, as it is entirely
redundant. All SKB list removal callers know which list the
SKB is on, so storing this in sk_buff does nothing other than
taking up some space.
Two tricky bits were SCTP, which I took care of, and two ATM
drivers which Francois Romieu <romieu@fr.zoreil.com> fixed
up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
As discussed at netconf'05, we're trying to save every bit in sk_buff.
The patch below makes sk_buff 8 bytes smaller. I did some basic
testing on my notebook and it seems to work.
The only real in-tree user of nfcache was IPVS, who only needs a
single bit. Unfortunately I couldn't find some other free bit in
sk_buff to stuff that bit into, so I introduced a separate field for
them. Maybe the IPVS guys can resolve that to further save space.
Initially I wanted to shrink pkt_type to three bits (PACKET_HOST and
alike are only 6 values defined), but unfortunately the bluetooth code
overloads pkt_type :(
The conntrack-event-api (out-of-tree) uses nfcache, but Rusty just
came up with a way how to do it without any skb fields, so it's safe
to remove it.
- remove all never-implemented 'nfcache' code
- don't have ipvs code abuse 'nfcache' field. currently get's their own
compile-conditional skb->ipvs_property field. IPVS maintainers can
decide to move this bit elswhere, but nfcache needs to die.
- remove skb->nfcache field to save 4 bytes
- move skb->nfctinfo into three unused bits to save further 4 bytes
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed at netconf'05, we convert nfmark and conntrack-mark to be
32bits even on 64bit architectures.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a semantic match occurs either success, not found or an error
(for matching unreachable routes/blackholes) is returned. fib_trie
ignores the errors and looks for a different matching route. Treat
results other than "no match" as success and end lookup.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trips up a lot of folks reading this code.
Put an unlikely() around the port-exhaustion test
for good measure.
Signed-off-by: David S. Miller <davem@davemloft.net>
Intention of this bit is to force pushing of the existing
send queue when TCP_CORK or TCP_NODELAY state changes via
setsockopt().
But it's easy to create a situation where the bit never
clears. For example, if the send queue starts empty:
1) set TCP_NODELAY
2) clear TCP_NODELAY
3) set TCP_CORK
4) do small write()
The current code will leave TCP_NAGLE_PUSH set after that
sequence. Unconditionally clearing the bit when new data
is added via skb_entail() solves the problem.
Signed-off-by: David S. Miller <davem@davemloft.net>
The checksum needs to be filled in on output, after mangling a packet
ip_summed needs to be reset.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com>
Found this bug while doing some scaling testing that created 500K inet
peers.
peer_check_expire() in net/ipv4/inetpeer.c isn't using inet_peer_gc_mintime
correctly and will end up creating an expire timer with less than the
minimum duration, and even zero/negative if enough active peers are
present.
If >65K peers, the timer will be less than inet_peer_gc_mintime, and with
>70K peers, the timer duration will reach zero and go negative.
The timer handler will continue to schedule another zero/negative timer in
a loop until peers can be aged. This can continue for at least a few
minutes or even longer if the peers remain active due to arriving packets
while the loop is occurring.
Bug is present in both 2.4 and 2.6. Same patch will apply to both just
fine.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the tail SKB fits into the window, it is still
benefitical to defer until the goal percentage of
the window is available. This give the application
time to feed more data into the send queue and thus
results in larger TSO frames going out.
Patch from Dmitry Yusupov <dima@neterion.com>.
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly, remove bogus BUG() in receive path.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
An incorrect check made it bail out before doing anything.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a false-positive from debug_smp_processor_id().
The processor ID is only used to look up crypto_tfm objects.
Any processor ID is acceptable here as long as it is one that is
iterated on by for_each_cpu().
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a bug report and initial patch by
Ollie Wild.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) We send out a normal sized packet with TSO on to start off.
2) ICMP is received indicating a smaller MTU.
3) We send the current sk_send_head which needs to be fragmented
since it was created before the ICMP event. The first fragment
is then sent out.
At this point the remaining fragment is allocated by tcp_fragment.
However, its size is padded to fit the L1 cache-line size therefore
creating tail-room up to 124 bytes long.
This fragment will also be sitting at sk_send_head.
4) tcp_sendmsg is called again and it stores data in the tail-room of
of the fragment.
5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment
since the packet as a whole exceeds the MTU.
At this point we have a packet that has data in the head area being
fed to tso_fragment which bombs out.
My take on this is that we shouldn't ever call tcp_fragment on a TSO
socket for a packet that is yet to be transmitted since this creates
a packet on sk_send_head that cannot be extended.
So here is a patch to change it so that tso_fragment is always used
in this case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Well I've only found one potential cause for the assertion
failure in tcp_mark_head_lost. First of all, this can only
occur if cnt > 1 since tp->packets_out is never zero here.
If it did hit zero we'd have much bigger problems.
So cnt is equal to fackets_out - reordering. Normally
fackets_out is less than packets_out. The only reason
I've found that might cause fackets_out to exceed packets_out
is if tcp_fragment is called from tcp_retransmit_skb with a
TSO skb and the current MSS is greater than the MSS stored
in the TSO skb. This might occur as the result of an expiring
dst entry.
In that case, packets_out may decrease (line 1380-1381 in
tcp_output.c). However, fackets_out is unchanged which means
that it may in fact exceed packets_out.
Previously tcp_retrans_try_collapse was the only place where
packets_out can go down and it takes care of this by decrementing
fackets_out.
So we should make sure that fackets_out is reduced by an appropriate
amount here as well.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's a small patch to cleanup NETDEBUG() use in net/ipv4/ for Linux
kernel 2.6.13-rc5. Also weird use of indentation is changed in some
places.
Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of 'rustynat' in 2.6.11, the old tricks of preventing
NAT of 'untracked' connections (e.g. NOTRACK target in 'raw' table) are no
longer sufficient.
The ip_conntrack_untracked.status |= IPS_NAT_DONE_MASK effectively
prevents iteration of the 'nat' table, but doesn't prevent nat_packet()
to be executed. Since nr_manips is gone in 'rustynat', nat_packet() now
implicitly thinks that it has to do NAT on the packet.
This patch fixes that problem by explicitly checking for
ip_conntrack_untracked in ip_nat_fn().
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The interface needs much redesigning if we wish to allow
normal users to do this in some way.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_write_xmit caches the cwnd value indirectly in cwnd_quota. When
tcp_transmit_skb reduces the cwnd because of tcp_enter_cwr, the cached
value becomes invalid.
This patch ensures that the cwnd value is always reread after each
tcp_transmit_skb call.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
MSS changes can be lost since we preemptively initialize the tso_segs count
for an SKB before we %100 commit to sending it out.
So, by the time we send it out, the tso_size information can be stale due
to PMTU events. This mucks up all of the logic in our send engine, and can
even result in the BUG() triggering in tcp_tso_should_defer().
Another problem we have is that we're storing the tp->mss_cache, not the
SACK block normalized MSS, as the tso_size. That's wrong too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tunnel modules used to obtain module refcount each time when
some tunnel was created, which meaned that tunnel could be unloaded
only after all the tunnels are deleted.
Since killing old MOD_*_USE_COUNT macros this protection has gone.
It is possible to return it back as module_get/put, but it looks
more natural and practically useful to force destruction of all
the child tunnels on module unload.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
masq_index is used for cleanup in case the interface address changes
(such as a dialup ppp link with dynamic addreses). Without this patch,
slave connections are not evicted in such a case, since they don't inherit
masq_index.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move in_aton to allow netpoll and pktgen to work without the rest of
the IPv4 stack. Fix whitespace and add comment for the odd placement.
Delete now-empty net/ipv4/utils.c
Re-enable netpoll/netconsole without CONFIG_INET
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: "Hans-Juergen Tappe (SYSGO AG)" <hjt@sysgo.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The portptr pointing to the port in the conntrack tuple is declared static,
which could result in memory corruption when two packets of the same
protocol are NATed at the same time and one conntrack goes away.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a connection tracking helper tells us to expect a connection, and
we're already expecting that connection, we simply free the one they
gave us and return success.
The problem is that NAT helpers (eg. FTP) have to allocate the
expectation first (to see what port is available) then rewrite the
packet. If that rewrite fails, they try to remove the expectation,
but it was freed in ip_conntrack_expect_related.
This is one example of a larger problem: having registered the
expectation, the pointer is no longer ours to use. Reference counting
is needed for ctnetlink anyway, so introduce it now.
To have a single "put" path, we need to grab the reference to the
connection on creation, rather than open-coding it in the caller.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following kconfig warning:
net/ipv4/Kconfig:92:warning: defaults for choice values not supported
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert the nf_reset change that caused so much trouble, drop conntrack
references manually before packets are queued to packet sockets.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the protocol specific config options out to the specific protocols.
With this change net/Kconfig now starts to become readable and serve as a
good basis for further re-structuring.
The menu structure is left almost intact, except that indention is
fixed in most cases. Most visible are the INET changes where several
"depends on INET" are replaced with a single ifdef INET / endif pair.
Several new files were created to accomplish this change - they are
small but serve the purpose that config options are now distributed
out where they belongs.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases, we may be generating packets with a source address that
qualifies as martian. This can happen when we're in the middle of setting
up the network, and netfilter decides to reject a packet with an RST.
The IPv4 routing code would try to print a warning and oops, because
locally generated packets do not have a valid skb->mac.raw pointer
at this point.
Signed-off-by: David S. Miller <davem@davemloft.net>
An addition to the last ipvs changes that move
update_defense_level/si_meminfo to keventd:
- ip_vs_random_dropentry now runs in process context and should use _bh
locks to protect from softirqs
- update_defense_level still needs _bh locks after si_meminfo is called,
for the same purpose
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the multicast group matching for
IP_DROP_MEMBERSHIP, similar to the IP_ADD_MEMBERSHIP fix in a prior
patch. Groups are identifiedby <group address,interface> and including
the interface address in the match will fail if a leave-group is done
by address when the join was done by index, or if different addresses
on the same interface are used in the join and leave.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Adds (INCLUDE, empty)/leave-group equivalence to the full-state
multicast source filter APIs (IPv4 and IPv6)
2) Fixes an incorrect errno in the IPv6 leave-group (ENOENT should be
EADDRNOTAVAIL)
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) In the full-state API when imsf_numsrc == 0
errno should be "0", but returns EADDRNOTAVAIL
2) An illegal filter mode change
errno should be EINVAL, but returns EADDRNOTAVAIL
3) Trying to do an any-source option without IP_ADD_MEMBERSHIP
errno should be EINVAL, but returns EADDRNOTAVAIL
4) Adds comments for the less obvious error return values
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Changes IP_ADD_SOURCE_MEMBERSHIP and MCAST_JOIN_SOURCE_GROUP to ignore
EADDRINUSE errors on a "courtesy join" -- prior membership or not
is ok for these.
2) Adds "leave group" equivalence of (INCLUDE, empty) filters in the
delta-based API. Without this, mixing delta-based API calls that
end in an (INCLUDE, empty) filter would not allow a subsequent
regular IP_ADD_MEMBERSHIP. It also frees socket buffer memory that
isn't needed for both the multicast group record and source filter.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects a few problems with the IP_ADD_MEMBERSHIP
socket option:
1) The existing code makes an attempt at reference counting joins when
using the ip_mreqn/imr_ifindex interface. Joining the same group
on the same socket is an error, whatever the API. This leads to
unexpected results when mixing ip_mreqn by index with ip_mreqn by
address, ip_mreq, or other API's. For example, ip_mreq followed by
ip_mreqn of the same group will "work" while the same two reversed
will not.
Fixed to always return EADDRINUSE on a duplicate join and
removed the (now unused) reference count in ip_mc_socklist.
2) The group-search list in ip_mc_join_group() is comparing a full
ip_mreqn structure and all of it must match for it to find the
group. This doesn't correctly match a group that was joined with
ip_mreq or ip_mreqn with an address (with or without an index). It
also doesn't match groups that are joined by different addresses on
the same interface. All of these are the same multicast group,
which is identified by group address and interface index.
Fixed the check to correctly match groups so we don't get
duplicate group entries on the ip_mc_socklist.
3) The old code allocates a multicast address before searching for
duplicates requiring it to free in various error cases. This
patch moves the allocate until after the search and
igmp_max_memberships check, so never a need to allocate, then free
an entry.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Victor Fusco <victor@cetuc.puc-rio.br>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is part of the grand scheme to eliminate the qlen
member of skb_queue_head, and subsequently remove the
'list' member of sk_buff.
Most users of skb_queue_len() want to know if the queue is
empty or not, and that's trivially done with skb_queue_empty()
which doesn't use the skb_queue_head->qlen member and instead
uses the queue list emptyness as the test.
Signed-off-by: David S. Miller <davem@davemloft.net>
Congestion window recover after loss depends upon the fact
that if we have a full MSS sized frame at the head of the
send queue, we will send it. TSO deferral can defeat the
ACK clocking necessary to exit cleanly from recovery.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make TSO segment transmit size decisions at send time not earlier.
The basic scheme is that we try to build as large a TSO frame as
possible when pulling in the user data, but the size of the TSO frame
output to the card is determined at transmit time.
This is guided by tp->xmit_size_goal. It is always set to a multiple
of MSS and tells sendmsg/sendpage how large an SKB to try and build.
Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
necessary and conditions warrant. These routines can also decide to
"defer" in order to wait for more ACKs to arrive and thus allow larger
TSO frames to be emitted.
A general observation is that TSO elongates the pipe, thus requiring a
larger congestion window and larger buffering especially at the sender
side. Therefore, it is important that applications 1) get a large
enough socket send buffer (this is accomplished by our dynamic send
buffer expansion code) 2) do large enough writes.
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_clean_rtx_queue(), if the TSO packet is not even partially
acked, do not waste time calling tcp_tso_acked().
Signed-off-by: David S. Miller <davem@davemloft.net>
Everything stated there is out of data. tcp_trim_skb()
does adjust the available socket send buffer space and
skb->truesize now.
Signed-off-by: David S. Miller <davem@davemloft.net>
Only put user data purely to pages when doing TSO.
The extra page allocations cause two problems:
1) Add the overhead of the page allocations themselves.
2) Make us do small user copies when we get to the end
of the TCP socket cache page.
It is still beneficial to purely use pages for TSO,
so we will do it for that case.
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_test() is run for every packet output by a single
call to tcp_write_xmit(), but this is not necessary.
For one, the congestion window space needs to only be
calculated one time, then used throughout the duration
of the loop.
This cleanup also makes experimenting with different TSO
packetization schemes much easier.
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_test() does several different things, use inline
functions to express this more clearly.
1) It initializes the TSO count of SKB, if necessary.
2) It performs the Nagle test.
3) It makes sure the congestion window is adhered to.
4) It makes sure SKB fits into the send window.
This cleanup also sets things up so that things like the
available packets in the congestion window does not need
to be calculated multiple times by packet sending loops
such as tcp_write_xmit().
Signed-off-by: David S. Miller <davem@davemloft.net>
'nonagle' should be passed to the tcp_snd_test() function
as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the
tail of the write_queue. This is because Nagle does not
apply to such frames since we cannot possibly tack more
data onto them.
However, while doing this __tcp_push_pending_frames() makes
all of the packets in the write_queue use this modified
'nonagle' value.
Fix the bug and simplify this function by just calling
tcp_write_xmit() directly if sk_send_head is non-NULL.
As a result, we can now make tcp_data_snd_check() just call
tcp_push_pending_frames() instead of the specialized
__tcp_data_snd_check().
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_write_xmit() uses tcp_current_mss(), but some of it's callers,
namely __tcp_push_pending_frames(), already has this value available
already.
While we're here, fix the "cur_mss" argument to be "unsigned int"
instead of plain "unsigned".
Signed-off-by: David S. Miller <davem@davemloft.net>
Put the main basic block of work at the top-level of
tabbing, and mark the TCP_CLOSE test with unlikely().
Signed-off-by: David S. Miller <davem@davemloft.net>
The tcp_cwnd_validate() function should only be invoked
if we actually send some frames, yet __tcp_push_pending_frames()
will always invoke it. tcp_write_xmit() does the call for us,
so the call here can simply be removed.
Also, tcp_write_xmit() can be marked static.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we add any new packet to the TCP socket write queue,
we must call skb_header_release() on it in order for the
TSO sharing checks in the drivers to work.
Signed-off-by: David S. Miller <davem@davemloft.net>
It reimplements portions of tcp_snd_check(), so it
we move it to tcp_output.c we can consolidate it's
logic much easier in a later change.
Signed-off-by: David S. Miller <davem@davemloft.net>
This just moves the code into tcp_output.c, no code logic changes are
made by this patch.
Using this as a baseline, we can begin to untangle the mess of
comparisons for the Nagle test et al. We will also be able to reduce
all of the redundant computation that occurs when outputting data
packets.
Signed-off-by: David S. Miller <davem@davemloft.net>
On each packet output, we call tcp_dec_quickack_mode()
if the ACK flag is set. It drops tp->ack.quick until
it hits zero, at which time we deflate the ATO value.
When doing TSO, we are emitting multiple packets with
ACK set, so we should decrement tp->ack.quick that many
segments.
Note that, unlike this case, tcp_enter_cwr() should not
take the tcp_skb_pcount(skb) into consideration. That
function, one time, readjusts tp->snd_cwnd and moves
into TCP_CA_CWR state.
Signed-off-by: David S. Miller <davem@davemloft.net>
The ideal and most optimal layout for an SKB when doing
scatter-gather is to put all the headers at skb->data, and
all the user data in the page array.
This makes SKB splitting and combining extremely simple,
especially before a packet goes onto the wire the first
time.
So, when sk_stream_alloc_pskb() is given a zero size, make
sure there is no skb_tailroom(). This is achieved by applying
SKB_DATA_ALIGN() to the header length used here.
Next, make select_size() in TCP output segmentation use a
length of zero when NETIF_F_SG is true on the outgoing
interface.
Signed-off-by: David S. Miller <davem@davemloft.net>
Below a patch to preallocate memory when doing resize of trie (inflate halve)
If preallocations fails it just skips the resize of this tnode for this time.
The oops we got when killing bgpd (with full routing) is now gone.
Patrick memory patch is also used.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
- rt_check_expire() fixes (an overflow occured if size of the hash
was >= 65536)
reminder of the bugfix:
The rt_check_expire() has a serious problem on machines with large
route caches, and a standard HZ value of 1000.
With default values, ie ip_rt_gc_interval = 60*HZ = 60000 ;
the loop count :
for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;
overflows (t is a 31 bit value) as soon rt_hash_log is >= 16 (65536
slots in route cache hash table).
In this case, rt_check_expire() does nothing at all
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- rt hash table allocated using alloc_large_system_hash() function.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Locking abstraction
- Spinlocks moved out of rt hash table : Less memory (50%) used by rt
hash table. it's a win even on UP.
- Sizing of spinlocks table depends on NR_CPUS
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inflating a node a couple of times makes it exceed the 128k kmalloc limit.
Use __get_free_pages for allocations > PAGE_SIZE, as in fib_hash.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Makes IPv4 ip_rcv registration happen last in af_inet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 2.6.12 we started dropping the conntrack reference when a packet
leaves the IP layer. This broke connection tracking on a bridge,
because bridge-netfilter defers calling some NF_IP_* hooks to the bridge
layer for locally generated packets going out a bridge, where the
conntrack reference is no longer available. This patch keeps the
reference in this case as a temporary solution, long term we will
remove the defered hook calling. No attempt is made to drop the
reference in the bridge-code when it is no longer needed, tc actions
could already have sent the packet anywhere.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an smp system, it is possible for an connection timer to expire, calling
ip_vs_conn_expire while the connection table is being flushed, before
ct_write_lock_bh is acquired.
Since the list iterator loop in ip_vs_con_flush releases and re-acquires the
spinlock (even though it doesn't re-enable softirqs), it is possible for the
expiration function to modify the connection list, while it is being traversed
in ip_vs_conn_flush.
The result is that the next pointer gets set to NULL, and subsequently
dereferenced, resulting in an oops.
Signed-off-by: Neil Horman <nhorman@redhat.com>
Acked-by: JulianAnastasov
Signed-off-by: David S. Miller <davem@davemloft.net>
This should help up the insertion... but the resize is more crucial.
and complex and needs some thinking.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
I think there is a small bug in ipconfig.c in case IPCONFIG_DHCP is set
and dhcp is used.
When a DHCPOFFER is received, ip address is kept until we get DHCPACK.
If no ack is received, ic_dynamic() returns negatively, but leaves the
offered ip address in ic_myaddr.
This makes the main loop in ip_auto_config() break and uses the maybe
incomplete configuration.
Not sure if it's the best way to do, but the following trivial patch
correct this.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
I followed Thomas' proposal to see every martian destination as a case
where the ipInAddrErrors counter has to be incremented. There are
two advantages by doing so: (1) The relation between the ipInReceive
counter and all the other ipInXXX counters is more accurate in the
case the RTN_UNICAST code check fails and (2) it makes the code in
ip_route_input_slow easier.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly missing initialization of padding fields of 1 or 2 bytes length,
two instances of uninitialized nlmsgerr->msg of 16 bytes length.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds mangling of ARP requests (in addition to replies),
since ARP caches are made from snooping both requests and replies.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: <pageexec@freemail.hu>
$subject was fixed in 2.4 already, 2.6 needs it as well.
The impact of the bugs is a kernel stack overflow and privilege escalation
from CAP_NET_ADMIN via the IP_VS_SO_SET_STARTDAEMON/IP_VS_SO_GET_DAEMON
ioctls. People running with 'root=all caps' (i.e., most users) are not
really affected (there's nothing to escalate), but SELinux and similar
users should take it seriously if they grant CAP_NET_ADMIN to other users.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It doesn't seem to make much sense to let an "If unsure, say N." option
default to y.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since it is tristate when we offer it as a choice, we should
definte it also as tristate when forcing it as the default.
Otherwise kconfig warns.
Signed-off-by: David S. Miller <davem@davemloft.net>
Create TCP_CONG_ADVANCED option, akin to IP_ADVANCED_ROUTER, which
when disabled will bypass all of the congestion control Kconfig
options and leave the user with a safe default.
That safe default is currently BIC-TCP with new Reno as a fallback.
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users need not be concerned with a complex choice of what
FIB lookup algorithm to use. So give them the safe default of
IP_FIB_HASH if IP_ADVANCED_ROUTING is disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow using setsockopt to set TCP congestion control to use on a per
socket basis.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements Tom Kelly's Scalable TCP congestion control algorithm
for the modular framework.
The algorithm has some nice scaling properties, and has been used a fair bit
in research, though is known to have significant fairness issues, so it's not
really suitable for general purpose use.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
H-TCP is a congestion control algorithm developed at the Hamilton Institute, by
Douglas Leith and Robert Shorten. It is extending the standard Reno algorithm
with mode switching is thus a relatively simple modification.
H-TCP is defined in a layered manner as it is still a research platform. The
basic form includes the modification of beta according to the ratio of maxRTT
to min RTT and the alpha=2*factor*(1-beta) relation, where factor is dependant
on the time since last congestion.
The other layers improve convergence by adding appropriate factors to alpha.
The following patch implements the H-TCP algorithm in it's basic form.
Signed-Off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Vegas code modified for the new TCP infrastructure.
Vegas now uses microsecond resolution timestamps for
better estimation of performance over higher speed links.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Hybla congestion avoidance.
- "In heterogeneous networks, TCP connections that incorporate a
terrestrial or satellite radio link are greatly disadvantaged with
respect to entirely wired connections, because of their longer round
trip times (RTTs). To cope with this problem, a new TCP proposal, the
TCP Hybla, is presented and discussed in the paper[1]. It stems from an
analytical evaluation of the congestion window dynamics in the TCP
standard versions (Tahoe, Reno, NewReno), which suggests the necessary
modifications to remove the performance dependence on RTT.[...]"[1]
[1]: Carlo Caini, Rosario Firrincieli, "TCP Hybla: a TCP enhancement for
heterogeneous networks",
International Journal of Satellite Communications and Networking
Volume 22, Issue 5 , Pages 547 - 566. September 2004.
Signed-off-by: Daniele Lacamera (root at danielinux.net)net
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sally Floyd's high speed TCP congestion control.
This is useful for comparison and research.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the existing 2.6.12 Westwood code moved from tcp_input
to the new congestion framework. A lot of the inline functions
have been eliminated to try and make it clearer.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP BIC congestion control reworked to use the new congestion control
infrastructure. This version is more up to date than the BIC
code in 2.6.12; it incorporates enhancements from BICTCP 1.1,
to handle low latency links.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enhancement to the tcp_diag interface used by the iproute2 ss command
to report the tcp congestion control being used by a socket.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow TCP to have multiple pluggable congestion control algorithms.
Algorithms are defined by a set of operations and can be built in
or modules. The legacy "new RENO" algorithm is used as a starting
point and fallback.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.
Most of the changes come from the sound and net subsystems. The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.
I left UML alone for now because I would need more time to read the code
carefully before making changes there.
Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kconfig option had an extra double quote at the end of the line
which was causing in warning when building.
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Drop reference before handing the packets to raw_rcv()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Keir Fraser <Keir.Fraser@xl.cam.ac.uk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since expectation timeouts were made compulsory [1], there is no need to
check for them in ip_conntrack_expect_insert.
[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-January/018143.html
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Below is a more generic patch to do fib_lookup via netlink. For others
we should say that we discussed this as a way to verify route selection.
It's also possible there are others uses for this.
In short the fist half of struct fib_result_nl is filled in by caller
and netlink call fills in the other half and returns it.
In case anyone is interested there is a corresponding user app to compare
the full routing table this was used to test implementation of the LC-trie.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the flag XFRM_STATE_NOPMTUDISC for xfrm states. It is
similar to the nopmtudisc on IPIP/GRE tunnels. It only has an effect
on IPv4 tunnel mode states. For these states, it will ensure that the
DF flag is always cleared.
This is primarily useful to work around ICMP blackholes.
In future this flag could also allow a larger MTU to be set within the
tunnel just like IPIP/GRE tunnels. This could be useful for short haul
tunnels where temporary fragmentation outside the tunnel is desired over
smaller fragments inside the tunnel.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds xfrm_init_state which is simply a wrapper that calls
xfrm_get_type and subsequently x->type->init_state. It also gets rid
of the unused args argument.
Abstracting it out allows us to add common initialisation code, e.g.,
to set family-specific flags.
The add_time setting in xfrm_user.c was deleted because it's already
set by xfrm_state_alloc.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enabled, this should disable UCOPY prequeue'ing altogether,
but it does not due to a missing test.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the type of the third parameter 'length' of the
raw_send_hdrinc() function from 'int' to 'size_t'.
This makes sense since this function is only ever called from one
location, and the value passed as the third parameter in that location is
itself of type size_t, so this makes the recieving functions parameter
type match. Also, inside raw_send_hdrinc() the 'length' variable is
used in comparisons with unsigned values and passed as parameter to
functions expecting unsigned values (it's used in a single comparison with
a signed value, but that one can never actually be negative so the patch
also casts that one to size_t to stop gcc worrying, and it is passed in a
single instance to memcpy_fromiovecend() which expects a signed int, but
as far as I can see that's not a problem since the value of 'length'
shouldn't ever exceed the value of a signed int).
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the type of the local variable 'i' in
raw_probe_proto_opt() from 'int' to 'unsigned int'. The only use of 'i' in
this function is as a counter in a for() loop and subsequent index into
the msg->msg_iov[] array.
Since 'i' is compared in a loop to the unsigned variable msg->msg_iovlen
gcc -W generates this warning :
net/ipv4/raw.c:340: warning: comparison between signed and unsigned
Changing 'i' to unsigned silences this warning and is safe since the array
index can never be negative anyway, so unsigned int is the logical type to
use for 'i' and also enables a larger msg_iov[] array (but I don't know if
that will ever matter).
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch gets rid of the following gcc -W warning in net/ipv4/raw.c :
net/ipv4/raw.c:387: warning: comparison of unsigned expression < 0 is always false
Since 'len' is of type size_t it is unsigned and can thus never be <0, and
since this is obvious from the function declaration just a few lines above
I think it's ok to remove the pointless check for len<0.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch silences these two gcc -W warnings in net/ipv4/raw.c :
net/ipv4/raw.c:517: warning: signed and unsigned type in conditional expression
net/ipv4/raw.c:613: warning: signed and unsigned type in conditional expression
It doesn't change the behaviour of the code, simply writes the conditional
expression with plain 'if()' syntax instead of '? :' , but since this
breaks it into sepperate statements gcc no longer complains about having
both a signed and unsigned value in the same conditional expression.
Signed-off-by: David S. Miller <davem@davemloft.net>
In light of my recent patch to net/ipv4/udp.c that replaced the
spin_lock_irq calls on the receive queue lock with spin_lock_bh,
here is a similar patch for all other occurences of spin_lock_irq
on receive/error queue locks in IPv4 and IPv6.
In these stacks, we know that they can only be entered from user
or softirq context. Therefore it's safe to disable BH only.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ensures that netlink events created as a result of programns
using ioctls (such as ifconfig, route etc) contains the correct PID of
those events.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch rectifies some rtnetlink message builders that derive the
flags from the pid. It is now explicit like the other cases
which get it right. Also fixes half a dozen dumpers which did not
set NLM_F_MULTI at all.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This chunks out the accept_queue and tcp_listen_opt code and moves
them to net/core/request_sock.c and include/net/request_sock.h, to
make it useful for other transport protocols, DCCP being the first one
to use it.
Next patches will rename tcp_listen_opt to accept_sock and remove the
inline tcp functions that just call a reqsk_queue_ function.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ok, this one just renames some stuff to have a better namespace and to
dissassociate it from TCP:
struct open_request -> struct request_sock
tcp_openreq_alloc -> reqsk_alloc
tcp_openreq_free -> reqsk_free
tcp_openreq_fastfree -> __reqsk_free
With this most of the infrastructure closely resembles a struct
sock methods subset.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes various crashes on 64-bit when using this module.
Based upon a patch by Juergen Kreileder <jk@blackdown.de>.
Signed-off-by: David S. Miller <davem@davemloft.net>
ACKed-by: Patrick McHardy <kaber@trash.net>
This patch alows you to change the source address of icmp error
messages. It applies cleanly to 2.6.11.11 and retains the default
behaviour.
In the old (default) behaviour icmp error messages are sent with the ip
of the exiting interface.
The new behaviour (when the sysctl variable is toggled on), it will send
the message with the ip of the interface that received the packet that
caused the icmp error. This is the behaviour network administrators will
expect from a router. It makes debugging complicated network layouts
much easier. Also, all 'vendor routers' I know of have the later
behaviour.
Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
>
> Reconstructed forward trace:
>
> net/ipv4/udp.c:1334 spin_lock_irq()
> net/ipv4/udp.c:1336 udp_checksum_complete()
> net/core/skbuff.c:1069 skb_shinfo(skb)->nr_frags > 1
> net/core/skbuff.c:1086 kunmap_skb_frag()
> net/core/skbuff.h:1087 local_bh_enable()
> kernel/softirq.c:0140 WARN_ON(irqs_disabled());
The receive queue lock is never taken in IRQs (and should never be) so
we can simply substitute bh for irq.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have ip_queue being used from LOCAL_IN, then we end up with a
situation where the verdicts coming back from userspace traverse the TCP
input path from syscall context. While this seems to work most of the
time, there's an ugly deadlock:
syscall context is interrupted by the timer interrupt. When the timer
interrupt leaves, the timer softirq get's scheduled and calls
tcp_delack_timer() and alike. They themselves do bh_lock_sock(sk),
which is already held from somewhere else -> boom.
I've now tested the suggested solution by Patrick McHardy and Herbert Xu to
simply use local_bh_{en,dis}able().
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It cannot work properly, so just ignore it in drr
and rr multipath algorithms just like the random
multipath algorithm does.
Suggested by Herbert Xu.
Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option to make secondary IP addresses get promoted
when primary IP addresses are removed from the device.
It defaults to off to preserve existing behavior.
Signed-off-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are doing ucopy, we try to defer the ACK generation to
cleanup_rbuf(). This works most of the time very well, but if the
ucopy prequeue is large, this ACKing behavior kills performance.
With TSO, it is possible to fill the prequeue so large that by the
time the ACK is sent and gets back to the sender, most of the window
has emptied of data and performance suffers significantly.
This behavior does help in some cases, so we should think about
re-enabling this trick in the future, using some kind of limit in
order to avoid the bug case.
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove extra __ip_vs_conn_put for incoming ICMP in direct routing
mode. Mark de Vries reports that IPVS connections are not leaked anymore.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having frag_list members which holds wmem of an sk leads to nightmares
with partially cloned frag skb's. The reason is that once you unleash
a skb with a frag_list that has individual sk ownerships into the stack
you can never undo those ownerships safely as they may have been cloned
by things like netfilter. Since we have to undo them in order to make
skb_linearize happy this approach leads to a dead-end.
So let's go the other way and make this an invariant:
For any skb on a frag_list, skb->sk must be NULL.
That is, the socket ownership always belongs to the head skb.
It turns out that the implementation is actually pretty simple.
The above invariant is actually violated in the following patch
for a short duration inside ip_fragment. This is OK because the
offending frag_list member is either destroyed at the end of the
slow path without being sent anywhere, or it is detached from
the frag_list before being sent.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ross moved. Remove the bad email address so people will find the correct
one in ./CREDITS.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I found a bug that stopped IPsec/IPv6 from working. About
a month ago IPv6 started using rt6i_idev->dev on the cached socket dst
entries. If the cached socket dst entry is IPsec, then rt6i_idev will
be NULL.
Since we want to look at the rt6i_idev of the original route in this
case, the easiest fix is to store rt6i_idev in the IPsec dst entry just
as we do for a number of other IPv6 route attributes. Unfortunately
this means that we need some new code to handle the references to
rt6i_idev. That's why this patch is bigger than it would otherwise be.
I've also done the same thing for IPv4 since it is conceivable that
once these idev attributes start getting used for accounting, we
probably need to dereference them for IPv4 IPsec entries too.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's recap the problem. The current asynchronous netlink kernel
message processing is vulnerable to these attacks:
1) Hit and run: Attacker sends one or more messages and then exits
before they're processed. This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.
2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time. If this is successfully exploited it could
also be used to hold rtnl forever.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
Firstly let's cross off solution c). It only solves the first
problem and it has user-visible impacts. In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).
So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.
For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.
However, it does have a number of deficiencies compared to the
stream mode solution:
1) User-space to user-space netlink communication is still vulnerable.
2) Inefficient use of resources. This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.
3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue. The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue. The
attacker then continues to call sendmsg with the same message in a loop.
Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.
In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point. In the mean time, here is a patch
that implements synchronous processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts remaining rtnetlink_link tables to use c99 designated
initializers to make greping a little bit easier.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This has been brought up before.. http://lkml.org/lkml/2000/1/21/116
but didnt seem to get resolved. This morning I got someone
file a bugzilla about it breaking sysctl(8).
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A lot of places in there are including major.h for no reason
whatsoever. Removed. And yes, it still builds.
The history of that stuff is often amusing. E.g. for net/core/sock.c
the story looks so, as far as I've been able to reconstruct it: we used to
need major.h in net/socket.c circa 1.1.early. In 1.1.13 that need had
disappeared, along with register_chrdev(SOCKET_MAJOR, "socket", &net_fops)
in sock_init(). Include had not. When 1.2 -> 1.3 reorg of net/* had moved
a lot of stuff from net/socket.c to net/core/sock.c, this crap had followed...
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a superfluous intialization from tcp_data_queue().
Signed-off-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the event a raw socket is created for sending purposes only, the creator
never bothers to check the socket's receive queue. But we continue to
add skbs to its queue until it fills up.
Unfortunately, if ip_conntrack is loaded on the box, each skb we add to the
queue potentially holds a reference to a conntrack. If the user attempts
to unload ip_conntrack, we will spin around forever since the queued skbs
are pinned.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Yasuyuki KOZAKAI <yasuyuki.kozkaai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that when doing MTU discovery, the too-large segments in
the write queue will be calculated as having a pcount of >1. When
tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test
when pcount > cwnd.
The segments are eventually transmitted one at a time by keepalive, but
this can take a long time.
This patch checks if TSO is enabled when setting pcount.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NAT changes in 2.6.11 changed the position where helpers
are called and perform packet mangling. Before 2.6.11, a NAT
helper was called before the packet was NATed and had its
sequence number adjusted. Since 2.6.11, the helpers get packets
with already adjusted sequence numbers.
This breaks sequence number adjustment, adjust_tcp_sequence()
needs the original sequence number to determine whether
a packet was a retransmission and to store it for further
corrections. It can't be reconstructed without more information
than available, so this patch restores the old order by
calling helpers from a new conntrack hook two priorities
below ip_conntrack_confirm() and adjusting the sequence number
from a new NAT hook one priority below ip_conntrack_confirm().
Tracked down by Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch just makes the header part of the skb writeable.
This is needed since we modify the IP headers just a few lines below.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a revised alternative that uses BUG_ON/WARN_ON
(as suggested by Herbert Xu) to eliminate NET_CALLER.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!