Use RCU & RTNL protection for mfc_cache_array[]
ipmr_cache_find() is called under rcu_read_lock();
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU and RTNL to protect (struct mr_table)->mroute_sk
Readers use RCU, writers use RTNL.
ip_ra_control() already use an RCU grace period before
ip_ra_destroy_rcu(), so we dont need synchronize_rcu() in
mrtsock_destruct()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to get a reference on reg_dev and release it, we are in a
rcu_read_lock() protected section.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit e81963b180.
LRO is now deprecated in favour of GRO, and only a few drivers use it,
so it is desirable to build it as a module in distribution kernels.
The original change to prevent building it as a module was made in an
attempt to avoid the case where some dependents are set to y and some
to m, and INET_LRO can be set to m rather than y. However, the
Kconfig system will reliably set INET_LRO=y in this case.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the condition (3rd arg) passed to sk_wait_event() in
sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory()
causes the following soft lockup in tcp_sendmsg() when the global tcp
memory pool has exhausted.
>>> snip <<<
localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
localhost kernel:
localhost kernel: Call Trace:
localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83
>>> snip <<<
What is happening is, that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true for the case of tcp global memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true
which causes sk_wait_event() to *not* call schedule_timeout().
Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
This causes the caller to again try allocation, which again fails and again
calls sk_stream_wait_memory(), and so on.
[ Bug introduced by commit c1cbe4b7ad
("[NET]: Avoid atomic xchg() for non-error case") -DaveM ]
Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_dev_find(net, addr) finds a device given an IPv4 source address and
takes a reference on it.
Introduce __ip_dev_find(), taking a third argument, to optionally take
the device reference. Callers not asking the reference to be taken
should be in an rcu_read_lock() protected section.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roger Luethi noticed packets for unknown VLANs getting silently dropped
even in promiscuous mode.
Check for promiscuous mode in __vlan_hwaccel_rx() and vlan_gro_common()
before drops.
As suggested by Patrick, mark such packets to have skb->pkt_type set to
PACKET_OTHERHOST to make sure they are dropped by IP stack.
Reported-by: Roger Luethi <rl@hellgate.ch>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing stress tests with a disabled IP route cache, I found
__mkroute_output() was touching three times in_device atomic refcount.
Use RCU to touch it once to reduce cache line ping pongs.
Before patch
time to perform the test
real 1m42.009s
user 0m12.545s
sys 25m0.726s
Profile :
16109.00 26.4% ip_route_output_slow vmlinux
7434.00 12.2% dst_destroy vmlinux
3280.00 5.4% fib_rules_lookup vmlinux
3252.00 5.3% fib_semantic_match vmlinux
2622.00 4.3% fib_table_lookup vmlinux
2535.00 4.1% dst_alloc vmlinux
1750.00 2.9% _raw_read_lock vmlinux
1532.00 2.5% rt_set_nexthop vmlinux
After patch
real 1m36.503s
user 0m12.977s
sys 23m25.608s
14234.00 22.4% ip_route_output_slow vmlinux
8717.00 13.7% dst_destroy vmlinux
4052.00 6.4% fib_rules_lookup vmlinux
3951.00 6.2% fib_semantic_match vmlinux
3191.00 5.0% dst_alloc vmlinux
1764.00 2.8% fib_table_lookup vmlinux
1692.00 2.7% _raw_read_lock vmlinux
1605.00 2.5% rt_set_nexthop vmlinux
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch restores the below flow control patch submitted by Rémi
Denis-Courmont, which accidentaly got lost due to Pipe controller patch
on Phonet.
commit 1a98214fee
Author: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Date: Mon Aug 30 12:57:03 2010 +0000
Phonet: restore flow control credits when sending fails
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 8cb8e6f168.
That commit introduced a regression with the Bluetooth Profile Tuning
Suite(PTS), Reverting this make sure that L2CAP is in a qualificable
state.
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
As we don't have any error control on the Streaming mode, i.e., we don't
need to keep a copy of the skb for later resending we don't need to
call skb_clone() on it.
Then we can go one further here, and dequeue the skb before sending it,
that also means we don't need to look to sk->sk_send_head anymore.
The patch saves memory and time when sending Streaming mode data, so
it is good to mainline.
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
When receiving L2CAP negative configuration response with respect
to MTU parameter we modify wrong field. MTU here means proposed
value of MTU that the remote device intends to transmit. So for local
L2CAP socket it is pi->imtu.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@nokia.com>
Acked-by: Ville Tervo <ville.tervo@nokia.com>
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
This fixes a bug which caused the FCS setting to show L2CAP_FCS_CRC16
with L2CAP modes other than ERTM or streaming. At present, this only
affects the FCS value shown with getsockopt() for basic mode.
Signed-off-by: Mat Martineau <mathewm@codeaurora.org>
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
HARD_TX_LOCK no longer protects tunnels from dead loops,
but xmit_recursion percpu counter.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Retrieve the header after doing pskb_may_pull since, pskb_may_pull
could change the buffer structure.
This is based on the comment given by Eric Dumazet on Phonet
Pipe controller patch for a similar problem.
Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is some confusion with rx_queue name after RPS, and net drivers
private rx_queue fields.
I suggest to rename "struct net_device"->rx_queue to ingress_queue.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPIP tunnels can benefit from lockless xmits, using NETIF_F_LLTX
Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
10000000 UDP frames via one ipip tunnel (size:200 bytes per frame)
Before patch :
real 2m53.321s
user 0m10.277s
sys 46m0.597s
After patch:
real 0m32.063s
user 0m9.237s
sys 8m16.255s
Last problem to solve is the contention on dst :
16118.00 28.3% __ip_route_output_key vmlinux
6135.00 10.8% dst_release vmlinux
3220.00 5.6% ip_finish_output vmlinux
2149.00 3.8% ip_route_output_flow vmlinux
1575.00 2.8% ip_append_data vmlinux
1481.00 2.6% ip_push_pending_frames vmlinux
1349.00 2.4% __xfrm_lookup vmlinux
1216.00 2.1% csum_partial_copy_generic vmlinux
1208.00 2.1% udp_sendmsg vmlinux
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE tunnels can benefit from lockless xmits, using NETIF_F_LLTX
Note: If tunnels are created with the "oseq" option, LLTX is not
enabled :
Even using an atomic_t o_seq, we would increase chance for packets being
out of order at receiver.
Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
10000000 UDP frames via one gre tunnel (size:200 bytes per frame)
Before patch :
real 3m0.094s
user 0m9.365s
sys 47m50.103s
After patch:
real 0m29.756s
user 0m11.097s
sys 7m33.012s
Last problem to solve is the contention on dst :
38660.00 21.4% __ip_route_output_key vmlinux
20786.00 11.5% dst_release vmlinux
14191.00 7.8% __xfrm_lookup vmlinux
12410.00 6.9% ip_finish_output vmlinux
4540.00 2.5% ip_push_pending_frames vmlinux
4427.00 2.4% ip_append_data vmlinux
4265.00 2.4% __alloc_skb vmlinux
4140.00 2.3% __ip_local_out vmlinux
3991.00 2.2% dev_queue_xmit vmlinux
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 15fc1f7056 (sit: percpu stats accounting) forgot the fallback
tunnel case (sit0), and can crash pretty fast.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 3c97af99a5 (ipip: percpu stats accounting) forgot the fallback
tunnel case (tunl0), and can crash pretty fast.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As tunnel devices are going to be lockless, we need to make sure a
misconfigured machine wont enter an infinite loop.
Add a percpu variable, and limit to three the number of stacked xmits.
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AnyIP is the capability to receive packets and establish incoming
connections on IPs we have not explicitly configured on the machine.
An example use case is to configure a machine to accept all incoming
traffic on eth0, and leave the policy of whether traffic for a given IP
should be delivered to the machine up to the load balancer.
Can be setup as follows:
ip -6 rule from all iif eth0 lookup 200
ip -6 route add local default dev lo table 200
(in this case for all IPv6 addresses)
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows a host to be configured to respond to any address in
a specified range as if it were local, without actually needing to
configure the address on an interface. This is done through routing
table configuration. For instance, to configure a host to respond
to any address in 10.1/16 received on eth0 as a local address we can do:
ip rule add from all iif eth0 lookup 200
ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200
This host is now reachable by any 10.1/16 address (route lookup on
input for packets received on eth0 can find the route). On output, the
rule will not be matched so that this host can still send packets to
10.1/16 (not sent on loopback). Presumably, external routing can be
configured to make sense out of this.
To make this work, we needed to modify the logic in finding the
interface which is assigned a given source address for output
(dev_ip_find). We perform a normal fib_lookup instead of just a
lookup on the local table, and in the lookup we ignore the input
interface for matching.
This patch is useful to implement IP-anycast for subnets of virtual
addresses.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE tunnel driver needs to invoke icmpv6 helpers in the
ipv6 stack when ipv6 support is enabled.
Therefore if IPV6 is enabled, we have to enforce that GRE's
enabling (modular or static) matches that of ipv6.
Reported-by: Patrick McHardy <kaber@trash.net>
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes kernel Bugzilla Bug 18952
This patch adds a syn_set parameter to the retransmits_timed_out()
routine and updates its callers. If not set, TCP_RTO_MIN is taken
as the calculation basis as before. If set, TCP_TIMEOUT_INIT is
used instead, so that sysctl_syn_retries represents the actual
amount of SYN retransmissions in case no SYNACKs are received when
establishing a new connection.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The WMM parameter configuration function (ieee80211_sta_wmm_params) only
configures the WMM parameters to the driver is the wmm_last_param_set
counter value is changed by the AP.
The wmm_last_param_set is initialized to -1 on association in order to ensure
the configuration is made to the driver at least once on association, but
currently this initialization is done *after* the WMM parameter configuration
function was called.
This leads to unreliability in the driver getting properly configured on first
association (depending on what counter value the AP happens to use.) When
disassociating (the wmm default parameters are configured to the driver) and
then reassociating, due to the above the WMM configuration is not set to the
driver at all.
On drivers without beacon filtering the problem is corrected by later beacons,
but on drivers with beacon filtering the WMM will remain permanently incorrectly
configured.
Fix this by moving the initialization of wmm_last_param_set to -1 before
ieee80211_sta_wmm_params is called on association.
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
IBSS started from wireless extensions is currently
missing basic rate configuration, fix this by moving
the code to generate the default to the common code
that gets invoked for both nl80211 and wext.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Association is dealt with as an atomic offchannel operation,
we do this because we don't know we are associated until we
get the associatin response from the AP. When we do get the
associatin response though we were never clearing the offchannel
state. This has a few implications, we told drivers we were
still offchannel, and the first configured TX power for the
channel does not take into account any power constraints.
For ath9k this meant ANI calibration would not start upon
association, and we'd have to wait until the first bgscan
to be triggered. There may be other issues this resolves
but I'm too lazy to comb the code to check.
Cc: stable@kernel.org
Cc: Amod Bodas <amod.bodas@atheros.com>
Cc: Vasanth Thiagarajan <vasanth.thiagarajan@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
tcp: Fix >4GB writes on 64-bit.
net/9p: Mount only matching virtio channels
de2104x: fix ethtool
tproxy: check for transparent flag in ip_route_newports
ipv6: add IPv6 to neighbour table overflow warning
tcp: fix TSO FACK loss marking in tcp_mark_head_lost
3c59x: fix regression from patch "Add ethtool WOL support"
ipv6: add a missing unregister_pernet_subsys call
s390: use free_netdev(netdev) instead of kfree()
sgiseeq: use free_netdev(netdev) instead of kfree()
rionet: use free_netdev(netdev) instead of kfree()
ibm_newemac: use free_netdev(netdev) instead of kfree()
smsc911x: Add MODULE_ALIAS()
net: reset skb queue mapping when rx'ing over tunnel
br2684: fix scheduling while atomic
de2104x: fix TP link detection
de2104x: fix power management
de2104x: disable autonegotiation on broken hardware
net: fix a lockdep splat
e1000e: 82579 do not gate auto config of PHY by hardware during nominal use
...
This covers RX if necessary, as well as TX.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For RPS, we create a kobject for each RX queue based on the number of
queues passed to alloc_netdev_mq(). However, drivers generally do not
determine the numbers of hardware queues to use until much later, so
this usually represents the maximum number the driver may use and not
the actual number in use.
For TX queues, drivers can update the actual number using
netif_set_real_num_tx_queues(). Add a corresponding function for RX
queues, netif_set_real_num_rx_queues().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_attach_filter() and sk_detach_filter() are run with socket locked.
Use the appropriate rcu_dereference_protected() instead of blocking BH,
and rcu_dereference_bh().
There is no point adding BH prevention and memory barrier.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems we dont use appropriate refcount increment in an
rcu_read_lock() protected section.
fib_rule_get() might increment a null refcount and bad things could
happen.
While fib_nl_delrule() respects an rcu grace period before calling
fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a
grace period.
Note : after this patch, we might avoid the synchronize_rcu() call done
in fib_nl_delrule()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le lundi 27 septembre 2010 à 14:29 +0100, Ben Hutchings a écrit :
> > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> > index 5d6ddcb..de39b22 100644
> > --- a/net/ipv4/ip_gre.c
> > +++ b/net/ipv4/ip_gre.c
> [...]
> > @@ -377,7 +405,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net,
> > if (parms->name[0])
> > strlcpy(name, parms->name, IFNAMSIZ);
> > else
> > - sprintf(name, "gre%%d");
> > + strcpy(name, "gre%d");
> >
> > dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
> > if (!dev)
> [...]
>
> This is a valid fix, but doesn't belong in this patch!
>
Sorry ? It was not a fix, but at most a cleanup ;)
Anyway I forgot the gretap case...
[PATCH 2/4 v2] ip_gre: percpu stats accounting
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phonet stack assumes the presence of Pipe Controller, either in Modem or
on Application Processing Engine user-space for the Pipe data.
Nokia Slim Modems like WG2.5 used in ST-Ericsson U8500 platform do not
implement Pipe controller in them.
This patch adds Pipe Controller implemenation to Phonet stack to support
Pipe data over Phonet stack for Nokia Slim Modems.
Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes kernel bugzilla #16603
tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write
zero bytes, for example.
There is also the problem higher up of how verify_iovec() works. It
wants to prevent the total length from looking like an error return
value.
However it does this using 'int', but syscalls return 'long' (and
thus signed 64-bit on 64-bit machines). So it could trigger
false-positives on 64-bit as written. So fix it to use 'long'.
Reported-by: Olaf Bonorden <bono@onlinehome.de>
Reported-by: Daniel Büse <dbuese@gmx.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
p9_virtio_create will only compare the the channel's tag characters
against the device name till the end of the channel's tag but not till
the end of the device name. This means that if a user defines channels
with the tags foo and foobar then he would mount foo when he requested
foonot and may mount foo when he requested foobar.
Thus it is necessary to check both string lengths against each other in
case of a successful partial string match.
Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 and IPv6 have separate neighbour tables, so
the warning messages should be distinguishable.
[ Add a suitable message prefix on the ipv4 side as well -DaveM ]
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP uses FACK algorithm to mark lost packets in
tcp_mark_head_lost(), if the number of packets in the (TSO) skb is
greater than the number of packets that should be marked lost, TCP
incorrectly exits the loop and marks no packets lost in the skb. This
underestimates tp->lost_out and affects the recovery/retransmission.
This patch fargments the skb and marks the correct amount of packets
lost.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8c0c709eea
Author: Johannes Berg <johannes@sipsolutions.net>
Date: Wed Nov 25 17:46:15 2009 +0100
mac80211: move cmntr flag out of rx flags
moved the CMNTR flag into the skb RX flags for
some aggregation cleanups, but this was wrong
since the optimisation this flag tried to make
requires that it is kept across the processing
of multiple interfaces -- which isn't true for
flags in the skb. The patch not only broke the
optimisation, it also introduced a bug: under
some (common!) circumstances the flag will be
set on an already freed skb!
However, investigating this in more detail, I
found that most of the flags that we set should
be per packet, _except_ for this one, due to
a-MPDU processing. Additionally, the flags used
for processing (currently just this one) need
to be reset before processing a new packet.
Since we haven't actually seen bugs reported as
a result of the wrong flags handling (which is
not too surprising -- the only real bug case I
can come up with is an a-MSDU contained in an
a-MPDU), I'll make a different fix for rc.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Even if the reorder timeout timer fires while
scanning, the frames weren't received during
scanning and therefore shouldn't be dropped.
To implement this, changes to the passive scan
RX handler simplify understanding it, because
it currently checks HW_SCANNING independently
of a packet's in-scan receive status (which
doesn't make a big difference, since scan_rx()
will only pick up probe responses and beacons,
which can't be aggregated.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If a station was found, then we'll have exited
the function already, so it is not necessary to
have a variable keeping track of it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There are now four instances of vaguely the same
code that does packet preparation, checking for
MMIC errors and reporting them, and then invoking
packet processing. Consolidate all of these.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The first argument to prepare_for_handlers is always
the sdata that can just be stored in rx data directly
(and even already is, in two of four code paths.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This reverts commit cd87a2d3a3.
Author reports it conflicts with proper fixes, applied hereafter.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If interface does not existk, when nl80211_set_power_save is called, (eg.
module has been unloaded) it has been causing kernel panic. Added new
goto target to avoid crash if get_rdev_dev_by_info_ifindex does not
return dev and rdev pointers.
Signed-off-by: Teemu Paasikivi <ext-teemu.3.paasikivi@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When using multiple STA interfaces on the same radio, some
data packets need to be received on all interfaces
(broadcast, for instance).
Make the STA loop look similar to the mgt-data loop.
Also, add logic to check RX_FLAG_MMIC_ERROR for last
interface in mgt-data loop.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The old ieee80211_find_sta_by_hw method didn't properly
find VIFS when there was more than one per AP. This caused
AMPDU logic in ath9k to get the wrong VIF when trying to
account for transmitted SKBs.
This patch changes ieee80211_find_sta_by_hw to take a
localaddr argument to distinguish between VIFs with the
same AP but different local addresses. The method name
is changed to ieee80211_find_sta_by_ifaddr.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Create 'stations' sub-directory under each netdev:[vif-name]
directory to hold all stations for that network device.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Return -ENOMEM when erroring on kmalloc and fix memory leaks when returning on error.
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Clean up a missing exit path in the ipv6 module init routines. In
addrconf_init we call ipv6_addr_label_init which calls register_pernet_subsys
for the ipv6_addr_label_ops structure. But if module loading fails, or if the
ipv6 module is removed, there is no corresponding unregister_pernet_subsys call,
which leaves a now-bogus address on the pernet_list, leading to oopses in
subsequent registrations. This patch cleans up both the failed load path and
the unload path. Tested by myself with good results.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/net/addrconf.h | 1 +
net/ipv6/addrconf.c | 11 ++++++++---
net/ipv6/addrlabel.c | 5 +++++
3 files changed, 14 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
__in_dev_get_rtnl(dev_out) is called while RTNL is not held, thus
triggers a lockdep fault.
At this point, we only perform a raw test of dev_out->ip_ptr being NULL,
we dont need to make sure ip_ptr cant changed right after.
We can use rcu_dereference_raw() for this.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having two places were we allocate dev->_rx, introduce
netif_alloc_rx_queues() helper and call it only from
register_netdevice(), not from alloc_netdev_mq()
Goal is to let drivers change dev->num_rx_queues after allocating netdev
and before registering it.
This also removes a lot of ifdefs in net/core/dev.c
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
You can't call atomic_notifier_chain_unregister() while in atomic context.
Fix, call un/register_atmdevice_notifier in module __init and __exit.
Bug report:
http://comments.gmane.org/gmane.linux.network/172603
Reported-by: Mikko Vinni <mmvinni@yahoo.com>
Tested-by: Mikko Vinni <mmvinni@yahoo.com>
Signed-off-by: Karl Hiramoto <karl@hiramoto.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically allows vlans to get NETIF_F_HIGHDMA if underlying device
supports it.
On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it
can help to reduce cost of illegal_highdma() and __skb_linearize()
calls.
Tested on tg3 , bnx2, bonding, this worked very well.
This is a generalization of a patch provided by Yi Zou & Jeff Kirsher.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have for each socket :
One spinlock (sk_slock.slock)
One rwlock (sk_callback_lock)
Possible scenarios are :
(A) (this is used in net/sunrpc/xprtsock.c)
read_lock(&sk->sk_callback_lock) (without blocking BH)
<BH>
spin_lock(&sk->sk_slock.slock);
...
read_lock(&sk->sk_callback_lock);
...
(B)
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)
(C)
spin_lock_bh(&sk->sk_slock)
...
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)
spin_unlock_bh(&sk->sk_slock)
This (C) case conflicts with (A) :
CPU1 [A] CPU2 [C]
read_lock(callback_lock)
<BH> spin_lock_bh(slock)
<wait to spin_lock(slock)>
<wait to write_lock_bh(callback_lock)>
We have one problematic (C) use case in inet_csk_listen_stop() :
local_bh_disable();
bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
WARN_ON(sock_owned_by_user(child));
...
sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)
lockdep is not happy with this, as reported by Tetsuo Handa
It seems only way to deal with this is to use read_lock_bh(callbacklock)
everywhere.
Thanks to Jarek for pointing a bug in my first attempt and suggesting
this solution.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating a bit, I found ip_fragment() slow path was taken
because ip_append_data() provides following layout for a send(MTU +
N*(MTU - 20)) syscall :
- one skb with 1500 (mtu) bytes
- N fragments of 1480 (mtu-20) bytes (before adding IP header)
last fragment gets 17 bytes of trail data because of following bit:
if (datalen == length + fraggap)
alloclen += rt->dst.trailer_len;
Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
another bug ?)
In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
so we take slow path, building another skb chain.
In order to avoid taking slow path, we should correct ip_append_data()
to make sure last fragment has real trail space, under mtu...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes stale mac80211_tx_control_flags for
filtered / retried frames.
Because ieee80211_handle_filtered_frame feeds skbs back
into the tx path, they have to be stripped of some tx
flags so they won't confuse the stack, driver or device.
Cc: <stable@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
IEEE Std 802.11k-2008 added DS Parameter Set information element into
Probe Request frames as an optional information on 2.4 GHz band (and
mandatory, if radio measurements are enabled). This allows APs to
filter out Probe Request frames that may be received from neighboring
overlapping channels and by doing so, reduce the number of unnecessary
frames in the air. Make mac80211 add this IE into Probe Request frames
whenever the channel is known (i.e., whenever hwscan is not used).
Signed-off-by: Jouni Malinen <j@w1.fi>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If the TX rate set has been masked, the removed rates can also be
removed from the Supported Rates and Extended Supported Rates IEs in
Probe Request frames.
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
commit 8c0c709eea
Author: Johannes Berg <johannes@sipsolutions.net>
Date: Wed Nov 25 17:46:15 2009 +0100
mac80211: move cmntr flag out of rx flags
moved the CMTR flag into the skb's status, and
in doing so introduced a use-after-free -- when
the skb has been handed to cooked monitors the
status setting will touch now invalid memory.
Additionally, moving it there has effectively
discarded the optimisation -- since the bit is
only ever set on freed SKBs, and those were a
copy, it could never be checked.
For the current release, fixing this properly
is a bit too involved, so let's just remove the
problematic code and leave userspace with one
copy of each frame for each virtual interface.
Cc: stable@kernel.org [2.6.33+]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
otherwise ECT(1) bit will get interpreted as RTO_ONLINK
and routing will fail with XfrmOutBundleGenError.
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The x25_datagram_poll didn't add anything, removed it.
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
we need to check proper socket type within ipv4_conntrack_defrag
function before referencing the nodefrag flag.
For example the tun driver receive path produces skbs with
AF_UNSPEC socket type, and so current code is causing unwanted
fragmented packets going out.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checksum calculation in nf_nat_snmp_basic.
Based on patches by Clark Wang <wtweeker@163.com> and
Stephen Hemminger <shemminger@vyatta.com>.
https://bugzilla.kernel.org/show_bug.cgi?id=17622
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As soon as rcu_read_unlock() is called, there is no guarantee current
thread can safely derefence t pointer, rcu protected.
Fix is to copy t->alloc_size in a temporary variable.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_me_harder can't create the route cache when the outdev is the same
with the indev for the skbs whichout a valid protocol set.
__mkroute_input functions has this check:
1998 if (skb->protocol != htons(ETH_P_IP)) {
1999 /* Not IP (i.e. ARP). Do not create route, if it is
2000 * invalid for proxy arp. DNAT routes are always valid.
2001 *
2002 * Proxy arp feature have been extended to allow, ARP
2003 * replies back to the same interface, to support
2004 * Private VLAN switch technologies. See arp.c.
2005 */
2006 if (out_dev == in_dev &&
2007 IN_DEV_PROXY_ARP_PVLAN(in_dev) == 0) {
2008 err = -EINVAL;
2009 goto cleanup;
2010 }
2011 }
This patch gives the new skb a valid protocol to bypass this check. In order
to make ipt_REJECT work with bridges, you also need to enable ip_forward.
This patch also fixes a regression. When we used skb_copy_expand(), we
didn't have this issue stated above, as the protocol was properly set.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I initially noticed this because of the compiler warning below, but it
does seem to be a valid concern in the case where ct_sip_get_header()
returns 0 in the first iteration of the while loop.
net/netfilter/nf_conntrack_sip.c: In function 'sip_help_tcp':
net/netfilter/nf_conntrack_sip.c:1379: warning: 'ret' may be used uninitialized in this function
Signed-off-by: Simon Horman <horms@verge.net.au>
[Patrick: changed NF_DROP to NF_ACCEPT]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
transparent field of a socket is either inet_twsk(sk)->tw_transparent
for timewait sockets, or inet_sk(sk)->transparent for other sockets
(TCP/UDP).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
CAIF sockets should use socket's default send and receive buffers sizes.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check that receive function pointer is not null before calling it.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use pr_debug for flow control printouts, and refine an error printout.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove debugging quirk redefining pr_debug to pr_warning.
Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/ethtool.c: In function 'ethtool_get_regs':
net/core/ethtool.c:818:2: error: implicit declaration of function 'vmalloc'
net/core/ethtool.c:818:9: warning: assignment makes pointer from integer without a cast
net/core/ethtool.c:833:2: error: implicit declaration of function 'vfree'
Signed-off-by: David S. Miller <davem@davemloft.net>
Special care should be taken when slow path is hit in ip_fragment() :
When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.
Many thanks to Nick Bowler, who provided a very clean bug report and
test program.
Thanks to Jarek for reviewing my first patch and providing a V2
While Nick bisection pointed to commit 2b85a34e91 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)
A side effect is to extend work done in commit b2722b1c3a
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.
Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some NICs have huge register files which exceed the maximum heap
allocation size.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The `options_received' struct is redundant, since it re-duplicates the existing
`p' and `x_recv' fields. This patch removes the sub-struct and migrates the
format conversion operations to ccid3_hc_tx_parse_options().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This adds a function to take care of the following, separate cases occurring in
the computation of the Loss Rate p:
* 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5;
* 1/0 is mapped into 100%, the maximum;
* to avoid that p = 1/x is rounded down to 0 when x is very large, since this
means accidentally re-entering slow-start indicated by p == 0, the minimum
resolution value of p is now returned instead;
* a bug in ccid3_hc_rx_getsockopt is fixed: 1/0 was mapped into ~0U.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch is thanks to an investigation by Leandro Sales de Melo and his
colleagues. They worked out two state diagrams which highlight the fact that
the xxx_TERM states in CCID-3/4 are in fact not necessary.
And this can be confirmed by in turn looking at the code: the xxx_TERM states
are only ever set in ccid3_hc_{rx,tx}_exit(): when CCID-3 sets the state
to xxx_TERM, it is at a time where no more processing should be going on,
hence it is not necessary to introduce a dedicated exit state - this is already
implied by unloading the CCID.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The constants DCCPO_{MIN,MAX}_CCID_SPECIFIC are nowhere used in the code, but
instead for the CCID-specific options numbers are used.
This patch unifies the use of CCID-specific option numbers, by adding symbolic
names reflecting the definitions in RFC 4340, 10.3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This
1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is
necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state
which options may (not) appear on what packet type.
2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in
RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets")
and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets").
3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This
is also no longer necessary, since the CCID-specific option-parsing routines
are passed every single parameter of the type-length-value option encoding.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
If a RST comes in immediately after checking sk->sk_err, tcp_poll will
return POLLIN but not POLLOUT. Fix this by checking sk->sk_err at the end
of tcp_poll. Additionally, ensure the correct order of operations on SMP
machines with memory barriers.
Signed-off-by: Tom Marshall <tdm.code@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just use explicit casts, since we really can't change the
types of structures exported to userspace which have been
around for 15 years or so.
Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The family parameter xfrm_state_find is used to find a state matching a
certain policy. This value is set to the template's family
(encap_family) right before xfrm_state_find is called.
The family parameter is however also used to construct a temporary state
in xfrm_state_find itself which is wrong for inter-family scenarios
because it produces a selector for the wrong family. Since this selector
is included in the xfrm_user_acquire structure, user space programs
misinterpret IPv6 addresses as IPv4 and vice versa.
This patch splits up the original init_tempsel function into a part that
initializes the selector respectively the props and id of the temporary
state, to allow for differing ip address families whithin the state.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a driver doesn't fill the entire buffer, old
heap contents may remain, and if it also doesn't
update the length properly, this old heap content
will be copied back to userspace.
It is very unlikely that this happens in any of
the drivers using private ioctls since it would
show up as junk being reported by iwpriv, but it
seems better to be safe here, so use kzalloc.
Reported-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Under load, netif_rx() can drop incoming packets but administrators dont
have a chance to spot which device needs some tuning (RPS activation for
example)
This patch adds rx_dropped accounting in vlans and tunnels.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 can be a module, we should test CONFIG_IPV6 and CONFIG_IPV6_MODULE
to enable ipv6 bits in ip_gre.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Related dicussion here : http://lkml.org/lkml/2010/9/3/16
Introduce a function br_parse_ip_options that will audit the
skb and possibly refill IP options before a packet enters the
IP stack. If no options are present, the function will zero out
the skb cb area so that it is not misinterpreted as options by some
unsuspecting IP layer routine. If packet consistency fails, drop it.
Signed-off-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is basically just a cleanup. IRQs were disabled on the previous
line so we don't need to do it again here. In the current code IRQs
would get turned on one line earlier than intended.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original code if the copy_from_user() fails in rds_rdma_pages()
then the error handling fails and we get a stack trace from kmalloc().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits)
dca: disable dca on IOAT ver.3.0 multiple-IOH platforms
netpoll: Disable IRQ around RCU dereference in netpoll_rx
sctp: Do not reset the packet during sctp_packet_config().
net/llc: storing negative error codes in unsigned short
MAINTAINERS: move atlx discussions to netdev
drivers/net/cxgb3/cxgb3_main.c: prevent reading uninitialized stack memory
drivers/net/eql.c: prevent reading uninitialized stack memory
drivers/net/usb/hso.c: prevent reading uninitialized memory
xfrm: dont assume rcu_read_lock in xfrm_output_one()
r8169: Handle rxfifo errors on 8168 chips
3c59x: Remove atomic context inside vortex_{set|get}_wol
tcp: Prevent overzealous packetization by SWS logic.
net: RPS needs to depend upon USE_GENERIC_SMP_HELPERS
phylib: fix PAL state machine restart on resume
net: use rcu_barrier() in rollback_registered_many
bonding: correctly process non-linear skbs
ipv4: enable getsockopt() for IP_NODEFRAG
ipv4: force_igmp_version ignored when a IGMPv3 query received
ppp: potential NULL dereference in ppp_mp_explode()
net/llc: make opt unsigned in llc_ui_setsockopt()
...
The ethtool utility does not set masks for flow parameters that are
not specified, so if both value and mask are 0 then this must be
treated as equivalent to a mask with all bits set. Currently that is
done in the only driver that implements RX n-tuple filtering, ixgbe.
Move it to the ethtool core.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_packet_config() is called when getting the packet ready
for appending of chunks. The function should not touch the
current state, since it's possible to ping-pong between two
transports when sending, and that can result packet corruption
followed by skb overlfow crash.
Reported-by: Thomas Dreibholz <dreibh@iem.uni-due.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
previously, if a vlan master device was moved from one network namespace
to another, all 802.1q and macvlan slaves were deleted.
we can use dev->reg_state to figure out whether dev_change_net_namespace
is happening, since that won't set dev->reg_state NETREG_UNREGISTERING.
so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when
reg_state is not NETREG_UNREGISTERING.
Signed-off-by: David Lamparter <equinox@diac24.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to switch on GRO on a device, ethtool_set_gro() checks this
device provides a get_rx_csum() method.
Some devices dont provide this method, while they do support RX
checksumming.
This patch allows bonding to support GRO :
ethtool -K bond0 gro on
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the alloc_skb() fails then we return 65431 instead of -ENOBUFS
(-105).
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ip6_tnl_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_local_out() is called with rcu_read_lock() held from ip_queue_xmit()
but not from other call sites.
Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rcu_dereference_raw() now needs to know the type of its argument.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some buggy APs do not respond to unicast probe requests
or send unicast probe requests very delayed so in the
worst case we should try to send broadcast probe requests,
otherwise we can get disconnected from these APs.
Even if drivers do not have filters to disregard probe
responses from foreign APs mac80211 will only process
probe responses from our associated AP for re-arming
connection monitoring.
We need to do this since the beacon monitor does not
push back the connection monitor by design so even if we
are getting beacons from these type of APs our connection
monitor currently relies heavily on the way the probe
requests are received on the AP. An example of an AP
affected by this is the Nexus One, but this has also been
observed with random APs.
We can probably optimize this later by using null funcs
instead of probe requests.
For more details refer to:
http://code.google.com/p/chromium-os/issues/detail?id=5715
This patch has fixes for stable kernels [2.6.35+].
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The beacon monitor should be disabled when going off channel
to prevent spurious warnings and triggering connection
deterioration work such as sending probe requests. Re-enable
the beacon monitor once we come back to the home channel.
This patch has fixes for stable kernels [2.6.34+].
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This will be used by other components next. The beacon
monitor was added as of 2.6.34 so these fixes are applicable
only to kernels >= 2.6.34.
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When we go offchannel mac80211 currently leaves alive the
connection idle monitor. This should be instead postponed
until we come back to our home channel, otherwise by the
time we get back to the home channel we could be triggering
unecesary probe requests. For APs that do not respond to
unicast probe requests (Nexus One is a simple example) this
means we essentially get disconnected after the probes
fails.
This patch has stable fixes for kernels [2.6.35+]
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Upon beacon loss we send probe requests after 30 seconds of idle
time and we wait for each probe response 1/2 second. We send a
total of 3 probe requests before giving up on the AP. In the case
that we reset the connection idle monitor we should reset the probe
requests count to 0. Right now this won't help in any way but
the next patch will.
This patch has fixes for stable kernel [2.6.35+].
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This will be used in another place later. The connection
monitor was added as of 2.6.35 so these fixes will be
applicable to >= 2.6.35.
Cc: stable@kernel.org
Cc: Paul Stewart <pstew@google.com>
Cc: Amod Bodas <amod.bodas@atheros.com>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When a driver advertises p2p device support,
mac80211 will handle it, but internally it will
rewrite the interface type to STA/AP rather than
P2P-STA/GO since otherwise a lot of paths need
to be touched that are otherwise identical. A
p2p boolean tells drivers whether or not a given
interface will be used for p2p or not.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This adds P2P-STA and P2P-GO as device types so
we can distinguish between those and normal STA
or AP (respectively) type interfaces.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When an interface is brought up, the recent changes
to allow changing type-while-up only set the running
bit after everything was done. This broke a number
of things, including idle calculation for monitor
interfaces, and it also broke WDS station insertion
(although nobody noticed yet).
Thus, change the code to set the running bit earlier,
but keep it after the driver's add_interface was
called because otherwise drivers may iterate over
interfaces they haven't fully set up yet.
Reported-by: Rajkumar Manoharan <rmanoharan@atheros.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Instead of using a WARN_ON(!mutex_is_locked())
use lockdep_assert_held() which compiles away
completely when lockdep isn't enabled, and
also is a more accurate assertion since it
checks that the current thread is holding the
mutex.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This code is modifying the station flags, and
as such should hold the flags lock so it can
do so atomically vs. other flags modifications
and readers. This issue was introduced when
this code was added in eccb8e8f, as it used
the wrong lock (thus not fixing the race that
was previously documented in a comment.)
Cc: stable@kernel.org [2.6.31+]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The output becomes:
[ 41.261941] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently vlan devices don't have GRO by default as none of the Ethernet
drivers add NETIF_F_GRO to their vlan_features.
As GRO is a software feature add GRO to dev->vlan_features in
register_netdevice() and let vlan_dev_init() take care that it gets
enabled only when dev->features has NETIF_F_GRO too.
Signed-off-by: Brandon Philips <bphilips@suse.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->ip_ptr is protected by rtnl and rcu.
Yet some places dont use appropriate primitives and/or locking rules.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/phonet/socket.c: In function ‘pn_res_seq_show’:
net/phonet/socket.c:726: warning: format ‘%02X’ expects type ‘unsigned int’, but argument 3 has type ‘long int’
Signed-off-by: David S. Miller <davem@davemloft.net>
I wish we could use something cleaner, such as bind(). But that would
not work since resource subscription is orthogonal/in addition to the
normal object ID allocated via bind(). This is similar to multicasting
which also uses ioctl()'s.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When both destination device and object are nul, Phonet routes the
packet according to the resource field. In fact, this is the most
common pattern when sending Phonet "request" packets. In this case,
the packet is delivered to whichever endpoint (socket) has
registered the resource.
This adds a new table so that Linux processes can register their
Phonet sockets to Phonet resources, if they have adequate privileges.
(Namespace support is not implemented at the moment.)
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Closing a pipe endpoint is not normally allowed by the Phonet pipe,
other than as a side after-effect of removing the pipe between two
endpoints. But there is no way to prevent Linux userspace processes
from being killed or suffering from bugs, so this can still happen.
We might as well forcefully close Phonet pipe endpoints then.
The cellular modem supports only a few existing pipes at a time. So we
really should not leak them. This change instructs the modem to destroy
the pipe if either of the pipe's endpoint (Linux socket) is closed too
early.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There may be applications trying to seek
on the irnet character device, so we should
use noop_llseek to avoid returning an error
when the default llseek changes to no_llseek.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipip6_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipgre_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipip_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ethtool_rawip4_spec and struct ethtool_ether_spec are neither
commented nor used by any driver, so remove them. Adjust padding in
the user-visible unions that included these structures.
Fix references to struct ethtool_rawip4_spec in
ethtool_get_rx_ntuple(), which should use struct ethtool_usrip4_spec.
struct ethtool_usrip4_spec cannot hold IPv6 host addresses and there
is no separate structure that can, so remove ETH_RX_NFC_IP6 and the
reference to it in niu.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This simplifies and consolidates the TX option-parsing code:
1. The Loss Intervals option is not currently used, so dead code related to
this option is removed. I am aware of no plans to support the option, but
if someone wants to implement it (e.g. for inter-op tests), it is better
to start afresh than having to also update currently unused code.
2. The Loss Event and Receive Rate options have a lot of code in common (both
are 32 bit, both have same length etc.), so this is consolidated.
3. The test against GSR is not necessary, because
- on first loading CCID3, ccid_new() zeroes out all fields in the socket;
- ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to
pinv = opt_recv->ccid3or_loss_event_rate;
if (pinv == ~0U || pinv == 0)
hctx->p = 0;
- as a result, the sequence number field is removed from opt_recv.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the RTT-sampling function tfrc_tx_hist_rtt(), since
1. it suffered from complex passing of return values (the return value both
indicated successful lookup while the value doubled as RTT sample);
2. when for some odd reason the sample value equalled 0, this triggered a bug
warning about "bogus Ack", due to the ambiguity of the return value;
3. on a passive host which has not sent anything the TX history is empty and
thus will lead to unwanted "bogus Ack" warnings such as
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606.
The fix is to replace the implicit encoding by performing the steps manually.
Furthermore, the "bogus Ack" warning has been removed, since it can actually be
triggered due to several reasons (network reordering, old packet, (3) above),
hence it is not very useful.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a subtle bug in the calculation of the inter-packet gap and shows
that t_delta, as it is currently used, is not needed.
The algorithm from RFC 5348, 8.3 below continually computes a send time t_nom,
which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies
the scheduling granularity, s the packet size, and X the sending rate:
t_distance = t_nom - t_now; // in microseconds
t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds
if (t_distance >= t_delta) {
reschedule after (t_distance / 1000) milliseconds;
} else {
t_ipi = s / X; // inter-packet interval in usec
t_nom += t_ipi; // compute the next send time
send packet now;
}
Problem:
--------
Rescheduling requires a conversion into milliseconds (sk_reset_timer()). The
highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher
granularity does not make much sense here.
As a consequence, values of t_distance < 1000 are truncated to 0. This issue
has so far been resolved by using instead
if (t_distance >= t_delta + 1000)
reschedule after (t_distance / 1000) milliseconds;
This is unnecessarily large, a lower bound is t_delta' = max(t_delta, 1000).
And it implies a further simplification:
a) when HZ >= 500, then t_delta <= t_gran/2 = 10^6/(2*HZ) <= 1000, so that
t_delta' = MAX(1000, t_delta) = 1000 (constant value);
b) when HZ < 500, then t_delta = 1/2*MIN(rtt, t_ipi, t_gran) <= t_gran/2,
so that 1000 <= t_delta' <= t_gran/2.
The maximum error of using a constant t_delta in (b) is less than half a jiffy.
Fix:
----
The patch replaces t_delta with a constant, whose value depends on CONFIG_HZ,
changing the above algorithm to:
if (t_distance >= t_delta')
reschedule after (t_distance / 1000) milliseconds;
where t_delta' = 10^6/(2*HZ) if HZ < 500, and t_delta' = 1000 otherwise.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
You cannot invoke __smp_call_function_single() unless the
architecture sets this symbol.
Reported-by: Daniel Hellstrom <daniel@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accept already has socket locking.
[ Extend socket locking over TCP_LISTEN state test. -DaveM ]
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accept updates socket values in 3 lines so wrapped with lock_sock.
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Listen updates socket values and needs lock_sock.
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies
statfs() gives ESTALE error
NFS: Fix a typo in nfs_sockaddr_match_ipaddr6
sunrpc: increase MAX_HASHTABLE_BITS to 14
gss:spkm3 miss returning error to caller when import security context
gss:krb5 miss returning error to caller when import security context
Remove incorrect do_vfs_lock message
SUNRPC: cleanup state-machine ordering
SUNRPC: Fix a race in rpc_info_open
SUNRPC: Fix race corrupting rpc upcall
Fix null dereference in call_allocate
netdev_wait_allrefs() waits that all references to a device vanishes.
It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users reported that no more than 4 devices can be dismantled per
second, this is a pretty serious problem for some setups.
Most of the time, a refcount is about to be released by an RCU callback,
that is still in flight because rollback_registered_many() uses a
synchronize_rcu() call instead of rcu_barrier(). Problem is visible if
number of online cpus is one, because synchronize_rcu() is then a no op.
time to remove 50 ipip tunnels on a UP machine :
before patch : real 11.910s
after patch : real 1.250s
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sta_info_get_bss() is used to match STA pointers
for VLAN/AP interfaces, but if the same station
is also added to multiple other interfaces it
will erroneously match because both pointers are
NULL, fix this by ignoring NULL pointers here.
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is needed to avoid warning in ieee80211_restart_hw about hardware
scan in progress.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Wey-Yi W Guy <wey-yi.w.guy@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
hdr pointer is left dangling after call to ieee80211_skb_resize. This
can cause guards around mesh path selection to fail.
Signed-off-by: Steve deRosier <steve@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There is a path in nl80211_set_wiphy where result is tested but
uninitialized.
I am hitting this path when I attempt:
sh# iw dev wlan0 set channel 10
command failed: Unknown error 1069727332 (-1069727332)
Signed-off-by: William Jordan <bjordan@rajant.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Replace sizeof(rtap_namespace_sizes) / sizeof(rtap_namespace_sizes[0])
with ARRAY_SIZE(rtap_namespace_sizes) in net/wireless/radiotap.c
Signed-off-by: Nikitas Angelinas <nikitasangelinas@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Allocate hash tables for every online cpus, not every possible ones.
NUMA aware allocations.
Dont use a full page on arches where PAGE_SIZE > 1024*sizeof(void *)
misc:
__percpu , __read_mostly, __cpuinit annotations
flow_compare_t is just an "unsigned long"
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While integrating your man-pages patch for IP_NODEFRAG, I noticed
that this option is settable by setsockopt(), but not gettable by
getsockopt(). I suppose this is not intended. The (untested,
trivial) patch below adds getsockopt() support.
Signed-off-by: Michael kerrisk <mtk.manpages@gmail.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After all these years, it turns out that the
/proc/sys/net/ipv4/conf/*/force_igmp_version
parameter isn't fully implemented.
*Symptom*:
When set force_igmp_version to a value of 2, the kernel should only perform
multicast IGMPv2 operations (IETF rfc2236). An host-initiated Join message
will be sent as a IGMPv2 Join message. But if a IGMPv3 query message is
received, the host responds with a IGMPv3 join message. Per rfc3376 and
rfc2236, a IGMPv2 host should treat a IGMPv3 query as a IGMPv2 query and
respond with an IGMPv2 Join message.
*Consequences*:
This is an issue when a IGMPv3 capable switch is the querier and will only
issue IGMPv3 queries (which double as IGMPv2 querys) and there's an
intermediate switch that is only IGMPv2 capable. The intermediate switch
processes the initial v2 Join, but fails to recognize the IGMPv3 Join responses
to the Query, resulting in a dropped connection when the intermediate v2-only
switch times it out.
*Identifying issue in the kernel source*:
The issue is in this section of code (in net/ipv4/igmp.c), which is called when
an IGMP query is received (from mainline 2.6.36-rc3 gitweb):
...
A IGMPv3 query has a length >= 12 and no sources. This routine will exit after
line 880, setting the general query timer (random timeout between 0 and query
response time). This calls igmp_gq_timer_expire():
...
.. which only sends a v3 response. So if a v3 query is received, the kernel
always sends a v3 response.
IGMP queries happen once every 60 sec (per vlan), so the traffic is low. A
IGMPv3 query *is* a strict superset of a IGMPv2 query, so this patch properly
short circuit's the v3 behaviour.
One issue is that this does not address force_igmp_version=1. Then again, I've
never seen any IGMPv1 multicast equipment in the wild. However there is a lot
of v2-only equipment. If it's necessary to support the IGMPv1 case as well:
837 if (len == 8 || IGMP_V2_SEEN(in_dev) || IGMP_V1_SEEN(in_dev)) {
Signed-off-by: David S. Miller <davem@davemloft.net>
The members of struct llc_sock are unsigned so if we pass a negative
value for "opt" it can cause a sign bug. Also it can cause an integer
overflow when we multiply "opt * HZ".
CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The maximum size of the authcache is now set to 1024 (10 bits),
but on our server we need at least 4096 (12 bits). Increase
MAX_HASHTABLE_BITS to 14. This is a maximum of 16384 entries,
each containing a pointer (8 bytes on x86_64). This is
exactly the limit of kmalloc() (128K).
Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
spkm3 miss returning error to up layer when import security context,
it may be return ok though it has failed to import security context.
Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
krb5 miss returning error to up layer when import security context,
it may be return ok though it has failed to import security context.
Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is just a minor cleanup: net/sunrpc/clnt.c clarifies the rpc client
state machine by commenting each state and by laying out the functions
implementing each state in the order that each state is normally
executed (in the absence of errors).
The previous patch "Fix null dereference in call_allocate" changed the
order of the states. Move the functions and update the comments to
reflect the change.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is a race between rpc_info_open and rpc_release_client()
in that nothing stops a process from opening the file after
the clnt->cl_kref goes to zero.
Fix this by using atomic_inc_unless_zero()...
Reported-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
If rpc_queue_upcall() adds a new upcall to the rpci->pipe list just
after rpc_pipe_release calls rpc_purge_list(), but before it calls
gss_pipe_release (as rpci->ops->release_pipe(inode)), then the latter
will free a message without deleting it from the rpci->pipe list.
We will be left with a freed object on the rpc->pipe list. Most
frequent symptoms are kernel crashes in rpc.gssd system calls on the
pipe in question.
Reported-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
In call_allocate we need to reach the auth in order to factor au_cslack
into the allocation.
As of a17c2153d2 "SUNRPC: Move the bound
cred to struct rpc_rqst", call_allocate attempts to do this by
dereferencing tk_client->cl_auth, however this is not guaranteed to be
defined--cl_auth can be zero in the case of gss context destruction (see
rpc_free_auth).
Reorder the client state machine to bind credentials before allocating,
so that we can instead reach the auth through the cred.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
The list_head conversion unearther an unnecessary flow
check. Since flow is always NULL here we don't need to
see if a matching flow exists already.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that est_tree_lock is acquired with BH protection, the other
call is unnecessary.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use rcu_dereference_rtnl() helper
Change hard coded constants in fib_flag_trans()
7 -> RTN_UNREACHABLE
8 -> RTN_PROHIBIT
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove code that trimmed excess trailing info from incoming messages
arriving over an Ethernet interface. TIPC now ignores the extra info
while the message is being processed by the node, and only trims it off
if the message is retransmitted to another node. (This latter step is
done to ensure the extra info doesn't cause the sk_buff to exceed the
outgoing interface's MTU limit.) The outgoing buffer is guaranteed to
be linear.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm4_tunnel_register() & xfrm6_tunnel_register() should
use rcu_assign_pointer() to make sure previous writes
(to handler->next) are committed to memory before chain
insertion.
deregister functions dont need a particular barrier.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__lock_sock() and __release_sock() releases and regrabs lock but
were missing proper annotations. Add it. This removes following
warning from sparse. (Currently __lock_sock() does not emit any
warning about it but I think it is better to add also.)
net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
move_addr_to_kernel() and copy_from_user() requires their argument
as __user pointer but were missing proper markups. Add it.
This removes following warnings from sparse.
net/core/iovec.c:44:52: warning: incorrect type in argument 1 (different address spaces)
net/core/iovec.c:44:52: expected void [noderef] <asn:1>*uaddr
net/core/iovec.c:44:52: got void *msg_name
net/core/iovec.c:55:34: warning: incorrect type in argument 2 (different address spaces)
net/core/iovec.c:55:34: expected void const [noderef] <asn:1>*from
net/core/iovec.c:55:34: got struct iovec *msg_iov
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a list_has_sctp_addr function to simplify loop
Based on a patches by Dan Carpenter and David Miller
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).
Problem is that following sequence :
fd = socket(...)
connect(fd, &remote, ...)
not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)
Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.
When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.
One solution to this problem is to rehash datagram socket if needed.
We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.
This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.
Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
friends.
inet_protos[] & inet6_protos[] are moved to read_mostly section
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two CMSGs for masked versions of cswp and fadd. args
struct modified to use a union for different atomic op type's
arguments. Change IB to do masked atomic ops. Atomic op type
in rds_message similarly unionized.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
This prints the constant identifier for work completion status and rdma
cm event types, like we already do for IB event types.
A core string array helper is added that each string type uses.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Nothing was canceling the send and receive work that might have been
queued as a conn was being destroyed.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
rds_conn_shutdown() can return before the connection is shut down when
it encounters an existing state that it doesn't understand. This lets
rds_conn_destroy() then start tearing down the conn from under paths
that are still using it.
It's more reliable the shutdown work and wait for krdsd to complete the
shutdown callback. This stopped some hangs I was seeing where krdsd was
trying to shut down a freed conn.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Right now there's nothing to stop the various paths that use
rs->rs_transport from racing with rmmod and executing freed transport
code. The simple fix is to have binding to a transport also hold a
reference to the transport's module, removing this class of races.
We already had an unused t_owner field which was set for the modular
transports and which wasn't set for the built-in loop transport.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
rs_transport is now also used by the rdma paths once the socket is
bound. We don't need this stale comment to tell us what cscope can.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
rds_conn_destroy() can race with all other modifications of the
rds_conn_count but it was modifying the count without locking.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
The RDS IB device list wasn't protected by any locking. Traversal in
both the get_mr and FMR flushing paths could race with additon and
removal.
List manipulation is done with RCU primatives and is protected by the
write side of a rwsem. The list traversal in the get_mr fast path is
protected by a rcu read critical section. The FMR list traversal is
more problematic because it can block while traversing the list. We
protect this with the read side of the rwsem.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
It's nice to not have to go digging in the code to see which event
occurred. It's easy to throw together a quick array that maps the ib
event enums to their strings. I didn't see anything in the stack that
does this translation for us, but I also didn't look very hard.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Flushing FMRs is somewhat expensive, and is currently kicked off when
the interrupt handler notices that we are getting low. The result of
this is that FMR flushing only happens from the interrupt cpus.
This spreads the load more effectively by triggering flushes just before
we allocate a new FMR.
Signed-off-by: Chris Mason <chris.mason@oracle.com>