Commit 86bcebafc5 ("tcp: fix >2 iw selection") fixed a case
when congestion window initialization has been mistakenly omitted
by introducing cwnd label and putting backwards goto from the
end of the function.
This makes the code unnecessarily tricky to read and understand
on a first sight.
Shuffle the code around a little bit to make it more obvious.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfq_walk() runs without qdisc lock. By the time it selects a non empty
hash slot and sfq_dump_class_stats() is run (with lock held), slot might
have been freed : We then access q->slots[SFQ_EMPTY_SLOT], out of
bounds, and crash in slot_queue_walk()
On previous kernels, bug is here but out of bounds qs[SFQ_DEPTH] and
allot[SFQ_DEPTH] are located in struct sfq_sched_data, so no illegal
memory access happens, only possibly wrong data reported to user.
Also, slot_dequeue_tail() should make sure slot skb chain is correctly
terminated, or sfq_dump_class_stats() can access freed skbs.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can translate pseudo load instructions at filter check time to
dedicated instructions to speed up filtering and avoid one switch().
libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use
other ancillary accesses.
Note : I made the assertion that ancillary data was always accessed with
BPF_LD|BPF_?|BPF_ABS instructions, not with BPF_LD|BPF_?|BPF_IND ones
(offset given by K constant, not by K + X register)
On x86_64, this saves a few bytes of text :
# size net/core/filter.o.*
text data bss dec hex filename
4864 0 0 4864 1300 net/core/filter.o.new
4944 0 0 4944 1350 net/core/filter.o.old
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le vendredi 17 décembre 2010 à 10:26 +0100, Eric Dumazet a écrit :
>
> I think we can add this after latest Changli patch :
>
> He does one skb_clone() before calling the sniffers.
> We could set timestamp on this clone, instead of original skb.
>
> Problem solved.
>
[PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit
Now we do one clone of skb if at least one sniffer might take packet,
we also can do the skb timestamping on the clone and let original packet
unchanged.
This is a generalization of commit 8caf153974 (net: sch_netem: Fix an
inconsistency in ingress netem timestamps.)
This way, we can have a good idea when packets are delivered to our
stack (tcpdump -i ifb0), while a tcpdump on original device gives
timestamps right before ingressing.
This also speedup our stack, avoiding taking timestamps if not needed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the default initial receive window to 10 mss
(defined constant). The default window is limited to the maximum
of 10*1460 and 2*mss (when mss > 1460).
draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
increasing TCP's initial congestion window to 10 mss or about 15KB.
Leading up to this proposal were several large-scale live Internet
experiments with an initial congestion window of 10 mss (IW10), where
we showed that the average latency of HTTP responses improved by
approximately 10%. This was accompanied by a slight increase in
retransmission rate (0.5%), most of which is coming from applications
opening multiple simultaneous connections. To understand the extreme
worst case scenarios, and fairness issues (IW10 versus IW3), we further
conducted controlled testbed experiments. We came away finding minimal
negative impact even under low link bandwidths (dial-ups) and small
buffers. These results are extremely encouraging to adopting IW10.
However, an initial congestion window of 10 mss is useless unless a TCP
receiver advertises an initial receive window of at least 10 mss.
Fortunately, in the large-scale Internet experiments we found that most
widely used operating systems advertised large initial receive windows
of 64KB, allowing us to experiment with a wide range of initial
congestion windows. Linux systems were among the few exceptions that
advertised a small receive window of 6KB. The purpose of this patch is
to fix this shortcoming.
References:
1. A comprehensive list of all IW10 references to date.
http://code.google.com/speed/protocols/tcpm-IW10.html
2. Paper describing results from large-scale Internet experiments with IW10.
http://ccr.sigcomm.org/drupal/?q=node/621
3. Controlled testbed experiments under worst case scenarios and a
fairness study.
http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf
4. Raw test data from testbed experiments (Linux senders/receivers)
with initial congestion and receive windows of both 10 mss.
http://research.csc.ncsu.edu/netsrv/?q=content/iw10
5. Internet-Draft. Increasing TCP's Initial Window.
https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a respin of patch.
I'll send a short patch to make SFQ more fair in presence of large
packets as well.
Thanks
[PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
This patch shrinks sizeof(struct sfq_sched_data)
from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and
reduce text size as well.
text data bss dec hex filename
4821 152 0 4973 136d old/net/sched/sch_sfq.o
4627 136 0 4763 129b new/net/sched/sch_sfq.o
All data for a slot/flow is now grouped in a compact and cache friendly
structure, instead of being spreaded in many different points.
struct sfq_slot {
struct sk_buff *skblist_next;
struct sk_buff *skblist_prev;
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
unsigned short hash; /* hash value (index in ht[]) */
short allot; /* credit for this slot */
};
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deploying SFQ/IFB here at work, I found the allot management was
pretty wrong in sfq, even changing allot from short to int...
We should init allot for each new flow, not using a previous value found
in slot.
Before patch, I saw bursts of several packets per flow, apparently
denying the default "quantum 1514" limit I had on my SFQ class.
class sfq 11:1 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 7p requeues 0
allot 11546
class sfq 11:46 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 1p requeues 0
allot -23873
class sfq 11:78 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 5p requeues 0
allot 11393
After patch, better fairness among each flow, allot limit being
respected, allot is positive :
class sfq 11:e parent 11:
(dropped 0, overlimits 0 requeues 86)
backlog 0b 3p requeues 86
allot 596
class sfq 11:94 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1468
class sfq 11:a4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 4p requeues 0
allot 650
class sfq 11:bb parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 596
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently return for each active SFQ slot the number of packets in
queue. We can also give number of bytes accounted for these packets.
tc -s class show dev ifb0
Before patch :
class sfq 11:3d9 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1266
After patch :
class sfq 11:3e4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 4380b 3p requeues 0
allot 1212
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flush the routing cache only of entries that match the
network namespace in which the purge event occurred.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
hna_local_fill_buffer must return the number of added hna entries and
not the last checked hash bucket.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove "pktgen: " prefix string from one pr_info.
pr_fmt adds it, so this is a duplicate.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dev_queue_xmit_nit(), we have to clone skbs as we need to mangle skbs,
however, we don't need to clone skbs for all the packet_types.
Except for the first packet_type, we increase skb->users instead of
skb_clone().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies IPsec6 to fragment IPv6 packets that are
locally generated as needed.
This version of the patch only fragments in tunnel mode, so that fragment
headers will not be obscured by ESP in transport mode.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When device is being set to down, neigh_ifdown was being called
twice. Once from addrconf notifier and once from ndisc notifier.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove (unnecessary) casts to make code cleaner.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When loopback device is being brought down, then keep the route table
entries because they are special. The entries in the local table for
linklocal routes and ::1 address should not be purged.
This is a sub optimal solution to the problem and should be replaced
by a better fix in future.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get the sctp partial delivery point using SCTP_PARTIAL_DELIVERY_POINT
socket option should return 0 if success, not -ENOTSUPP.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().
Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a missing ntohs() for bridge IPv6 multicast snooping.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Special care is taken inside sk_port_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.
The patch fixes the following crash:
BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
[<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
[<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
[<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
[<ffffffff812eda35>] udp_rcv+0x15/0x20
[<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
[<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
[<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
[<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
[<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0
Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.
$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy
Without the patch With the patch
real 0m 24.63s real 0m 5.15s
user 0m 0.00s user 0m 0.00s
sys 0m 6.05s sys 0m 5.14s
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
B.A.T.M.A.N. (better approach to mobile ad-hoc networking) is a routing
protocol for multi-hop ad-hoc mesh networks. The networks may be wired or
wireless. See http://www.open-mesh.org/ for more information and user space
tools.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first big packets sent to a "low-MTU" client correctly
triggers the creation of a temporary route containing the reduced MTU.
But after the temporary route has expired, new ICMP6 "packet too big"
will be sent, rt6_pmtu_discovery will find the previous EXPIRED route
check that its mtu isn't bigger then in icmp packet and do nothing
before the temporary route will not deleted by gc.
I make the simple experiment:
while :; do
time ( dd if=/dev/zero bs=10K count=1 | ssh hostname dd of=/dev/null ) || break;
done
The "time" reports real 0m0.197s if a temporary route isn't expired, but
it reports real 0m52.837s (!!!!) immediately after a temporare route has
expired.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.
Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().
Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.
For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.
Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates. This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.
Signed-off-by: David S. Miller <davem@davemloft.net>
On suspend, there might be usb wireless drivers which wrongly trigger
the warning in ieee80211_work_work. If an usb driver doesn't have a
suspend hook, the usb stack will disconnect the device. On disconnect,
a mac80211 driver calls ieee80211_unregister_hw, which calls dev_close,
which calls ieee80211_stop, and in the end calls ieee80211_work_purge->
ieee80211_work_work.
The problem is that this call to ieee80211_work_purge comes after
mac80211 is suspended, triggering the warning even when we don't have
work queued in work_list (the expected case when already suspended),
because it always calls ieee80211_work_work.
So, just call ieee80211_work_work in ieee80211_work_purge if we really
have to abort work. This addresses the warning reported at
https://bugzilla.kernel.org/show_bug.cgi?id=24402
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
dev_open will eventually call ieee80211_ibss_join which sets up the
skb used for beacons/probe-responses however it is possible to
receive beacons that attempt to merge before this occurs causing
a null pointer dereference. Check ssid_len as that is the last
thing set in ieee80211_ibss_join.
This occurs quite easily in the presence of adhoc nodes with hidden SSID's
revised previous patch to check further up based on irc feedback
Signed-off-by: Tim Harvey <harvey.tim@gmail.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Always go through a new ip4_dst_hoplimit() helper, just like ipv6.
This allowed several simplifications:
1) The interim dst_metric_hoplimit() can go as it's no longer
userd.
2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.
3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.
When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.
We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h
Signed-off-by: David S. Miller <davem@davemloft.net>
This is for consistency with ipv4. Using "-1" makes
no sense.
It was made this way a long time ago merely to be consistent
with how the ipv6 socket hoplimit "default" is stored.
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit c1f19b51d1 (net: support time stamping in phy devices.),
kernel might crash if CONFIG_NETWORK_PHY_TIMESTAMPING=y and
skb_defer_rx_timestamp() handles a packet without an ethernet header.
Fixes kernel bugzilla #24102
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=24102
Reported-and-tested-by: Andrew Watts <akwatts@ymail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_pre_routing functions in bridging have collected two
distinct ways of returning NF_DROP over the years, inline and
via goto. There is no reason for preferring either one.
So this patch arbitrarily picks the inline variant and converts
the all the gotos.
Also removes a redundant comment.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
While an interface is down, many implementations of
ethtool_ops::get_link, including the default, ethtool_op_get_link(),
will report the last link state seen while the interface was up. In
general the current physical link state is not available if the
interface is down.
Define ETHTOOL_GLINK to reflect whether the interface *and* any
physical port have a working link, and consistently return 0 when the
interface is down.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ATM subsystem was incorrectly creating the 'device' link for ATM
nodes in sysfs. This led to incorrect device/parent relationships
exposed by sysfs and udev. Instead of rolling the 'device' link by hand
in the generic ATM code, pass each ATM driver's bus device down to the
sysfs code and let sysfs do this stuff correctly.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We know for sure pktgen is going to write skb->data right after
*_alloc_skb, causing unnecessary cache misses.
Idea is to add a prefetchw() call to prefetch the first cache line
indicated by skb->data. On systems with Adjacent Cache Line Prefetch,
it's probably two cache lines are prefetched.
With this patch, pktgen on Intel SR1625 server with two E5530
quad-core processors and a single ixgbe-based NIC went from 8.63Mpps
to 9.03Mpps, with 4.6% improvement.
Signed-off-by: Junchang Wang <junchangwang@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP_SET_PEER_PRIMARY_ADDR does not accpet v4mapped address, using
v4mapped address in SCTP_SET_PEER_PRIMARY_ADDR socket option will
get -EADDRNOTAVAIL error if v4map is enabled. This patch try to
fix it by mapping v4mapped address to v4 address if allowed.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add TFC padding to all packets smaller than the boundary configured
on the xfrm state. If the boundary is larger than the PMTU, limit
padding to the PMTU.
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>