Commit Graph

471225 Commits

Author SHA1 Message Date
Eric Dumazet
55a93b3ea7 qdisc: validate skb without holding lock
Validation of skb can be pretty expensive :

GSO segmentation and/or checksum computations.

We can do this without holding qdisc lock, so that other cpus
can queue additional packets.

Trick is that requeued packets were already validated, so we carry
a boolean so that sch_direct_xmit() can validate a fresh skb list,
or directly use an old one.

Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
host.

Turning TSO on or off had no effect on throughput, only few more cpu
cycles. Lock contention on qdisc lock disappeared.

Same if disabling TX checksum offload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 15:36:11 -07:00
Tobias Klauser
6a05880a8b net: ethernet: Remove superfluous ether_setup after alloc_etherdev
There is no need to call ether_setup after alloc_ethdev since it was
already called there.

Follow commits c706471b26 ("net: axienet: remove unnecessary
ether_setup after alloc_etherdev") and 3c87dcbfb3 ("net: ll_temac:
Remove unnecessary ether_setup after alloc_etherdev") and fix the
pattern in all remaining ethernet drivers.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 15:31:40 -07:00
David S. Miller
c2bf5ec204 Merge branch 'qdisc_bulk_dequeue'
Jesper Dangaard Brouer says:

====================
qdisc: bulk dequeue support

This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
from the qdisc layer, to implement dequeue bulking.

Patch01: "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"
 - Implement basic qdisc dequeue bulking
 - This time, 100% relying on BQL limits, no magic safe-guard constants

Patch02: "qdisc: dequeue bulking also pickup GSO/TSO packets"
 - Extend bulking to bulk several GSO/TSO packets
 - Seperate patch, as it introduce a small regression, see test section.

We do have a patch03, which exports a userspace tunable as a BQL
tunable, that can byte-cap or disable the bulking/bursting.  But we
could not agree on it internally, thus not sending it now.  We
basically strive to avoid adding any new userspace tunable.

Testing patch01:
================
 Demonstrating the performance improvement of qdisc dequeue bulking, is
tricky because the effect only "kicks-in" once the qdisc system have a
backlog. Thus, for a backlog to form, we need either 1) to exceed wirespeed
of the link or 2) exceed the capability of the device driver.

For practical use-cases, the measureable effect of this will be a
reduction in CPU usage

01-TCP_STREAM:
--------------
Testing effect for TCP involves disabling TSO and GSO, because TCP
already benefit from bulking, via TSO and especially for GSO segmented
packets.  This patch view TSO/GSO as a seperate kind of bulking, and
avoid further bulking of these packet types.

The measured perf diff benefit (at 10Gbit/s) for a single netperf
TCP_STREAM were 9.24% less CPU used on calls to _raw_spin_lock()
(mostly from sch_direct_xmit).

If my E5-2695v2(ES) CPU is tuned according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
Then it is possible that a single netperf TCP_STREAM, with GSO and TSO
disabled, can utilize all bandwidth on a 10Gbit/s link.  This will
then cause a standing backlog queue at the qdisc layer.

Trying to pressure the system some more CPU util wise, I'm starting
24x TCP_STREAMs and monitoring the overall CPU utilization.  This
confirms bulking saves CPU cycles when it "kicks-in".

Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
 * Disabled bulking: sys:2.58%  soft:8.50%  idle:88.78%
 * Enabled  bulking: sys:2.43%  soft:7.66%  idle:89.79%

02-UDP_STREAM
-------------
The measured perf diff benefit for UDP_STREAM were 6.41% less CPU used
on calls to _raw_spin_lock().  24x UDP_STREAM with packet size -m 1472 (to
avoid sending UDP/IP fragments).

03-trafgen driver test
----------------------
The performance of the 10Gbit/s ixgbe driver is limited due to
updating the HW ring-queue tail-pointer on every packet.  As
previously demonstrated with pktgen.

Using trafgen to send RAW frames from userspace (via AF_PACKET), and
forcing it through qdisc path (with option --qdisc-path and -t0),
sending with 12 CPUs.

I can demonstrate this driver layer limitation:
 * 12.8 Mpps with no qdisc bulking
 * 14.8 Mpps with qdisc bulking (full 10G-wirespeed)

Testing patch02:
================
Testing Bulking several GSO/TSO packets:

Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec for 10G while transmitting
approx 813Kpps).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec, for 10G
while transmitting approx 813Kpps).

 Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

Corrosponding to:
 (10000*10^6)*((0.50-0.47)/10^3)/8 = 37500 bytes
 (10000*10^6)*((0.50-0.38)/10^3)/8 = 150000 bytes
 37500/1500  = 25 pkts
 150000/1500 = 100 pkts

 Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
diff of 0.12ms corrosponding to 1500 bytes at 100Mbit/s. Bulking
several GSOs together, result in a stable high-prio queue delay of
2.23ms.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 12:37:23 -07:00
Jesper Dangaard Brouer
808e7ac0bd qdisc: dequeue bulking also pickup GSO/TSO packets
The TSO and GSO segmented packets already benefit from bulking
on their own.

The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.

The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7f9 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f2a ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list.  And via commit
ce93718fb7 ("net: Don't keep around original SKB when we
software segment GSO frames.").

This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.

Testing:
 Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).

 Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

 Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 12:37:06 -07:00
Jesper Dangaard Brouer
5772e9a346 qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
 (1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet.  The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
 (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ.  This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()).  In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit().  In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits.  In-effect bulking will only happen for BQL enabled drivers.

Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.

For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets.  They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.

Jointed work with Hannes, Daniel and Florian.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 12:37:06 -07:00
Mark Einon
38df6492eb et131x: Add PCIe gigabit ethernet driver et131x to drivers/net
This adds the ethernet driver for Agere et131x devices to
drivers/net/ethernet.

The driver being added has been in the staging tree for some time, and will be
removed from there in a seperate patch. This one merely disables the staging
version to prevent two instances being built.

Signed-off-by: Mark Einon <mark.einon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 12:22:19 -07:00
David S. Miller
739e4a758e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	net/netfilter/nfnetlink.c

Both r8152 and nfnetlink conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02 11:25:43 -07:00
Linus Torvalds
50dddff3cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't halt the firmware in r8152 driver, from Hayes Wang.

 2) Handle full sized 802.1ad frames in bnx2 and tg3 drivers properly,
    from Vlad Yasevich.

 3) Don't sleep while holding tx_clean_lock in netxen driver, fix from
    Manish Chopra.

 4) Certain kinds of ipv6 routes can end up endlessly failing the route
    validation test, causing it to be re-looked up over and over again.
    This particularly kills input route caching in TCP sockets.  Fix
    from Hannes Frederic Sowa.

 5) netvsc_start_xmit() has a use-after-free access to skb->len, fix
    from K Y Srinivasan.

 6) Fix matching of inverted containers in ematch module, from Ignacy
    Gawędzki.

 7) Aggregation of GRO frames via SKB ->frag_list for linear skbs isn't
    handled properly, regression fix from Eric Dumazet.

 8) Don't test return value of ipv4_neigh_lookup(), which returns an
    error pointer, against NULL.  From WANG Cong.

 9) Fix an old regression where we mistakenly allow a double add of the
    same tunnel.  Fixes from Steffen Klassert.

10) macvtap device delete and open can run in parallel and corrupt lists
    etc., fix from Vlad Yasevich.

11) Fix build error with IPV6=m NETFILTER_XT_TARGET_TPROXY=y, from Pablo
    Neira Ayuso.

12) rhashtable_destroy() triggers lockdep splats, fix also from Pablo.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  bna: Update Maintainer Email
  r8152: disable power cut for RTL8153
  r8152: remove clearing bp
  bnx2: Correctly receive full sized 802.1ad fragmes
  tg3: Allow for recieve of full-size 8021AD frames
  r8152: fix setting RTL8152_UNPLUG
  netxen: Fix bug in Tx completion path.
  netxen: Fix BUG "sleeping function called from invalid context"
  ipv6: remove rt6i_genid
  hyperv: Fix a bug in netvsc_start_xmit()
  net: stmmac: fix stmmac_pci_probe failed when CONFIG_HAVE_CLK is selected
  ematch: Fix matching of inverted containers.
  gro: fix aggregation for skb using frag_list
  neigh: check error pointer instead of NULL for ipv4_neigh_lookup()
  ip6_gre: Return an error when adding an existing tunnel.
  ip6_vti: Return an error when adding an existing tunnel.
  ip6_tunnel: Return an error when adding an existing tunnel.
  ip6gre: add a rtnl link alias for ip6gretap
  net/mlx4_core: Allow not to specify probe_vf in SRIOV IB mode
  r8152: fix the carrier off when autoresuming
  ...
2014-10-01 21:29:06 -07:00
Rasesh Mody
439e9575e7 bna: Update Maintainer Email
Update the maintainer email for BNA driver.

Signed-off-by: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:13:41 -04:00
Petri Gynther
d068b02cfd net: phy: add BCM7425 and BCM7429 PHYs
Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:12:48 -04:00
Petri Gynther
bc23333ba1 net: bcmgenet: fix bcmgenet_put_tx_csum()
bcmgenet_put_tx_csum() needs to return skb pointer back to the caller
because it reallocates a new one in case of lack of skb headroom.

Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:11:49 -04:00
Alexei Starovoitov
38b2cf2982 net: pktgen: packet bursting via skb->xmit_more
This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)

burst=1 is the default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
        2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
        3rd copy with xmit_more=false

Performance with ixgbe (usec 30):
burst=1  tx:9.2 Mpps
burst=2  tx:13.5 Mpps
burst=3  tx:14.5 Mpps full 10G line rate

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:08:12 -04:00
Florian Fainelli
775dd692bd net: bridge: add a br_set_state helper function
In preparation for being able to propagate port states to e.g: notifiers
or other kernel parts, do not manipulate the port state directly, but
instead use a helper function which will allow us to do a bit more than
just setting the state.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:03:50 -04:00
WANG Cong
a0efb80ce3 net_sched: avoid calling tcf_unbind_filter() in call_rcu callback
This fixes the following crash:

[   63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
[   63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
[   63.980094] RIP: 0010:[<ffffffff817e6d07>]  [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d
[   63.980094] RSP: 0018:ffff880117dffcc0  EFLAGS: 00010202
[   63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
[   63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[   63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
[   63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
[   63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
[   63.980094] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[   63.980094] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
[   63.980094] Stack:
[   63.980094]  ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
[   63.980094]  ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
[   63.980094]  ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
[   63.980094] Call Trace:
[   63.980094]  [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d
[   63.980094]  [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691
[   63.980094]  [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691
[   63.980094]  [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d
[   63.980094]  [<ffffffff810780a4>] __do_softirq+0x142/0x323
[   63.980094]  [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53
[   63.980094]  [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221
[   63.980094]  [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33
[   63.980094]  [<ffffffff8108e44d>] kthread+0xc9/0xd1
[   63.980094]  [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125
[   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
[   63.980094]  [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0
[   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61

tp could be freed in call_rcu callback too, the order is not guaranteed.

John Fastabend says:

====================
Its worth noting why this is safe. Any running schedulers will either
read the valid class field or it will be zeroed.

All schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.
====================

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:00:42 -04:00
WANG Cong
6e0565697a net_sched: fix another crash in cls_tcindex
This patch fixes the following crash:

[  166.670795] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  166.674230] IP: [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
[  166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
[  166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
[  166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
[  166.674230] RIP: 0010:[<ffffffff814b739f>]  [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
[  166.674230] RSP: 0018:ffff8800cae7f7d0  EFLAGS: 00010207
[  166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
[  166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
[  166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
[  166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
[  166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
[  166.674230] FS:  00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[  166.674230] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
[  166.674230] Stack:
[  166.674230]  ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
[  166.674230]  ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
[  166.674230]  0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
[  166.674230] Call Trace:
[  166.674230]  [<ffffffff814b73e8>] list_del+0xd/0x2b
[  166.674230]  [<ffffffff817caeec>] tcf_action_destroy+0x4c/0x71
[  166.674230]  [<ffffffff817ca0ce>] tcf_exts_destroy+0x20/0x2d
[  166.674230]  [<ffffffff817ec2b5>] tcindex_delete+0x196/0x1b7

struct list_head can not be simply copied and we should always init it.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:00:42 -04:00
David S. Miller
25e379c475 Merge branch 'udp_gso'
Tom Herbert says:

====================
udp: Generalize GSO for UDP tunnels

This patch set generalizes the UDP tunnel segmentation functions so
that they can work with various protocol encapsulations. The primary
change is to set the inner_protocol field in the skbuff when creating
the encapsulated packet, and then in skb_udp_tunnel_segment this data
is used to determine the function for segmenting the encapsulated
packet. The inner_protocol field is overloaded to take either an
Ethertype or IP protocol.

The inner_protocol is set on transmit using skb_set_inner_ipproto or
skb_set_inner_protocol functions. VXLAN and IP tunnels (for fou GSO)
were modified to call these.

Notes:
  - GSO for GRE/UDP where GRE checksum is enabled does not work.
    Handling this will require some special case code.
  - Software GSO now supports many varieties of encapsulation with
    SKB_GSO_UDP_TUNNEL{_CSUM}. We still need a mechanism to query
    for device support of particular combinations (I intend to
    add ndo_gso_check for that).
  - MPLS seems to be the only previous user of inner_protocol. I don't
    believe these patches can affect that. For supporting GSO with
    MPLS over UDP, the inner_protocol should be set using the
    helper functions in this patch.
  - GSO for L2TP/UDP should also be straightforward now.

v2:
  - Respin for Eric's restructuring of skbuff.

Tested GRE, IPIP, and SIT over fou as well as VLXAN. This was
done using 200 TCP_STREAMs in netperf.

 GRE
    IPv4, FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        14.04% TX CPU utilization
        13.17% RX CPU utilization
        9211 Mbps
      TCP_STREAM TSO disabled on tun interface
        27.82% TX CPU utilization
        25.41% RX CPU utilization
        9336 Mbps
    IPv4, FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        13.14% TX CPU utilization
        23.18% RX CPU utilization
        9277 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.00% TX CPU utilization
        31.28% RX CPU utilization
        9327 Mbps

  IPIP
    FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        15.28% TX CPU utilization
        13.92% RX CPU utilization
        9342 Mbps
      TCP_STREAM TSO disabled on tun interface
        27.82% TX CPU utilization
        25.41% RX CPU utilization
        9336 Mbps
    FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        15.08% TX CPU utilization
        24.64% RX CPU utilization
        9226 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.00% TX CPU utilization
        31.28% RX CPU utilization
        9327 Mbps

  SIT
    FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        14.47% TX CPU utilization
        14.58% RX CPU utilization
        9106 Mbps
      TCP_STREAM TSO disabled on tun interface
        31.82% TX CPU utilization
        30.82% RX CPU utilization
        9204 Mbps
    FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        15.70% TX CPU utilization
        27.93% RX CPU utilization
        9097 Mbps
      TCP_STREAM TSO disabled on tun interface
        33.48% TX CPU utilization
        37.36% RX CPU utilization
        9197 Mbps

   VXLAN
      TCP_STREAM TSO enabled on tun interface
        16.42% TX CPU utilization
        23.66% RX CPU utilization
        9081 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.32% TX CPU utilization
        30.55% RX CPU utilization
        9185 Mbps

   Baseline (no encp, TSO and LRO enabled)
      TCP_STREAM
        11.85% TX CPU utilization
        15.13% RX CPU utilization
        9452 Mbps
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:58 -04:00
Tom Herbert
996c9fd167 vxlan: Set inner protocol before transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
ETH_P_TEB before transmit. This is needed for GSO with UDP tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:52 -04:00
Tom Herbert
54bc9bac30 gre: Set inner protocol in v4 and v6 GRE transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
protocol being encapsulation by GRE before tunnel_xmit. This is
needed for GSO if UDP encapsulation (fou) is being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
077c5a0948 ipip: Set inner IP protocol in ipip
Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV4
before tunnel_xmit. This is needed if UDP encapsulation (fou) is
being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
469471cdfc sit: Set inner IP protocol in sit
Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV6
before tunnel_xmit. This is needed if UDP encapsulation (fou) is
being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
8bce6d7d0d udp: Generalize skb_udp_segment
skb_udp_segment is the function called from udp4_ufo_fragment to
segment a UDP tunnel packet. This function currently assumes
segmentation is transparent Ethernet bridging (i.e. VXLAN
encapsulation). This patch generalizes the function to
operate on either Ethertype or IP protocol.

The inner_protocol field must be set to the protocol of the inner
header. This can now be either an Ethertype or an IP protocol
(in a union). A new flag in the skbuff indicates which type is
effective. skb_set_inner_protocol and skb_set_inner_ipproto
helper functions were added to set the inner_protocol. These
functions are called from the point where the tunnel encapsulation
is occuring.

When skb_udp_tunnel_segment is called, the function to segment the
inner packet is selected based on the inner IP or Ethertype. In the
case of an IP protocol encapsulation, the function is derived from
inet[6]_offloads. In the case of Ethertype, skb->protocol is
set to the inner_protocol and skb_mac_gso_segment is called. (GRE
currently does this, but it might be possible to lookup the protocol
in offload_base and call the appropriate segmenation function
directly).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
David S. Miller
f44d61cdd3 Merge branch 'bpf-next'
Alexei Starovoitov says:

====================
bpf: add search pruning optimization and tests

patch #1 commit log explains why eBPF verifier has to examine some
instructions multiple times and describes the search pruning optimization
that improves verification speed for branchy programs and allows more
complex programs to be verified successfully.
This patch completes the core verifier logic.

patch #2 adds more verifier tests related to branches and search pruning

I'm still working on Andy's 'bitmask for stack slots' suggestion. It will be
done on top of this patch.

The current verifier algorithm is brute force depth first search with
state pruning. If anyone can come up with another algorithm that demonstrates
better results, we'll replace the algorithm without affecting user space.

Note verifier doesn't guarantee that all possible valid programs are accepted.
Overly complex programs may still be rejected.
Verifier improvements/optimizations will guarantee that if a program
was passing verification in the past, it will still be passing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:46 -04:00
Alexei Starovoitov
fd10c2ef3e bpf: add tests to verifier testsuite
add 4 extra tests to cover jump verification better

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:33 -04:00
Alexei Starovoitov
f1bca824da bpf: add search pruning optimization to verifier
consider C program represented in eBPF:
int filter(int arg)
{
    int a, b, c, *ptr;

    if (arg == 1)
        ptr = &a;
    else if (arg == 2)
        ptr = &b;
    else
        ptr = &c;

    *ptr = 0;
    return 0;
}
eBPF verifier has to follow all possible paths through the program
to recognize that '*ptr = 0' instruction would be safe to execute
in all situations.
It's doing it by picking a path towards the end and observes changes
to registers and stack at every insn until it reaches bpf_exit.
Then it comes back to one of the previous branches and goes towards
the end again with potentially different values in registers.
When program has a lot of branches, the number of possible combinations
of branches is huge, so verifer has a hard limit of walking no more
than 32k instructions. This limit can be reached and complex (but valid)
programs could be rejected. Therefore it's important to recognize equivalent
verifier states to prune this depth first search.

Basic idea can be illustrated by the program (where .. are some eBPF insns):
    1: ..
    2: if (rX == rY) goto 4
    3: ..
    4: ..
    5: ..
    6: bpf_exit
In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
Since insn#2 is a branch the verifier will remember its state in verifier stack
to come back to it later.
Since insn#4 is marked as 'branch target', the verifier will remember its state
in explored_states[4] linked list.
Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
will continue.
Without search pruning optimization verifier would have to walk 4, 5, 6 again,
effectively simulating execution of insns 1, 2, 4, 5, 6
With search pruning it will check whether state at #4 after jumping from #2
is equivalent to one recorded in explored_states[4] during first pass.
If there is an equivalent state, verifier can prune the search at #4 and declare
this path to be safe as well.
In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
and 1, 2, 4 insns produces equivalent registers and stack.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:33 -04:00
Nimrod Andy
1b7bde6d65 net: fec: implement rx_copybreak to improve rx performance
- Copy short frames and keep the buffers mapped, re-allocate skb instead of
  memory copy for long frames.
- Add support for setting/getting rx_copybreak using generic ethtool tunable

Changes V3:
* As Eric Dumazet's suggestion that removing the copybreak module parameter
  and only keep the ethtool API support for rx_copybreak.

Changes V2:
* Implements rx_copybreak
* Rx_copybreak provides module parameter to change this value
* Add tunable_ops support for rx_copybreak

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:28:21 -04:00
Eric Dumazet
ce1a4ea3f1 net: avoid one atomic operation in skb_clone()
Fast clone cloning can actually avoid an atomic_inc(), if we
guarantee prior clone_ref value is 1.

This requires a change kfree_skbmem(), to perform the
atomic_dec_and_test() on clone_ref before setting fclone to
SKB_FCLONE_UNAVAILABLE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:27:23 -04:00
Fabian Frederick
e500f488c2 net/dccp/ccid.c: add __init to ccid_activate
ccid_activate is only called by __init ccid_initialize_builtins in same module.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 18:33:13 -04:00
Fabian Frederick
0c5b8a4629 net/dccp/proto.c: add __init to dccp_mib_init
dccp_mib_init is only called by __init dccp_init in same module.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 18:33:13 -04:00
David S. Miller
0754476419 Merge branch 'r8152'
Hayes Wang says:

====================
r8152: patches about firmware

The patches fix the issues when the firmware exists.

For the multiple OS, the firmware may be loaded by the
driver of the other OS. And the Linux driver has influences
on it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:46:41 -04:00
hayeswang
49be17235c r8152: disable power cut for RTL8153
The firmware would be clear when the power cut is enabled for
RTL8153.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:46:34 -04:00
hayeswang
204c870412 r8152: remove clearing bp
The xxx_clear_bp() is used to halt the firmware. It only necessary
for updating the new firmware. Besides, depend on the version of
the current firmware, it may have problem to halt the firmware
directly. Finally, halt the firmware would let the firmware code
useless, and the bugs which are fixed by the firmware would occur.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:46:34 -04:00
Vlad Yasevich
1b0ecb28b0 bnx2: Correctly receive full sized 802.1ad fragmes
This driver, similar to tg3, has a check that will
cause full sized 802.1ad frames to be dropped.  The
frame will be larger then the standard mtu due to the
presense of vlan header that has not been stripped.
The driver should not drop this frame and should process
it just like it does for 802.1q.

CC: Sony Chacko <sony.chacko@qlogic.com>
CC: Dept-HSGLinuxNICDev@qlogic.com
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:43:45 -04:00
Vlad Yasevich
7d3083ee36 tg3: Allow for recieve of full-size 8021AD frames
When receiving a vlan-tagged frame that still contains
a vlan header, the length of the packet will be greater
then MTU+ETH_HLEN since it will account of the extra
vlan header.  TG3 checks this for the case for 802.1Q,
but not for 802.1ad.  As a result, full sized 802.1ad
frames get dropped by the card.

Add a check for 802.1ad protocol when receving full
sized frames.

Suggested-by: Prashant Sreedharan <prashant@broadcom.com>
CC: Prashant Sreedharan <prashant@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:43:45 -04:00
Florian Westphal
1e91887685 r8169: add support for Byte Queue Limits
tested on RTL8168d/8111d model using 'super_netperf 40' with TCP/UDP_STREAM.

Output of
while true; do
    for n in inflight limit; do
          echo -n $n\ ; cat $n;
    done;
    sleep 1;
done

during netperf run, 100mbit peer:

inflight 0
limit 3028
inflight 6056
limit 4542

[ trimmed output for brevity, no limit/inflight changes during
  test steady-state ]

limit 4542
inflight 3028
limit 6122
inflight 0
limit 6122
[ changed cable to 1gbit peer, restart netperf ]
inflight 37850
limit 36336
inflight 33308
limit 31794
inflight 33308
limit 31794
inflight 27252
limit 25738
[ again, no changes during test ]
inflight 27252
limit 25738
inflight 0
limit 28766
[ change cable to 100mbit peer, restart netperf ]
limit 28766
inflight 27370
limit 28766
inflight 4542
limit 5990
inflight 6056
limit 4542
[ .. ]
inflight 6056
limit 4542
inflight 0

[end of test]

Cc: Francois Romieu <romieu@fr.zoreil.com>
Cc: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:35:43 -04:00
Eric Dumazet
d0bf4a9e92 net: cleanup and document skb fclone layout
Lets use a proper structure to clearly document and implement
skb fast clones.

Then, we might experiment more easily alternative layouts.

This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
to stop leaking of implementation details.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:34:25 -04:00
Yuchung Cheng
b248230c34 tcp: abort orphan sockets stalling on zero window probes
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).

But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html

This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.

In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:27:52 -04:00
Linus Torvalds
a44f867247 Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
 "This fixes a data corruption bug introduced by the v3.16 xdr encoding
  rewrite.  I haven't managed to reproduce it myself yet, but it's
  apparently not hard to hit given the right workload"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix corruption of NFSv4 read data
2014-10-01 13:22:00 -07:00
Fabian Frederick
cb57659a15 cipso: add __init to cipso_v4_cache_init
cipso_v4_cache_init is only called by __init cipso_v4_init

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:46:20 -04:00
Fabian Frederick
57a02c39c1 inet: frags: add __init to ip4_frags_ctl_register
ip4_frags_ctl_register is only called by __init ipfrag_init

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:46:19 -04:00
Fabian Frederick
47d7a88c18 tcp: add __init to tcp_init_mem
tcp_init_mem is only called by __init tcp_init.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:41:14 -04:00
Chun-Hao Lin
ee7a1beb97 r8169:call "rtl8168_driver_start" "rtl8168_driver_stop" only when hardware dash function is enabled
These two functions are used to inform dash firmware that driver is been
brought up or brought down. So call these two functions only when hardware dash
function is enabled.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:18 -04:00
Chun-Hao Lin
2a9b4d9670 r8169:modify the behavior of function "rtl8168_oob_notify"
In function "rtl8168_oob_notify", using function "rtl_eri_write" to access
eri register 0xe8, instead of using MAC register "ERIDR" and "ERIAR" to
access it.

For using function "rtl_eri_write" in function "rtl8168_oob_notify", need to
move down "rtl8168_oob_notify" related functions under the function
"rtl_eri_write".

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:18 -04:00
Chun-Hao Lin
2f8c040ce6 r8169:change the name of function "r8168dp_check_dash" to "r8168_check_dash"
DASH function not only RTL8168DP can support, but also RTL8168EP.
So change the name of function "r8168dp_check_dash" to "r8168_check_dash".

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:18 -04:00
Chun-Hao Lin
706123d06c r8169:change the name of function"rtl_w1w0_eri"
Change the name of function "rtl_w1w0_eri" to "rtl_w0w1_eri".

In this function, the local variable "val" is "write zeros then write ones".
Please see below code.

(val & ~m) | p

In this patch, change the function name from "xx_w1w0_xx" to "xx_w0w1_xx".
The changed function name is more suitable for it's behavior.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:18 -04:00
Chun-Hao Lin
7656442824 r8169:for function "rtl_w1w0_phy" change its name and behavior
Change function name from "rtl_w1w0_phy" to "rtl_w0w1_phy".
And its behavior from "write ones then write zeros" to
"write zeros then write ones".

In Realtek internal driver, bitwise operations are almost "write zeros then
write ones". For easy to port hardware parameters from Realtek internal driver
to Linux kernal driver "r8169", we would like to change this function's
behavior and its name.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00
Chun-Hao Lin
ac85bcdbc0 r8169:add more chips to support magic packet v2
For RTL8168F RTL8168FB RTL8168G RTL8168GU RTL8411 RTL8411B RTL8402 RTL8107E,
the magic packet enable bit is changed to eri 0xde bit0.

In this patch, change magic packet enable bit of these chips to eri 0xde bit0.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00
Chun-Hao Lin
89cceb2729 r8169:add support more chips to get mac address from backup mac address register
RTL8168FB RTL8168G RTL8168GU RTL8411 RTL8411B RTL8106EUS RTL8402 can
support get mac address from backup mac address register.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00
Chun-Hao Lin
42fde73710 r8169:add disable/enable RTL8411B pll function
RTL8411B can support disable/enable pll function.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00
Chun-Hao Lin
b8e5e6ad71 r8169:add disable/enable RTL8168G pll function
RTL8168G also can disable/enable pll function.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00
Chun-Hao Lin
05b9687bb3 r8169:change uppercase number to lowercase number
Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 15:33:17 -04:00