Validation of skb can be pretty expensive :
GSO segmentation and/or checksum computations.
We can do this without holding qdisc lock, so that other cpus
can queue additional packets.
Trick is that requeued packets were already validated, so we carry
a boolean so that sch_direct_xmit() can validate a fresh skb list,
or directly use an old one.
Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
host.
Turning TSO on or off had no effect on throughput, only few more cpu
cycles. Lock contention on qdisc lock disappeared.
Same if disabling TX checksum offload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TSO and GSO segmented packets already benefit from bulking
on their own.
The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.
The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7f9 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f2a ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list. And via commit
ce93718fb7 ("net: Don't keep around original SKB when we
software segment GSO frames.").
This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.
Testing:
Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).
Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).
Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms. Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.
Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.
This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().
The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
(1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet. The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
(2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.
One restriction of the new API is that every SKB must belong to the
same TXQ. This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.
Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()). In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit(). In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits. In-effect bulking will only happen for BQL enabled drivers.
Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.
For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets. They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.
Jointed work with Hannes, Daniel and Florian.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
net/netfilter/nfnetlink.c
Both r8152 and nfnetlink conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the following crash:
[ 63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
[ 63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
[ 63.980094] RIP: 0010:[<ffffffff817e6d07>] [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d
[ 63.980094] RSP: 0018:ffff880117dffcc0 EFLAGS: 00010202
[ 63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
[ 63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[ 63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
[ 63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
[ 63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
[ 63.980094] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[ 63.980094] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
[ 63.980094] Stack:
[ 63.980094] ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
[ 63.980094] ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
[ 63.980094] ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
[ 63.980094] Call Trace:
[ 63.980094] [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d
[ 63.980094] [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691
[ 63.980094] [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691
[ 63.980094] [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d
[ 63.980094] [<ffffffff810780a4>] __do_softirq+0x142/0x323
[ 63.980094] [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53
[ 63.980094] [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221
[ 63.980094] [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33
[ 63.980094] [<ffffffff8108e44d>] kthread+0xc9/0xd1
[ 63.980094] [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125
[ 63.980094] [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
[ 63.980094] [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0
[ 63.980094] [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
tp could be freed in call_rcu callback too, the order is not guaranteed.
John Fastabend says:
====================
Its worth noting why this is safe. Any running schedulers will either
read the valid class field or it will be zeroed.
All schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.
====================
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().
The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passnig to user space. By
handling it explicitely we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.
It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to run qdisc's without locking statistics and estimators
need to be handled correctly.
To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.
Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Negated expressions and sub-expressions need to have their flags checked for
TCF_EM_INVERT and their result negated accordingly.
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This completes the cls_rsvp conversion to RCU safe
copy, update semantics.
As a result all cases of tcf_exts_change occur on
empty lists now.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clearly the following change is not expected:
- if (!cp.perfect && !cp.h)
- cp.alloc_hash = cp.hash;
+ if (!cp->perfect && cp->h)
+ cp->alloc_hash = cp->hash;
Fixes: commit 331b72922c ("net: sched: RCU cls_tcindex")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When kmemdup() fails, we should return -ENOMEM.
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While using a MQ + NETEM setup, I had confirmation that the default
timer migration ( /proc/sys/kernel/timer_migration ) is killing us.
Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
queues) :
EST="est 1sec 4sec"
for ETH in eth1
do
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
done
We can see that timers get migrated into a single cpu, presumably idle
at the time timers are set up.
Then all qdisc dequeues run from this cpu and huge lock contention
happens. This single cpu is stuck in softirq mode and cannot dequeue
fast enough.
39.24% [kernel] [k] _raw_spin_lock
2.65% [kernel] [k] netem_enqueue
1.80% [kernel] [k] netem_dequeue
1.63% [kernel] [k] copy_user_enhanced_fast_string
1.45% [kernel] [k] _raw_spin_lock_bh
By pinning qdisc timers on the cpu running the qdisc, we respect proper
XPS setting and remove this lock contention.
5.84% [kernel] [k] netem_enqueue
4.83% [kernel] [k] _raw_spin_lock
2.92% [kernel] [k] copy_user_enhanced_fast_string
Current Qdiscs that benefit from this change are :
netem, cbq, fq, hfsc, tbf, htb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
$ grep CONFIG_CLS_U32_MARK .config
# CONFIG_CLS_U32_MARK is not set
net/sched/cls_u32.c: In function 'u32_change':
net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used
[-Wunused-label]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes to the cls_u32 classifier must appear atomic to the
readers. Before this patch if a change is requested for both
the exts and ifindex, first the ifindex is updated then the
exts with tcf_exts_change(). This opens a small window where
a reader can have a exts chain with an incorrect ifindex. This
violates the the RCU semantics.
Here we resolve this by always passing u32_set_parms() a copy
of the tc_u_knode to work on and then inserting it into the hash
table after the updates have been successfully applied.
Tested with the following short script:
#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
u32 divisor 256
#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
u32 link 1: hashkey mask ffffff00 at 12 \
match ip src 192.168.8.0/2
#tc filter add dev p3p2 parent 8001:0 protocol ip prio 102 \
handle 1::10 u32 classid 1:2 ht 1: \
match ip src 192.168.8.0/8 match ip tos 0x0a 1e
#tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
handle 1::10 u32 classid 1:2 ht 1: \
match ip src 1.1.0.0/8 match ip tos 0x0b 1e
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a missed free_percpu in the unwind code path and when
keys are destroyed.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
or increasing skb->cb[] size.
Commit e0f31d8498 ("flow_keys: Record IP layer protocol in
skb_flow_dissect()") broke IPoIB.
Only current offender is sch_choke, and this one do not need an
absolutely precise flow key.
If we store 17 bytes of flow key, its more than enough. (Its the actual
size of flow_keys if it was a packed structure, but we might add new
fields at the end of it later)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e0f31d8498 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: David S. Miller <davem@davemloft.net>
tc_u32_sel 'sel' in tc_u_knode expects to be the last element in the
structure and pads the structure with tc_u32_key fields for each key.
kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL)
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pfifo_fast and htb use skb lists, without needing their spinlocks.
(They instead use the standard qdisc lock)
We can use __skb_queue_head_init() instead of skb_queue_head_init()
to be consistent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ensures the tcf_exts_init() is called for all cases.
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When allocating a new structure we also need to call tcf_exts_init
to initialize exts.
A follow up patch might be in order to remove some of this code
and do tcf_exts_assign(). With this we could remove the
tcf_exts_init/tcf_exts_change pattern for some of the classifiers.
As part of the future tcf_actions RCU series this will need to be
done. For now fix the call here.
Fixes e35a8ee599 ("net: sched: fw use RCU")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head: 54996b529a
commit: c7953ef230 [625/646] net: sched: cls_cgroup use RCU
net/sched/cls_cgroup.c:130 cls_cgroup_change() warn: possible memory leak of 'new'
net/sched/cls_cgroup.c:135 cls_cgroup_change() warn: possible memory leak of 'new'
net/sched/cls_cgroup.c:139 cls_cgroup_change() warn: possible memory leak of 'new'
Fixes: c7953ef230 ("net: sched: cls_cgroup use RCU")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild test robot reported an unused variable cpu in cls_u32.c
after the patch below. This happens when PERF and MARK config
variables are disabled
Fix this is to use separate variables for perf and mark
and define the cpu variable inside the ifdef logic.
Fixes: 459d5f626d ("net: sched: make cls_u32 per cpu")'
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit 331b72922c ("net: sched: RCU cls_tcindex")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-By: John Fastabend <john.r.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit 331b72922c ("net: sched: RCU cls_tcindex")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit 331b72922c ("net: sched: RCU cls_tcindex")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the cls_bpf classifier RCU safe. The tcf_lock
was being used to protect a list of cls_bpf_prog now this list
is RCU safe and updates occur with rcu_replace.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make cls_u32 classifier safe to run without holding lock. This patch
converts statistics that are kept in read section u32_classify into
per cpu counters.
This patch was tested with a tight u32 filter add/delete loop while
generating traffic with pktgen. By running pktgen on vlan devices
created on top of a physical device we can hit the qdisc layer
correctly. For ingress qdisc's a loopback cable was used.
for i in {1..100}; do
q=`echo $i%8|bc`;
echo -n "u32 tos: iteration $i on queue $q";
tc filter add dev p3p2 parent $p prio $i u32 match ip tos 0x10 0xff \
action skbedit queue_mapping $q;
sleep 1;
tc filter del dev p3p2 prio $i;
echo -n "u32 tos hash table: iteration $i on queue $q";
tc filter add dev p3p2 parent $p protocol ip prio $i handle 628: u32 divisor 1
tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
match ip protocol 17 0xff link 628: offset at 0 mask 0xf00 shift 6 plus 0
tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
ht 628:0 match ip tos 0x10 0xff action skbedit queue_mapping $q
sleep 2;
tc filter del dev p3p2 prio $i
sleep 1;
done
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This uses per cpu counters in cls_u32 in preparation
to convert over to rcu.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make cls_tcindex RCU safe.
This patch addds a new RCU routine rcu_dereference_bh_rtnl() to check
caller either holds the rcu read lock or RTNL. This is needed to
handle the case where tcindex_lookup() is being called in both cases.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCUify the route classifier. For now however spinlock's are used to
protect fastmap cache.
The issue here is the fastmap may be read by one CPU while the
cache is being updated by another. An array of pointers could be
one possible solution.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU'ify fw classifier.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make cgroup classifier safe for RCU.
Also drops the calls in the classify routine that were doing a
rcu_read_lock()/rcu_read_unlock(). If the rcu_read_lock() isn't held
entering this routine we have issues with deleting the classifier
chain so remove the unnecessary rcu_read_lock()/rcu_read_unlock()
pair noting all paths AFAIK hold rcu_read_lock.
If there is a case where classify is called without the rcu read lock
then an rcu splat will occur and we can correct it.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable basic classifier for RCU.
Dereferencing tp->root may look a bit strange here but it is needed
by my accounting because it is allocated at init time and needs to
be kfree'd at destroy time. However because it may be referenced in
the classify() path we must wait an RCU grace period before free'ing
it. We use kfree_rcu() and rcu_ APIs to enforce this. This pattern
is used in all the classifiers.
Also the hgenerator can be incremented without concern because it
is always incremented under RTNL.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rcu'ify tcf_proto this allows calling tc_classify() without holding
any locks. Updaters are protected by RTNL.
This patch prepares the core net_sched infrastracture for running
the classifier/action chains without holding the qdisc lock however
it does nothing to ensure cls_xxx and act_xxx types also work without
locking. Additional patches are required to address the fall out.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.
Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
not used anymore since ddecf0f
(net_sched: sfq: add optional RED on top of SFQ).
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More minor fixes to merge commit 53fda7f7f9 (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.
Fixing exit cases in qdisc_reset() and qdisc_destroy(), where a
leftover requeued SKB (qdisc->gso_skb) can have the potential of
being a skb list, thus use kfree_skb_list().
This is a followup to commit 10770bc2d1 ("qdisc: adjustments for
API allowing skb list xmits").
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor adjustments for merge commit 53fda7f7f9 (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.
Update code doc to function sch_direct_xmit().
In handle_dev_cpu_collision() use kfree_skb_list() in error handling.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just maintain the list properly by returning the head of the remaining
SKB list from dev_hard_start_xmit().
Signed-off-by: David S. Miller <davem@davemloft.net>