While running TCP tests involving netem storing millions of packets,
I had the idea to speed up tfifo_reset() and did experiments.
I tried the rbtree_postorder_for_each_entry_safe() method that is
used in skb_rbtree_purge() but discovered it was slower than the
current tfifo_reset() method.
I measured time taken to release skbs with three occupation levels :
10^4, 10^5 and 10^6 skbs with three methods :
1) (current 'naive' method)
while ((p = rb_first(&q->t_root))) {
struct sk_buff *skb = netem_rb_to_skb(p);
rb_erase(p, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
2) Use rb_next() instead of rb_first() in the loop :
p = rb_first(&q->t_root);
while (p) {
struct sk_buff *skb = netem_rb_to_skb(p);
p = rb_next(p);
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
3) "optimized" method using rbtree_postorder_for_each_entry_safe()
struct sk_buff *skb, *next;
rbtree_postorder_for_each_entry_safe(skb, next,
&q->t_root, rbnode) {
rtnl_kfree_skbs(skb, skb);
}
q->t_root = RB_ROOT;
Results :
method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 10000 skbs in 541846 ns (54 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)
method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 100000 skbs in 5942456 ns (59 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)
method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 1000000 skbs in 82619635 ns (82 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)
Method 2) is simply faster, probably because it maintains a smaller
working size set.
Note that this is the method we use in tcp_ofo_queue() already.
I will also change skb_rbtree_purge() in a second patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: c15ab236d6 ("net/sched: Change cls_flower to use IDR")
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If real-time or fair-share curves are enabled in hfsc_change_class()
class isn't inserted into rb-trees yet. Thus init_ed() and init_vf()
must be called in place of update_ed() and update_vf().
Remove isn't required because for now curves cannot be disabled.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKB stored in qdisc->gso_skb also counted into backlog.
Some qdiscs don't reset backlog to zero in ->reset(),
for example sfq just dequeue and free all queued skb.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: David S. Miller <davem@davemloft.net>
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller no longer needs to wait for a grace period. So this
patch gets rid of it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).
Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.
This saves some overhead in both TCP and netem.
v2: removes the swtstamp field from struct tcp_skb_cb
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HTB qdisc overlimits counter is properly increased, but we have no per
class counter, meaning it is difficult to diagnose HTB problems.
This patch adds this counter, visible in "tc -s class show dev eth0",
with current iproute2.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pointer tcm is being initialized and is never read, it is only being used
to determine the size of struct tcmsg. Clean this up by removing
variable tcm and explicitly using the sizeof struct tcmsg rather than *tcm.
Cleans up clang warning:
warning: Value stored to 'tcm' during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this script, edited from Linux Advanced Routing and Traffic Control guide
tc q a dev en0 root handle 1: htb default a
tc c a dev en0 parent 1: classid 1:1 htb rate 6mbit burst 15k
tc c a dev en0 parent 1:1 classid 1:a htb rate 5mbit ceil 6mbit burst 15k
tc c a dev en0 parent 1:1 classid 1:b htb rate 1mbit ceil 6mbit burst 15k
tc f a dev en0 parent 1:0 prio 1 $clsname $clsargs classid 1:b
ping $address -c1
tc -s c s dev en0
classifies traffic to 1:b or 1:a, depending on whether the packet matches
or not the pattern $clsargs of filter $clsname. However, when $clsname is
'matchall', a systematic crash can be observed in htb_classify(). HTB and
classful qdiscs don't assign initial value to struct tcf_result, but then
they expect it to contain valid values after filters have been run. Thus,
current 'matchall' ignores the TCA_MATCHALL_CLASSID attribute, configured
by user, and makes HTB (and classful qdiscs) dereference random pointers.
By assigning head->res to *res in mall_classify(), before the actions are
invoked, we fix this crash and enable TCA_MATCHALL_CLASSID functionality,
that had no effect on 'matchall' classifier since its first introduction.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1460213
Reported-by: Jiri Benc <jbenc@redhat.com>
Fixes: b87f7936a9 ("net/sched: introduce Match-all classifier")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent commit d7fb60b9ca ("net_sched: get rid of tcfa_rcu") removed
freeing in call_rcu, which changed already existing hard-to-hit
race condition into 100% hit:
[ 598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 598.607782] IP: tcf_action_destroy+0xc0/0x140
Or:
[ 40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 40.862840] IP: tcf_generic_walker+0x534/0x820
Fix this by storing the ops and use them directly for module_put call.
Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed out by Jiri, there is still a race condition between
tcf_block_put() and tcf_chain_destroy() in a RCU callback. There
is no way to make it correct without proper locking or synchronization,
because both operate on a shared list.
Locking is hard, because the only lock we can pick here is a spinlock,
however, in tc_dump_tfilter() we iterate this list with a sleeping
function called (tcf_chain_dump()), which makes using a lock to protect
chain_list almost impossible.
Jiri suggested the idea of holding a refcnt before flushing, this works
because it guarantees us there would be no parallel tcf_chain_destroy()
during the loop, therefore the race condition is gone. But we have to
be very careful with proper synchronization with RCU callbacks.
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following ugliness of tc filter chain refcnt:
a) tp proto should hold a refcnt to the chain too. This significantly
simplifies the logic.
b) Chain 0 is no longer special, it is created with refcnt=1 like any
other chains. All the ugliness in tcf_chain_put() can be gone!
c) No need to handle the flushing oddly, because block still holds
chain 0, it can not be released, this guarantees block is the last
user.
d) The race condition with RCU callbacks is easier to handle with just
a rcu_barrier(). Much easier to understand, nothing to hide. Thanks
to the previous patch. Please see also the comments in code.
e) Make the code understandable by humans, much less error-prone.
Fixes: 744a4cf63e ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller is no longer needed to wait for a grace period.
So this patch gets rid of it.
This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().
Please see also the comments in code.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes sparse warning about pointer in gen_handle:
net/sched/cls_rsvp.h:392:40: warning: Using plain integer as NULL pointer
Fixes: 8113c09567 ("net_sched: use void pointer for filter handle")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a memleak happening for chain 0. The thing is, chain 0 needs to
be always present, not created on demand. Therefore tcf_block_get upon
creation of block calls the tcf_chain_create function directly. The
chain is created with refcnt == 1, which is not correct in this case and
causes the memleak. So move the refcnt increment into tcf_chain_get
function even for the case when chain needs to be created.
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. There are two points
left in sch_prio and sch_qfq which are not changed with that commit. Now
enhance them now with __qdisc_drop.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC filters when used as classifiers are bound to TC classes.
However, there is a hidden difference when adding them in different
orders:
1. If we add tc classes before its filters, everything is fine.
Logically, the classes exist before we specify their ID's in
filters, it is easy to bind them together, just as in the current
code base.
2. If we add tc filters before the tc classes they bind, we have to
do dynamic lookup in fast path. What's worse, this happens all
the time not just once, because on fast path tcf_result is passed
on stack, there is no way to propagate back to the one in tc filters.
This hidden difference hurts performance silently if we have many tc
classes in hierarchy.
This patch intends to close this gap by doing the reverse binding when
we create a new class, in this case we can actually search all the
filters in its parent, match and fixup by classid. And because
tcf_result is specific to each type of tc filter, we have to introduce
a new ops for each filter to tell how to bind the class.
Note, we still can NOT totally get rid of those class lookup in
->enqueue() because cgroup and flow filters have no way to determine
the classid at setup time, they still have to go through dynamic lookup.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently only a memory allocation failure can lead to this, so let's
initialize the timer first.
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is very unlikely to happen but the backlogs memory allocation
could fail and will free q->flows, but then ->destroy() will free
q->flows too. For correctness remove the first free and let ->destroy
clean up.
Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depending on where ->init fails we can get a null pointer deref due to
uninitialized hires timer (watchdog) or a double free of the qdisc hash
because it is already freed by ->destroy().
Fixes: 8d55373875 ("net/sched/hfsc: allocate tcf block for hfsc root class")
Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.
But please note that the test result is based on previous patch that
cls_flower uses IDR.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, all filters with the same priority are linked in a doubly
linked list. Every filter should have a unique handle. To make the
handle unique, we need to iterate the list every time to see if the
handle exists or not when inserting a new filter. It is time-consuming.
For example, it takes about 5m3.169s to insert 64K rules.
This patch changes cls_flower to use IDR. With this patch, it
takes about 0m1.127s to insert 64K rules. The improvement is huge.
But please note that in this testing, all filters share the same action.
If every filter has a unique action, that is another bottleneck.
Follow-up patch in this patchset addresses that.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch handles a default IFE type if it's not given by user space
netlink api. The default IFE type will be the registered ethertype by
IEEE for IFE ForCES.
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. But there are
some points which are not changed in SFQ enqueue operation.
1. Fail to find the SFQ hash slot;
2. When the queue is full;
Now use qdisc_drop instead free skb directly.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is ugly to hide a u32-filter-specific pointer inside Qdisc,
this breaks the TC layers:
1. Qdisc is a generic representation, should not have any specific
data of any type
2. Qdisc layer is above filter layer, should only save filters in
the list of struct tcf_proto.
This pointer is used as the head of the chain of u32 hash tables,
that is struct tc_u_hnode, because u32 filter is very special,
it allows to create multiple hash tables within one qdisc and
across multiple u32 filters.
Instead of using this ugly pointer, we can just save it in a global
hash table key'ed by (dev ifindex, qdisc handle), therefore we can
still treat it as a per qdisc basis data structure conceptually.
Of course, because of network namespaces, this key is not unique
at all, but it is fine as we already have a pointer to Qdisc in
struct tc_u_common, we can just compare the pointers when collision.
And this only affects slow paths, has no impact to fast path,
thanks to the pointer ->tp_c.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:
1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.
2) For filter bindiing/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.
3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.
Therefore, this patch removes ->get() and ->put(), but:
1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.
2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.
For those who also use ->put() as ->unbind(), just rename them to reflect
this change.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like for TC actions, ->delete() is a special case,
we have to prepare and fill the notification before delete
otherwise would get use-after-free after we remove the
reference count.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not needed if we move them up properly.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller reported a refcount_t warning [1]
Issue here is that noop_qdisc refcnt was never really considered as
a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set :
if (qdisc->flags & TCQ_F_BUILTIN ||
!refcount_dec_and_test(&qdisc->refcnt)))
return;
Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
really needed, but harmless until refcount_t came.
To fix this problem, we simply need to not increment noop_qdisc.refcnt,
since we never decrement it.
[1]
refcount_t: increment on 0; use-after-free.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
panic+0x1e4/0x417 kernel/panic.c:180
__warn+0x1c4/0x1d9 kernel/panic.c:541
report_bug+0x211/0x2d0 lib/bug.c:183
fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282
RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000
RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8
RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0
R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000
attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
__dev_open+0x227/0x330 net/core/dev.c:1380
__dev_change_flags+0x695/0x990 net/core/dev.c:6726
dev_change_flags+0x88/0x140 net/core/dev.c:6792
dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
sock_do_ioctl+0x94/0xb0 net/socket.c:968
sock_ioctl+0x2c2/0x440 net/socket.c:1058
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
Fixes: 7b93640502 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Reshetova, Elena <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_chain_flush needs to be called with RTNL. However, on
free_tcf->
tcf_action_goto_chain_fini->
tcf_chain_put->
tcf_chain_destroy->
tcf_chain_flush
callpath, it is called without RTNL.
This issue was notified by following warning:
[ 155.599052] WARNING: suspicious RCU usage
[ 155.603165] 4.13.0-rc5jiri+ #54 Not tainted
[ 155.607456] -----------------------------
[ 155.611561] net/sched/cls_api.c:195 suspicious rcu_dereference_protected() usage!
Since on this callpath, the chain is guaranteed to be already empty
by check in tcf_chain_put, move the tcf_chain_flush call out and call it
only where it is needed - into tcf_block_put.
Fixes: db50514f9a ("net: sched: add termination action to allow goto chain")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goto_chain termination action takes a reference of a chain. In that
case, there is an issue when block_put is called tcf_chain_destroy
directly. The follo-up call of tcf_chain_put by goto_chain action free
works with memory that is already freed. This was caught by kasan:
[ 220.337908] BUG: KASAN: use-after-free in tcf_chain_put+0x1b/0x50
[ 220.344103] Read of size 4 at addr ffff88036d1f2cec by task systemd-journal/261
[ 220.353047] CPU: 0 PID: 261 Comm: systemd-journal Not tainted 4.13.0-rc5jiri+ #54
[ 220.360661] Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[ 220.371784] Call Trace:
[ 220.374290] <IRQ>
[ 220.376355] dump_stack+0xd5/0x150
[ 220.391485] print_address_description+0x86/0x410
[ 220.396308] kasan_report+0x181/0x4c0
[ 220.415211] tcf_chain_put+0x1b/0x50
[ 220.418949] free_tcf+0x95/0xc0
So allow tcf_chain_destroy to be called multiple times, free only in
case the reference count drops to 0.
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
High order GFP_KERNEL allocations can stress the host badly.
Use modern kvmalloc_array()/kvfree() instead of custom
allocations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we know in some target's checkentry it may dereference par.entryinfo
to check entry stuff inside. But when sched action calls xt_check_target,
par.entryinfo is set with NULL. It would cause kernel panic when calling
some targets.
It can be reproduce with:
# tc qd add dev eth1 ingress handle ffff:
# tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
-j ECN --ecn-tcp-remove
It could also crash kernel when using target CLUSTERIP or TPROXY.
By now there's no proper value for par.entryinfo in ipt_init_target,
but it can not be set with NULL. This patch is to void all these
panics by setting it with an ipt_entry obj with all members = 0.
Note that this issue has been there since the very beginning.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dereference before check is wrong and leads to an oops when
p_filter_chain is NULL. The check needs to be done on the pointer to
prevent NULL dereference.
Fixes: f93e1cdcf4 ("net/sched: fix filter flushing")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I made a stupid mistake using TC_CLSFLOWER_STATS instead of
TC_SETUP_CLSFLOWER. Funny thing is that both are defined as "2" so it
actually did not cause any harm. Anyway, fixing it now.
Fixes: 2572ac53c4 ("net: sched: make type an argument for ndo_setup_tc")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main purpose of this tracepoint is to monitor bulk dequeue
in the network qdisc layer, as it cannot be deducted from the
existing qdisc stats.
The txq_state can be used for determining the reason for zero packet
dequeues, see enum netdev_queue_state_t.
Notice all packets doesn't necessary activate this tracepoint. As
qdiscs with flag TCQ_F_CAN_BYPASS, can directly invoke
sch_direct_xmit() when qdisc_qlen is zero.
Remember that perf record supports filters like:
perf record -e qdisc:qdisc_dequeue \
--filter 'ifindex == 4 && (packets > 1 || txq_state > 0)'
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any move comment abount update_vf() into right place.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
This callback is used for deactivating class in parent qdisc.
This is cheaper to test queue length right here.
Also this allows to catch draining screwed backlog and prevent
second deactivation of already inactive parent class which will
crash kernel for sure. Kernel with print warning at destruction
of child qdisc where no packets but backlog is not zero.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was added in commit e57a784d8c ("pkt_sched: set root qdisc
before change() in attach_default_qdiscs()") to hide duplicates
from "tc qdisc show" for incative deivices.
After 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
it triggered when classful qdisc is added to inactive device because
default qdiscs are added before switching root qdisc.
Anyway after commit ea32746953 ("net: sched: avoid duplicates in
qdisc dump") duplicates are filtered right in dumper.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>