linux_dsm_epyc7002/net
Daniel Borkmann d936377414 net, sched: respect rcu grace period on cls destruction
Roi reported a crash in flower where tp->root was NULL in ->classify()
callbacks. The reason is that in ->destroy() tp->root is set to NULL via
RCU_INIT_POINTER(). This is problematic for some of the classifiers, because
it doesn't respect the RCU grace period for them, and as a result, still
outstanding readers from tc_classify() will try to blindly dereference
a NULL tp->root.
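
The race can be illustrated with a toy, single-threaded userspace model. Everything below is hypothetical. toy_read_lock(), toy_free_deferred() and toy_grace_period() only mimic rcu_read_lock(), kfree_rcu() and a grace period, which here simply cannot complete while the reader count is non-zero. The reader/writer interleaving is replayed by hand instead of using real threads.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy, single-threaded model of RCU deferred reclamation.
 * All names are hypothetical; this is NOT the kernel RCU API. */

#define MAX_PENDING 8

struct head { int nfilters; };          /* stands in for tp->root */
struct tp   { struct head *root; };     /* stands in for struct tcf_proto */

static void *pending[MAX_PENDING];      /* frees waiting for a grace period */
static int npending;
static int readers;                     /* readers inside a read-side section */

static void toy_read_lock(void)   { readers++; }
static void toy_read_unlock(void) { readers--; }

/* Analogue of kfree_rcu(): queue the object instead of freeing it now. */
static void toy_free_deferred(void *p)
{
    assert(npending < MAX_PENDING);
    pending[npending++] = p;
}

/* A grace period only completes once no reader is left; returns 1 if it did. */
static int toy_grace_period(void)
{
    if (readers)
        return 0;
    for (int i = 0; i < npending; i++)
        free(pending[i]);
    npending = 0;
    return 1;
}

/* Old ->destroy(): NULLs the pointer at once, like RCU_INIT_POINTER(tp->root, NULL). */
static void destroy_nulling(struct tp *tp)
{
    struct head *h = tp->root;
    tp->root = NULL;
    toy_free_deferred(h);
}

/* Fixed ->destroy(): leave tp->root alone and only defer the free. */
static void destroy_deferred(struct tp *tp)
{
    toy_free_deferred(tp->root);
}

/* Fast path without a NULL check: the model returns -1 where the
 * real code would dereference NULL. */
static int classify(const struct tp *tp)
{
    const struct head *h = tp->root;
    return h ? h->nfilters : -1;
}

/* A reader that already found tp races with a writer destroying it. */
static int race(void (*destroy)(struct tp *))
{
    struct tp *tp = malloc(sizeof(*tp));
    assert(tp);
    tp->root = malloc(sizeof(*tp->root));
    assert(tp->root);
    tp->root->nfilters = 3;

    toy_read_lock();                 /* reader found tp in the chain */
    destroy(tp);                     /* writer runs ->destroy() concurrently */
    toy_free_deferred(tp);           /* tp itself is also freed after a grace period */
    int ret = classify(tp);          /* reader still uses tp and tp->root */
    int early = toy_grace_period();  /* must fail: the reader is still active */
    toy_read_unlock();
    toy_grace_period();              /* now both objects are really freed */
    return early ? -2 : ret;
}
```

Calling race(destroy_nulling) returns -1 (the reader found a NULL head), while race(destroy_deferred) returns 3: the reader keeps seeing the old head until it leaves its read-side section, and only then are both objects freed.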

The tp->root object is strictly private to the classifier implementation
and holds internal data that the core, such as tc_ctl_tfilter(), doesn't know
about. Within some classifiers, such as cls_bpf, cls_basic, etc., tp->root
is only checked for NULL in the ->get() callback, but nowhere else. This is
misleading and seems to have been copied from old classifier code that was not
cleaned up properly. For example, d3fa76ee6b ("[NET_SCHED]: cls_basic:
fix NULL pointer dereference") moved tp->root initialization into the ->init()
routine, where before it was part of ->change(). Back then ->get() had to deal
with tp->root being NULL, so that was indeed a valid case; after
d3fa76ee6b, it no longer is. We used to set tp->root to NULL long
ago in ->destroy(), see 47a1a1d4be ("pkt_sched: remove unnecessary xchg()
in packet classifiers"); the NULLifying was then reintroduced with the
RCUification, but it's not correct for every classifier implementation.

In the cases fixed here, with the one exception of cls_cgroup, the tp->root
object is allocated and initialized inside the ->init() callback, which is always
performed after we allocate a new tp, i.e. at a point where tp, and
thus tp->root, is not yet globally visible in the tp chain (see tc_ctl_tfilter()).
Also, on destruction tp->root is strictly kfree_rcu()'ed in the ->destroy()
handler, and so is the tp, which is kfree_rcu()'ed right after we return
from ->destroy() in tcf_destroy(). This means that the head object's lifetime
for such classifiers is always tied to the tp lifetime. The RCU callbacks
for the two kfree_rcu() calls could be invoked out of order, but that's fine
since both are independent.
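
A similar toy model can sketch the lifetime argument. It is plain userspace C with hypothetical names and no real RCU. ->init() fills in tp->root before the tp is linked into the chain, so any tp a reader can reach already has a head. The two frees never read each other's object, so their order does not matter.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model (hypothetical names, not kernel code) of the lifetime rules:
 * tp->root is set up by ->init() before tp is linked into the chain,
 * and both objects are released by independent deferred frees. */

struct head { int nfilters; };
struct tp   { struct head *root; struct tp *next; };

static struct tp *chain;                /* what readers walk in the fast path */

/* ->init(): allocate the private head while tp is still invisible. */
static int toy_init(struct tp *tp)
{
    tp->root = calloc(1, sizeof(*tp->root));
    return tp->root ? 0 : -1;
}

/* Allocate, init, and only then publish (a rcu_assign_pointer()-style
 * publish in the kernel; a plain store in this single-threaded toy). */
static struct tp *toy_create_and_publish(void)
{
    struct tp *tp = calloc(1, sizeof(*tp));
    if (!tp)
        return NULL;
    if (toy_init(tp)) {
        free(tp);
        return NULL;
    }
    tp->next = chain;
    chain = tp;
    return tp;
}

/* Invariant: every tp a reader can reach already has a head. */
static int all_published_have_root(void)
{
    for (struct tp *tp = chain; tp; tp = tp->next)
        if (!tp->root)
            return 0;
    return 1;
}

/* Unlink the first tp, then (after a grace period, elided here) free both
 * objects in either order; both pointers are captured before any free. */
static void toy_destroy_first(int reverse)
{
    struct tp *tp = chain;
    assert(tp);
    void *objs[2] = { tp->root, tp };
    chain = tp->next;                   /* unlink: readers can no longer find tp */
    for (int i = 0; i < 2; i++)
        free(objs[reverse ? 1 - i : i]);
}
```

After two toy_create_and_publish() calls, all_published_have_root() holds. toy_destroy_first() is safe with either free order because neither free dereferences the other object.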

Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers
means that 1) we don't need a useless NULL check in the fast path, and 2)
outstanding readers of that tp in tc_classify() can still execute while
respecting the RCU grace period, as is actually expected.

Things that haven't been touched here: cls_fw and cls_route. They each
handle tp->root being NULL in the ->classify() path for historic reasons, so
their ->destroy() implementation can stay as is. If someone actually
cares, they could get cleaned up at some point to avoid the test in the fast
path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
!head check should anyone actually be using/testing it, so it at least aligns with
cls_fw and cls_route. For cls_flower, we additionally need to defer rhashtable
destruction (to a sleepable context) until after the RCU grace period, as concurrent
readers might still access it. (Note that in this case we need to hold a module
reference to keep the work callback address intact, since on module unload
we only wait for all call_rcu()s to finish.)
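
The cls_flower teardown involves two deferral stages. The sketch below is a hypothetical, single-threaded userspace model rather than the real cls_flower code. end_grace_period() stands in for the RCU callback running in non-sleepable context, run_workqueue() for a worker thread, and module_refs for the module reference count. The assert in table_destroy() encodes the rule that the table may only be torn down where sleeping is allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model (hypothetical names, not the kernel API) of the teardown
 * chain: ->destroy() -> RCU callback -> work item -> sleepable destruction. */

struct table { int nentries; };          /* stands in for the rhashtable */
struct head  { struct table *ht; };

static bool in_atomic_ctx;               /* true while the "RCU callback" runs */
static int module_refs;                  /* models a module get/put count */
static struct head *rcu_item;            /* one pending RCU callback */
static struct head *work_item;           /* one pending work item */

/* Destroying the table may sleep, so it must not run in atomic context. */
static void table_destroy(struct table *t)
{
    assert(!in_atomic_ctx);
    free(t);
}

/* Worker context: sleeping is allowed here. */
static void work_fn(struct head *h)
{
    table_destroy(h->ht);
    free(h);
    module_refs--;                       /* work_fn's code is no longer needed */
}

/* RCU callback: runs after the grace period but cannot sleep, so defer again. */
static void rcu_cb(struct head *h)
{
    work_item = h;                       /* kernel: schedule a work item */
}

/* ->destroy(): pin the module until work_fn() has run. */
static void flower_destroy(struct head *h)
{
    module_refs++;
    rcu_item = h;                        /* kernel: call_rcu() */
}

/* Drive the two deferred stages by hand. */
static void end_grace_period(void)
{
    if (!rcu_item)
        return;
    in_atomic_ctx = true;
    rcu_cb(rcu_item);
    in_atomic_ctx = false;
    rcu_item = NULL;
}

static void run_workqueue(void)
{
    if (!work_item)
        return;
    struct head *h = work_item;
    work_item = NULL;
    work_fn(h);
}
```

Between end_grace_period() and run_workqueue(), the work item is still pending and module_refs is still 1. In that window the module must stay loaded, because waiting for outstanding call_rcu() callbacks on unload does not cover work that those callbacks queued.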

This fixes one race to bring the RCU grace period guarantees back. The next step,
as worked on by Cong, is to fix 1e052be69d ("net_sched: destroy
proto tp when all filters are gone") to get the order of unlinking the tp
in tc_ctl_tfilter() right for the RTM_DELTFILTER case by moving
RCU_INIT_POINTER() before tcf_destroy() and letting the notification for
removal be done through the prior ->delete() callback. Both are independent
issues. Once we have that right, we can then clean up tp->root for a number
of classifiers by no longer making it an RCU pointer, which requires a new callback
(->uninit) that is triggered from tp's RCU callback, where we just kfree()
tp->root.
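
The proposed ->uninit scheme could look roughly like the sketch below, which rests on several assumptions. struct toy_ops, basic_uninit() and tp_free_rcu() are hypothetical. tp_free_rcu() is called directly here, whereas in the kernel it would be tp's RCU callback. Because it only runs after the grace period, tp->root can be a plain pointer and be released with a plain free.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the proposed scheme (hypothetical layout, not merged kernel code):
 * tp's own RCU callback calls ops->uninit(), which plain-frees tp->root,
 * so tp->root does not need to be an RCU pointer. */

struct tp;

struct toy_ops {
    void (*uninit)(struct tp *tp);       /* the proposed new callback */
};

struct head { int nfilters; };

struct tp {
    const struct toy_ops *ops;
    struct head *root;                   /* plain pointer, no __rcu annotation */
};

static int uninit_calls;

/* Per-classifier uninit: the grace period is already over, so a plain free is safe. */
static void basic_uninit(struct tp *tp)
{
    free(tp->root);                      /* kernel: kfree() */
    uninit_calls++;
}

static const struct toy_ops basic_ops = { .uninit = basic_uninit };

/* Would run as tp's RCU callback, once no reader can still see tp. */
static void tp_free_rcu(struct tp *tp)
{
    if (tp->ops->uninit)
        tp->ops->uninit(tp);
    free(tp);
}

static struct tp *tp_new(void)
{
    struct tp *tp = malloc(sizeof(*tp));
    assert(tp);
    tp->ops = &basic_ops;
    tp->root = calloc(1, sizeof(*tp->root));
    assert(tp->root);
    return tp;
}
```

With this layout, tearing down a tp is a single deferred step: tp_free_rcu(tp_new()) releases the head via basic_uninit() and then the tp itself.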

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 10:47:35 -05:00
6lowpan
9p
802
8021q net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
appletalk
atm
ax25
batman-adv batman-adv: Detect missing primaryif during tp_send as error 2016-11-04 12:27:39 +01:00
bluetooth Bluetooth: Fix using the correct source address type 2016-11-22 22:50:46 +01:00
bridge
caif
can can: bcm: fix support for CAN FD frames 2016-11-23 15:22:18 +01:00
ceph libceph: initialize last_linger_id with a large integer 2016-11-10 20:13:08 +01:00
core Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2016-11-27 20:21:48 -05:00
dcb
dccp ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped 2016-11-03 16:50:27 -04:00
decnet
dns_resolver
dsa net: dsa: fix fixed-link-phy device leaks 2016-11-27 20:01:15 -05:00
ethernet net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
hsr
ieee802154
ipv4 udplite: call proper backlog handlers 2016-11-24 15:32:14 -05:00
ipv6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2016-11-27 20:21:48 -05:00
ipx
irda
iucv
kcm
key
l2tp net: revert "net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit" 2016-11-23 20:18:36 -05:00
l3mdev
lapb
llc
mac80211 mac80211: fix A-MSDU aggregation with fast-xmit + txq 2016-11-15 14:37:30 +01:00
mac802154
mpls
ncsi net/ncsi: Improve HNCDSC AEN handler 2016-10-20 11:23:08 -04:00
netfilter netfilter: nf_tables: fix oops when inserting an element into a verdict map 2016-11-08 23:53:39 +01:00
netlabel
netlink genetlink: fix a memory leak on error path 2016-11-03 16:52:29 -04:00
netrom
nfc
openvswitch
packet packet: on direct_xmit, limit tso and csum to supported devices 2016-10-29 15:02:15 -04:00
phonet
qrtr
rds rds: debug messages are enabled by default 2016-10-29 15:55:57 -04:00
rfkill
rose
rxrpc
sched net, sched: respect rcu grace period on cls destruction 2016-11-28 10:47:35 -05:00
sctp sctp: change sk state only when it has assocs in sctp_shutdown 2016-11-14 16:22:33 -05:00
strparser
sunrpc One fix for an NFS/RDMA crash. 2016-11-18 16:32:21 -08:00
switchdev
tipc tipc: fix link statistics counter errors 2016-11-27 20:35:55 -05:00
unix af_unix: conditionally use freezable blocking calls in read 2016-11-18 13:58:39 -05:00
vmw_vsock
wimax
wireless cfg80211: limit scan results cache size 2016-11-18 08:44:44 +01:00
x25
xfrm xfrm: unbreak xfrm_sk_policy_lookup 2016-11-18 07:00:05 +01:00
compat.c
Kconfig
Makefile
socket.c xattr: Fix setting security xattrs on sockfs 2016-11-17 00:00:23 -05:00
sysctl_net.c