linux_dsm_epyc7002/net
Chieh-Min Wang 13f5251fd1 netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
For bridge(br_flood) or broadcast/multicast packets, they could clone
skb with unconfirmed conntrack which break the rule that unconfirmed
skb->_nfct is never shared.  With nfqueue running on my system, the race
can be easily reproduced with following warning calltrace:

[13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
[13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
[13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
[13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
[13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
[13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
[13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
[13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
[13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
[13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
[13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
[13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
[13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
[13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
[13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
[13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
[13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
[13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
[13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
[13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
[13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)

The original code just triggered the warning but do nothing. It will
caused the shared conntrack moves to the dying list and the packet be
droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

- Reproduce steps:

+----------------------------+
|          br0(bridge)       |
|                            |
+-+---------+---------+------+
  | eth0|   | eth1|   | eth2|
  |     |   |     |   |     |
  +--+--+   +--+--+   +---+-+
     |         |          |
     |         |          |
  +--+-+     +-+--+    +--+-+
  | PC1|     | PC2|    | PC3|
  +----+     +----+    +----+

iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

ps: Our nfq userspace program will set mark on packets whose connection
has already been processed.

PC1 sends broadcast packets simulated by hping3:

hping3 --rand-source --udp 192.168.1.255 -i u100

- Broadcast racing flow chart is as follow:

br_handle_frame
  BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
  // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
  br_handle_frame_finish
    // check if this packet is broadcast
    br_flood_forward
      br_flood
        list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
          maybe_deliver
            deliver_clone
              skb = skb_clone(skb)
              __br_forward
                BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                // queue in our nfq and received by our userspace program
                // goto __nf_conntrack_confirm with process context on CPU 1
    br_pass_frame_up
      BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
      // goto __nf_conntrack_confirm with softirq context on CPU 0

Because conntrack confirm can happen at both INPUT and POSTROUTING
stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
conntrack could race on different core.

This patch fixes a repeating kernel splat, now it is only displayed
once.

Signed-off-by: Chieh-Min Wang <chiehminw@synology.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:14:51 +01:00
..
6lowpan 6lowpan: convert to DEFINE_SHOW_ATTRIBUTE 2018-12-19 00:28:05 +01:00
9p 9p/net: put a lower bound on msize 2018-12-25 17:07:49 +09:00
802
8021q net: core: dev: Add extack argument to dev_change_flags() 2018-12-06 13:26:07 -08:00
appletalk
atm Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
ax25 ax25: fix possible use-after-free 2019-01-23 11:18:00 -08:00
batman-adv bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls 2019-01-22 17:18:08 -08:00
bluetooth Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2018-12-27 13:53:32 -08:00
bpf bpf: add BPF_PROG_TEST_RUN support for flow dissector 2019-01-29 01:08:29 +01:00
bpfilter net: bpfilter: change section name of bpfilter UMH blob. 2019-01-16 15:46:46 -08:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2019-01-28 17:34:38 -08:00
caif Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
can can: bcm: check timer values before ktime conversion 2019-01-22 11:33:46 +01:00
ceph libceph: avoid KEEPALIVE_PENDING races in ceph_con_keepalive() 2019-01-21 14:53:12 +01:00
core Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-01-28 19:38:33 -08:00
dcb
dccp tcp: Refactor pingpong code 2019-01-27 13:29:43 -08:00
decnet net, decnet: use struct_size() in kzalloc() 2019-01-16 13:22:10 -08:00
dns_resolver dns: Allow the dns resolver to retrieve a server set 2018-10-04 09:40:52 -07:00
dsa switchdev: Add extack argument to call_switchdev_notifiers() 2019-01-17 15:18:47 -08:00
ethernet net: ethernet: provide nvmem_get_mac_address() 2018-12-03 15:40:30 -08:00
hsr
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-12-24 16:19:56 -08:00
ife
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2019-01-28 17:34:38 -08:00
ipv6 netfilter: ipv6: avoid indirect calls for IPV6=y case 2019-02-04 18:21:12 +01:00
iucv iucv: Remove SKB list assumptions. 2018-11-10 16:55:11 -08:00
kcm Revert "kcm: remove any offset before parsing messages" 2018-09-17 18:43:42 -07:00
key af_key: fix indentation on declaration statement 2018-11-15 18:09:32 +01:00
l2tp ppp: Move PFC decompression to PPP generic layer 2018-12-20 16:49:30 -08:00
l3mdev l3mdev: add function to retreive upper master 2018-12-03 14:15:26 -08:00
lapb
llc llc: do not use sk_eat_skb() 2018-10-22 19:59:20 -07:00
mac80211 mac80211: Add attribute aligned(2) to struct 'action' 2019-01-25 10:17:25 +01:00
mac802154 mac802154: Remove VLA usage of skcipher 2018-09-28 12:46:07 +08:00
mpls net: mpls: netconf: perform strict checks also for doit handlers 2019-01-19 10:09:59 -08:00
ncsi net/ncsi: Add NCSI Mellanox OEM command 2018-11-27 16:37:20 -08:00
netfilter netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm 2019-02-12 11:14:51 +01:00
netlabel netlabel: check for IPV4MASK in addrinfo_get 2018-09-21 18:58:34 -07:00
netlink net: netlink: add helper to retrieve NETLINK_F_STRICT_CHK 2019-01-19 10:09:58 -08:00
netrom netrom: fix locking in nr_find_socket() 2018-12-30 20:24:16 -08:00
nfc net: Revert recent Spectre-v1 patches. 2018-12-23 16:01:35 -08:00
nsh
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2019-01-28 17:34:38 -08:00
packet af_packet: fix raw sockets over 6in4 tunnel 2019-01-17 15:54:45 -08:00
phonet net: Revert recent Spectre-v1 patches. 2018-12-23 16:01:35 -08:00
psample
qrtr
rds rds: use DIV_ROUND_UP instead of ceil 2019-01-07 07:22:36 -08:00
rfkill rfkill: gpio: Remove unused include 2018-12-18 13:13:56 +01:00
rose
rxrpc Revert "rxrpc: Allow failed client calls to be retried" 2019-01-15 21:33:36 -08:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-21 14:41:32 -08:00
sctp sctp: set flow sport from saddr only when it's 0 2019-01-24 18:13:57 -08:00
smc smc: move unhash as early as possible in smc_release() 2019-01-07 14:40:27 -05:00
strparser bpf, sockmap: convert to generic sk_msg interface 2018-10-15 12:23:19 -07:00
sunrpc SUNRPC: Address Kerberos performance/behavior regression 2019-01-15 15:36:41 -05:00
switchdev switchdev: Add extack argument to call_switchdev_notifiers() 2019-01-17 15:18:47 -08:00
tipc tipc: remove dead code in struct tipc_topsrv 2019-01-24 22:31:23 -08:00
tls net/tls: free ctx in sock destruct 2019-01-22 11:30:54 -08:00
unix Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
vmw_vsock Fix ERROR:do not initialise statics to 0 in af_vsock.c 2019-01-15 20:38:29 -08:00
wimax
wireless cfg80211: extend range deviation for DMG 2019-01-25 10:18:51 +01:00
x25 net/x25: handle call collisions 2018-11-29 14:25:36 -08:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-01-28 19:38:33 -08:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-12-20 11:53:36 -08:00
compat.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
Kconfig net: convert bridge_nf to use skb extension infrastructure 2018-12-19 11:21:37 -08:00
Makefile
socket.c y2038: more syscalls and cleanups 2018-12-28 12:45:04 -08:00
sysctl_net.c