linux_dsm_epyc7002/net/core
Eric Dumazet 39e6c8208d net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
available.
                                          device hard irq

                                          // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:50:58 -08:00
..
datagram.c udp: properly cope with csum errors 2017-02-07 11:19:00 -05:00
dev_addr_lists.c net: fix spelling for synchronized 2014-11-18 15:26:32 -05:00
dev_ioctl.c dev_ioctl: use sizeof(x) instead of sizeof x 2014-11-18 15:27:32 -05:00
dev.c net: solve a NAPI race 2017-03-01 09:50:58 -08:00
devlink.c devlink: allow to fillup eswitch attrs even if mode_get op does not exist 2017-02-10 14:43:00 -05:00
drop_monitor.c drop_monitor: consider inserted data in genlmsg_end 2017-01-03 11:09:44 -05:00
dst_cache.c net: dst_cache_per_cpu_dst_set() can be static 2016-03-18 17:45:08 -04:00
dst.c net: pending_confirm is not used anymore 2017-02-07 13:07:47 -05:00
ethtool.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-07 16:29:30 -05:00
fib_rules.c net: core: add missing check for uid_range in rule_exists. 2016-11-09 13:28:10 -05:00
filter.c bpf: Fix bpf_xdp_event_output 2017-02-23 13:53:42 -05:00
flow_dissector.c flow dissector: check if arp_eth is null rather than arp 2017-01-16 13:48:48 -05:00
flow.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-12-12 19:25:04 -08:00
gen_estimator.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
gen_stats.c net_sched: gen_estimator: complete rewrite of rate estimators 2016-12-05 15:21:59 -05:00
gro_cells.c gro_cells: move to net/core/gro_cells.c 2017-02-08 14:38:18 -05:00
hwbm.c net: hwbm: Fix unbalanced spinlock in error case 2016-05-25 12:35:09 -07:00
link_watch.c dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
lwt_bpf.c lwtunnel: remove device arg to lwtunnel_build_state 2017-01-30 15:14:22 -05:00
lwtunnel.c lwtunnel: remove device arg to lwtunnel_build_state 2017-01-30 15:14:22 -05:00
Makefile gro_cells: move to net/core/gro_cells.c 2017-02-08 14:38:18 -05:00
neighbour.c net: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification 2017-02-15 12:38:43 -05:00
net_namespace.c Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit 2016-12-14 14:06:40 -08:00
net-procfs.c net: remove NETDEV_TX_LOCKED support 2016-04-26 15:53:05 -04:00
net-sysfs.c net: Add support for XPS with QoS via traffic classes 2016-10-31 15:00:48 -04:00
net-sysfs.h net: netdev_kobject_init: annotate with __init 2014-01-05 20:27:54 -05:00
net-traces.c net: IPv6 fib lookup tracepoint 2015-11-22 11:54:10 -05:00
netclassid_cgroup.c core: remove unneded headers for net cgroup controllers. 2016-02-17 15:31:27 -05:00
netevent.c netevent: remove automatic variable in register_netevent_notifier() 2015-05-31 00:03:21 -07:00
netpoll.c netpoll: more efficient locking 2016-11-16 18:32:02 -05:00
netprio_cgroup.c net: cgroups: fix build errors when linux/phy*.h is removed from net/dsa.h 2017-02-10 13:51:01 -05:00
pktgen.c net-tc: convert tc_verd to integer bitfields 2017-01-08 20:58:52 -05:00
ptp_classifier.c ptp: Change ptp_class to a proper bitmask 2015-11-03 11:08:22 -05:00
request_sock.c ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rtnetlink.c rtnl: simplify error return path in rtnl_create_link() 2017-02-21 12:17:43 -05:00
scm.c scm: remove use CMSG{_COMPAT}_ALIGN(sizeof(struct {compat_}cmsghdr)) 2017-01-04 13:04:37 -05:00
secure_seq.c secure_seq: fix sparse errors 2017-01-12 15:57:10 -05:00
skbuff.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2017-02-03 16:58:20 -05:00
sock_diag.c sock_diag: align nlattr properly when needed 2016-04-26 12:00:48 -04:00
sock_reuseport.c soreuseport: do not export reuseport_add_sock() 2016-10-18 14:18:23 -04:00
sock.c net: sock: Use USEC_PER_SEC macro instead of literal 1000000 2017-02-21 12:25:21 -05:00
stream.c net: fix sleeping for sk_wait_event() 2016-11-14 13:17:21 -05:00
sysctl_net_core.c bpf: make jited programs visible in traces 2017-02-17 13:40:05 -05:00
timestamping.c net: skb_defer_rx_timestamp should check for phydev before setting up classify 2015-07-09 14:17:15 -07:00
tso.c net: tso: add support for IPv6 2015-10-26 22:24:22 -07:00
utils.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00