linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2025-01-18 10:56:12 +07:00

Ido Schimmel d965465b60 mlxsw: core: Fix possible deadlock	2017-10-18 12:19:15 +01:00

When an EMAD is transmitted, a timeout work item is scheduled with a delay of 200ms, so that another EMAD will be retried until a maximum of five retries.

In certain situations, it's possible for the function waiting on the EMAD to be associated with a work item that is queued on the same workqueue (`mlxsw_core`) as the timeout work item. This results in flushing a work item on the same workqueue.

According to commit `e159489baa` ("workqueue: relax lockdep annotation on flush_work()") the above may lead to a deadlock in case the workqueue has only one worker active or if the system is under memory pressure and the rescue worker is in use. The latter explains the very rare and random nature of the lockdep splats we have been seeing:

[ 52.730240] ============================================
[ 52.736179] WARNING: possible recursive locking detected
[ 52.742119] 4.14.0-rc3jiri+ #4 Not tainted
[ 52.746697] --------------------------------------------
[ 52.752635] kworker/1:3/599 is trying to acquire lock:
[ 52.758378] (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c4fa4>] flush_work+0x3a4/0x5e0
[ 52.767837] but task is already holding lock:
[ 52.774360] (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[ 52.784495] other info that might help us debug this:
[ 52.791794] Possible unsafe locking scenario:
[ 52.798413] CPU0
[ 52.801144] ----
[ 52.803875] lock(mlxsw_core_driver_name);
[ 52.808556] lock(mlxsw_core_driver_name);
[ 52.813236] *** DEADLOCK ***
[ 52.819857] May be due to missing lock nesting notation
[ 52.827450] 3 locks held by kworker/1:3/599:
[ 52.832221] #0: (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[ 52.842846] #1: ((&(&bridge->fdb_notify.dw)->work)){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[ 52.854537] #2: (rtnl_mutex){+.+.}, at: [<ffffffff822ad8e7>] rtnl_lock+0x17/0x20
[ 52.863021] stack backtrace:
[ 52.867890] CPU: 1 PID: 599 Comm: kworker/1:3 Not tainted 4.14.0-rc3jiri+ #4
[ 52.875773] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[ 52.886267] Workqueue: mlxsw_core mlxsw_sp_fdb_notify_work [mlxsw_spectrum]
[ 52.894060] Call Trace:
[ 52.909122] __lock_acquire+0xf6f/0x2a10
[ 53.025412] lock_acquire+0x158/0x440
[ 53.047557] flush_work+0x3c4/0x5e0
[ 53.087571] __cancel_work_timer+0x3ca/0x5e0
[ 53.177051] cancel_delayed_work_sync+0x13/0x20
[ 53.182142] mlxsw_reg_trans_bulk_wait+0x12d/0x7a0 [mlxsw_core]
[ 53.194571] mlxsw_core_reg_access+0x586/0x990 [mlxsw_core]
[ 53.225365] mlxsw_reg_query+0x10/0x20 [mlxsw_core]
[ 53.230882] mlxsw_sp_fdb_notify_work+0x2a3/0x9d0 [mlxsw_spectrum]
[ 53.237801] process_one_work+0x8f1/0x12f0
[ 53.321804] worker_thread+0x1fd/0x10c0
[ 53.435158] kthread+0x28e/0x370
[ 53.448703] ret_from_fork+0x2a/0x40
[ 53.453017] mlxsw_spectrum 0000:01:00.0: EMAD retries (2/5) (tid=bf4549b100000774)
[ 53.453119] mlxsw_spectrum 0000:01:00.0: EMAD retries (5/5) (tid=bf4549b100000770)
[ 53.453132] mlxsw_spectrum 0000:01:00.0: EMAD reg access failed (tid=bf4549b100000770,reg_id=200b(sfn),type=query,status=0(operation performed))
[ 53.453143] mlxsw_spectrum 0000:01:00.0: Failed to get FDB notifications

Fix this by creating another workqueue for EMAD timeouts, thereby preventing the situation of a work item trying to flush a work item queued on the same workqueue.

Fixes: `caf7297e7a` ("mlxsw: core: Introduce support for asynchronous EMAD register access")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
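
The hazard described in the commit message above can be illustrated outside the kernel. What follows is a rough user-space analogy in Python, not the driver's code: a single-worker `ThreadPoolExecutor` stands in for a workqueue with only one active worker, and the names `timeout_work`, `waiter`, and `run` are hypothetical. Unlike the kernel's flush, the wait here carries a 0.5 s timeout, purely so the demonstration terminates instead of hanging.

```python
import concurrent.futures as cf


def timeout_work():
    """Hypothetical stand-in for the EMAD timeout work item."""
    return "timeout handled"


def waiter(timeout_queue):
    """Hypothetical stand-in for a work item that waits on the timeout work item."""
    fut = timeout_queue.submit(timeout_work)
    try:
        # The kernel flush waits indefinitely; a timeout is used here only
        # so that the stalled case can be observed and reported.
        return fut.result(timeout=0.5)
    except cf.TimeoutError:
        fut.cancel()  # still pending, so it can be dropped from the queue
        return "stalled"


def run(shared):
    """Run the waiter with the timeout item on the same or on a separate queue."""
    main_queue = cf.ThreadPoolExecutor(max_workers=1)
    if shared:
        timeout_queue = main_queue  # the situation before the fix
    else:
        timeout_queue = cf.ThreadPoolExecutor(max_workers=1)  # dedicated queue
    try:
        return main_queue.submit(waiter, timeout_queue).result()
    finally:
        main_queue.shutdown()
        if timeout_queue is not main_queue:
            timeout_queue.shutdown()


if __name__ == "__main__":
    print("shared queue:  ", run(shared=True))
    print("separate queue:", run(shared=False))
```

With a shared queue the only worker is busy running `waiter`, so the item it waits on can never start. With a dedicated queue the wait completes, which mirrors the approach the commit takes by giving EMAD timeouts their own workqueue.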
..
appletalk
arcnet
bonding	net: bonding: fix tlb_dynamic_lb default value	2017-09-12 20:58:12 -07:00
caif
can
cris
dsa	net: dsa: mv88e6060: fix switch MAC address	2017-10-14 18:40:03 -07:00
ethernet	mlxsw: core: Fix possible deadlock	2017-10-18 12:19:15 +01:00
fddi	net: defxx: constify eisa_device_id	2017-08-19 17:13:41 -07:00
fjes
hamradio
hippi
hyperv	hv_netvsc: fix send buffer failure on MTU change	2017-09-21 15:17:16 -07:00
ieee802154	ieee802154: ca8210: Fix a potential NULL pointer dereference	2017-08-20 20:51:30 +02:00
ipvlan	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next	2017-09-03 17:08:42 -07:00
phy	net: phy: Fix truncation of large IRQ numbers in phy_attached_print()	2017-09-21 20:35:17 -07:00
plip
ppp	ppp: fix race in ppp device destruction	2017-10-06 10:16:34 -07:00
slip
team
usb	cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan	2017-10-09 16:03:32 -07:00
vmxnet3
wan	- For the randstruct plugin, enable automatic randomization of structures	2017-09-07 20:30:19 -07:00
wimax	wimax/i2400m: Remove VLAIS	2017-10-10 12:35:05 -07:00
wireless	iwlwifi: nvm: set the correct offsets to 3168 series	2017-10-06 13:59:44 +03:00
xen-netback	xen-netback: update ubuf_info initialization to anonymous union	2017-08-28 15:11:50 -07:00
dummy.c
eql.c
geneve.c
gtp.c
ifb.c
Kconfig	x86/lguest: Remove lguest support	2017-08-24 09:57:28 +02:00
LICENSE.SRC
loopback.c
macsec.c	macsec: fix memory leaks when skb_to_sgvec fails	2017-10-11 14:07:20 -07:00
macvlan.c
macvtap.c
Makefile	irda: move drivers/net/irda to drivers/staging/irda/drivers	2017-08-28 16:42:57 -07:00
mdio.c
mii.c
netconsole.c
nlmon.c
ntb_netdev.c
rionet.c
sb1000.c
Space.c
sungem_phy.c
tap.c
tun.c	tun: call dev_get_valid_name() before register_netdevice()	2017-10-16 21:02:54 +01:00
veth.c
virtio_net.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-09-01 17:42:05 -07:00
vrf.c	net: vrf: avoid gcc-4.6 warning	2017-09-15 14:22:21 -07:00
vsockmon.c
vxlan.c	vxlan: factor out VXLAN-GPE next protocol	2017-08-29 15:16:52 -07:00
xen-netfront.c	xen-netfront: be more drop monitor friendly	2017-08-30 15:56:16 -07:00