linux_dsm_epyc7002/drivers
Kaike Wan f92e487188 IB/rdmavt: Reset all QPs when the device is shut down
When the hfi1 device is shut down during a system reboot, it is possible
that some QPs might have not not freed by ULPs. More requests could be
post sent and a lingering timer could be triggered to schedule more packet
sends, leading to a crash:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
  IP: [ffffffff810a65f2] __queue_work+0x32/0x3c0
  PGD 0
  Oops: 0000 1 SMP
  Modules linked in: nvmet_rdma(OE) nvmet(OE) nvme(OE) dm_round_robin nvme_rdma(OE) nvme_fabrics(OE) nvme_core(OE) pal_raw(POE) pal_pmt(POE) pal_cache(POE) pal_pile(POE) pal(POE) pal_compatible(OE) rpcrdma sunrpc ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_ssif pcspkr ses enclosure joydev scsi_transport_sas i2c_i801 sg mei_me lpc_ich mei ioatdma shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad dm_multipath hangcheck_timer ip_tables ext4 mbcache jbd2 mlx4_en
  sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core crct10dif_pclmul crct10dif_common hfi1(OE) igb crc32c_intel rdmavt(OE) ahci ib_core libahci libata ptp megaraid_sas pps_core dca i2c_algo_bit i2c_core devlink dm_mirror dm_region_hash dm_log dm_mod
  CPU: 23 PID: 0 Comm: swapper/23 Tainted: P OE ------------ 3.10.0-693.el7.x86_64 #1
  Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0028.121720182203 12/17/2018
  task: ffff8808f4ec4f10 ti: ffff8808f4ed8000 task.ti: ffff8808f4ed8000
  RIP: 0010:[ffffffff810a65f2] [ffffffff810a65f2] __queue_work+0x32/0x3c0
  RSP: 0018:ffff88105df43d48 EFLAGS: 00010046
  RAX: 0000000000000086 RBX: 0000000000000086 RCX: 0000000000000000
  RDX: ffff880f74e758b0 RSI: 0000000000000000 RDI: 000000000000001f
  RBP: ffff88105df43d80 R08: ffff8808f3c583c8 R09: ffff8808f3c58000
  R10: 0000000000000002 R11: ffff88105df43da8 R12: ffff880f74e758b0
  R13: 000000000000001f R14: 0000000000000000 R15: ffff88105a300000
  FS: 0000000000000000(0000) GS:ffff88105df40000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000102 CR3: 00000000019f2000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Stack:
  ffff88105b6dd708 0000001f00000286 0000000000000086 ffff88105a300000
  ffff880f74e75800 0000000000000000 ffff88105a300000 ffff88105df43d98
  ffffffff810a6b85 ffff88105a301e80 ffff88105df43dc8 ffffffffc0224cde
  Call Trace:
  IRQ

  [ffffffff810a6b85] queue_work_on+0x45/0x50
  [ffffffffc0224cde] _hfi1_schedule_send+0x6e/0xc0 [hfi1]
  [ffffffffc0170570] ? get_map_page+0x60/0x60 [rdmavt]
  [ffffffffc0224d62] hfi1_schedule_send+0x32/0x70 [hfi1]
  [ffffffffc0170644] rvt_rc_timeout+0xd4/0x120 [rdmavt]
  [ffffffffc0170570] ? get_map_page+0x60/0x60 [rdmavt]
  [ffffffff81097316] call_timer_fn+0x36/0x110
  [ffffffffc0170570] ? get_map_page+0x60/0x60 [rdmavt]
  [ffffffff8109982d] run_timer_softirq+0x22d/0x310
  [ffffffff81090b3f] __do_softirq+0xef/0x280
  [ffffffff816b6a5c] call_softirq+0x1c/0x30
  [ffffffff8102d3c5] do_softirq+0x65/0xa0
  [ffffffff81090ec5] irq_exit+0x105/0x110
  [ffffffff816b76c2] smp_apic_timer_interrupt+0x42/0x50
  [ffffffff816b5c1d] apic_timer_interrupt+0x6d/0x80
  EOI

  [ffffffff81527a02] ? cpuidle_enter_state+0x52/0xc0
  [ffffffff81527b48] cpuidle_idle_call+0xd8/0x210
  [ffffffff81034fee] arch_cpu_idle+0xe/0x30
  [ffffffff810e7bca] cpu_startup_entry+0x14a/0x1c0
  [ffffffff81051af6] start_secondary+0x1b6/0x230
  Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 be 02 00 00 41 f6 86 02 01 00 00 01 0f 85 58 02 00 00 49 c7 c7 28 19 01 00
  RIP [ffffffff810a65f2] __queue_work+0x32/0x3c0
  RSP ffff88105df43d48
  CR2: 0000000000000102

The solution is to reset the QPs before the device resources are freed.
This reset will change the QP state to prevent post sends and delete
timers to prevent callbacks.

Fixes: 0acb0cc7ec ("IB/rdmavt: Initialize and teardown of qpn table")
Link: https://lore.kernel.org/r/20200210131040.87408.38161.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-02-11 11:41:32 -04:00
..
accessibility
acpi Additional ACPI updates for 5.6-rc1 2020-02-07 12:51:54 -08:00
amba
android for-5.6/io_uring-vfs-2020-01-29 2020-01-29 18:53:37 -08:00
ata libata-5.6-2020-02-05 2020-02-06 06:11:50 +00:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
auxdisplay
base ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
bcma Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
block Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-02-08 13:26:41 -08:00
bluetooth Bluetooth: btrtl: Use kvmalloc for FW allocations 2020-01-24 19:57:53 +01:00
bus ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
cdrom
char treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
clk ARM: SoC: late updates 2020-02-08 14:17:27 -08:00
clocksource ARM: SoC: late updates 2020-02-08 14:17:27 -08:00
connector
counter
cpufreq ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
cpuidle ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
crypto Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
dax
dca
devfreq
dio
dma ARM: Device-tree updates 2020-02-08 13:58:44 -08:00
dma-buf
edac ioremap changes for 5.6 2020-01-27 13:03:00 -08:00
eisa
extcon
firewire
firmware ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
fpga
fsi
gnss
gpio treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
gpu Kbuild updates for v5.6 (2nd) 2020-02-09 16:05:50 -08:00
greybus
hid drm pull for 5.6-rc1 2020-01-30 08:04:01 -08:00
hsi
hv - Most of the commits here are work to enable host-initiated hibernation 2020-02-03 14:42:03 +00:00
hwmon ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
hwspinlock hwspinlock: sirf: Use devm_hwspin_lock_register() to register hwlock controller 2020-01-21 16:16:36 -08:00
hwtracing
i2c Merge branch 'i2c/for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2020-02-07 12:54:13 -08:00
i3c
ide proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
idle intel_idle: Introduce 'states_off' module parameter 2020-02-03 11:57:18 +01:00
iio chrome platform changes for 5.6 2020-02-04 07:17:41 +00:00
infiniband IB/rdmavt: Reset all QPs when the device is shut down 2020-02-11 11:41:32 -04:00
input Merge branch 'akpm' (patches from Andrew) 2020-02-04 07:24:48 +00:00
interconnect
iommu IOMMU Updates for Linux v5.6 2020-02-05 17:49:54 +00:00
ipack
irqchip irqchip/gic-v4.1: Avoid 64bit division for the sake of 32bit ARM 2020-02-09 15:47:37 -08:00
isdn proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
leds leds: lm3532: add pointer to documentation and fix typo 2020-01-22 21:08:24 +01:00
lightnvm
macintosh powerpc updates for 5.6 2020-02-04 13:06:46 +00:00
mailbox
mcb
md Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-02-08 13:04:49 -08:00
media chrome platform changes for 5.6 2020-02-04 07:17:41 +00:00
memory
memstick
message Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
mfd chrome platform changes for 5.6 2020-02-04 07:17:41 +00:00
misc Merge branch 'i2c/for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2020-02-07 12:54:13 -08:00
mmc ioremap changes for 5.6 2020-01-27 13:03:00 -08:00
mtd treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
mux
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-02-08 17:15:08 -08:00
nfc Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
ntb
nubus
nvdimm mm: Cleanup __put_devmap_managed_page() vs ->page_free() 2020-01-31 10:30:37 -08:00
nvme block-5.6-2020-02-05 2020-02-06 06:15:23 +00:00
nvmem Merge branch 'i2c/for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2020-02-07 12:54:13 -08:00
of ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
opp ioremap changes for 5.6 2020-01-27 13:03:00 -08:00
oprofile
parisc proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
parport
pci pci-v5.6-fixes-1 2020-02-06 14:17:38 +00:00
pcmcia
perf
phy treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
pinctrl pinctrl: fix pxa2xx.c build warnings 2020-02-04 03:05:24 +00:00
platform Merge branch 'akpm' (patches from Andrew) 2020-02-04 07:24:48 +00:00
pnp proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
power ARM: SoC platform updates 2020-02-08 13:55:25 -08:00
powercap
pps
ps3
ptp Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
pwm pwm: Remove set but not set variable 'pwm' 2020-01-20 15:40:49 +01:00
rapidio
ras
regulator - New Drivers 2020-02-03 14:51:57 +00:00
remoteproc remoteproc: qcom: q6v5-mss: Improve readability of reset_assert 2020-01-24 09:34:07 -08:00
reset
rpmsg rpmsg: add rpmsg support for mt8183 SCP. 2020-01-20 10:29:56 -08:00
rtc chrome platform changes for 5.6 2020-02-04 07:17:41 +00:00
s390 s390 updates for the 5.6 merge window #2 2020-02-05 17:33:35 +00:00
sbus
scsi SCSI misc on 20200208 2020-02-08 17:24:41 -08:00
sfi
sh
siox
slimbus
soc ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
soundwire
spi treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
spmi
ssb
staging proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
target SCSI misc on 20200129 2020-01-29 18:16:16 -08:00
tc The main MIPS changes for 5.6: 2020-01-31 11:28:31 -08:00
tee ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
thermal - Fix a SEVERE docs build failure for cpu idle cooling device (Randy Dunlap) 2020-01-31 14:39:21 -08:00
thunderbolt
tty Kbuild updates for v5.6 (2nd) 2020-02-09 16:05:50 -08:00
uio
usb Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-02-08 13:26:41 -08:00
vfio VFIO updates for v5.6-rc1 2020-02-03 22:22:05 +00:00
vhost
video Kbuild updates for v5.6 (2nd) 2020-02-09 16:05:50 -08:00
virt
virtio virtio_balloon: Fix memory leaks on errors in virtballoon_probe() 2020-02-06 03:40:27 -05:00
visorbus
vlynq
vme Char/Misc driver changes for 5.6-rc1 2020-01-29 10:35:54 -08:00
w1 Char/Misc driver changes for 5.6-rc1 2020-01-29 10:35:54 -08:00
watchdog linux-watchdog 5.6-rc1 tag 2020-02-07 12:30:16 -08:00
xen xen: branch for v5.6-rc1 2020-02-05 17:44:14 +00:00
zorro Kbuild updates for v5.6 (2nd) 2020-02-09 16:05:50 -08:00
Kconfig
Makefile