linux_dsm_epyc7002/drivers/gpu/drm/amd/amdkfd
Philip Yang 2c99a547bc drm/amdkfd: don't use dqm lock during device reset/suspend/resume
If device reset/suspend/resume fails for some reason, the dqm lock is
held forever, which causes a deadlock. Below is a kernel backtrace from
an application opening kfd after suspend/resume failed.

Instead of holding the dqm lock in pre_reset and releasing it in
post_reset, add a dqm->sched_running flag that is modified in
dqm->ops.start and dqm->ops.stop. The flag doesn't need separate lock
protection because all writes and reads happen inside the dqm lock.

In the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
sched_running flag before sending the updated runlist.
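
The flag-based approach can be illustrated with a small userspace sketch. This is not the kernel code: struct toy_dqm, the toy_* functions and the runlists_sent counter are hypothetical stand-ins, and a pthread mutex plays the role of the dqm lock. It only aims to show that start/stop release the lock before returning, so a failed resume leaves the scheduler marked as stopped rather than leaving the lock held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device queue manager; not the kernel struct. */
struct toy_dqm {
    pthread_mutex_t lock;   /* plays the role of the dqm lock */
    bool sched_running;     /* only read or written with lock held */
    int runlists_sent;      /* counts simulated runlist submissions */
};

void toy_dqm_init(struct toy_dqm *dqm)
{
    pthread_mutex_init(&dqm->lock, NULL);
    dqm->sched_running = false;
    dqm->runlists_sent = 0;
}

/* start/stop hold the lock only while flipping the flag. */
void toy_dqm_start(struct toy_dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = true;
    pthread_mutex_unlock(&dqm->lock);
}

void toy_dqm_stop(struct toy_dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = false;
    pthread_mutex_unlock(&dqm->lock);
}

/* Returns 1 if a runlist was sent, 0 if it was skipped because stopped. */
int toy_map_queues(struct toy_dqm *dqm)
{
    int sent = 0;

    pthread_mutex_lock(&dqm->lock);
    if (dqm->sched_running) {
        dqm->runlists_sent++;
        sent = 1;
    }
    pthread_mutex_unlock(&dqm->lock);
    return sent;
}

/* Simulated resume: the flag is set only if hardware bring-up succeeds. */
int toy_resume(struct toy_dqm *dqm, bool hw_ok)
{
    if (!hw_ok)
        return -1;          /* scheduler stays stopped, lock is free */
    toy_dqm_start(dqm);
    return 0;
}

/* Walks through suspend, a failed resume, and a successful resume. */
void toy_demo(void)
{
    struct toy_dqm dqm;

    toy_dqm_init(&dqm);
    toy_dqm_start(&dqm);
    assert(toy_map_queues(&dqm) == 1);
    toy_dqm_stop(&dqm);                     /* suspend */
    assert(toy_resume(&dqm, false) == -1);  /* resume fails */
    assert(toy_map_queues(&dqm) == 0);      /* returns instead of blocking */
    assert(toy_resume(&dqm, true) == 0);
    assert(toy_map_queues(&dqm) == 1);
}
```

Calling toy_demo() from a main function exercises the sequence. With the old scheme, the equivalent of the failed resume would have returned with the lock still held, and the next toy_map_queues() call would block forever.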

v2: In the no-HWS case, when the device is stopped, don't call
load/destroy_mqd for eviction, restore and create queue, and skip
the debugfs dump of hqds.
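
A rough sketch of the no-HWS behaviour described in v2 follows. Everything here is hypothetical: struct toy_nohws, its fields and the toy_* functions are illustrative names, locking is omitted for brevity, and the sketch assumes that software bookkeeping still proceeds while the hardware calls are skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical no-HWS queue manager; field names are illustrative only. */
struct toy_nohws {
    bool sched_running;  /* set by start, cleared by stop */
    int queues;          /* number of queues tracked in software */
    int active;          /* queues currently marked active */
    int hw_calls;        /* simulated load_mqd/destroy_mqd invocations */
};

/* Create: always track the queue, touch hardware only while running. */
void toy_create_queue(struct toy_nohws *d)
{
    d->queues++;
    d->active++;
    if (d->sched_running)
        d->hw_calls++;               /* stands in for load_mqd */
}

/* Evict: mark every queue inactive, unload only while running. */
void toy_evict(struct toy_nohws *d)
{
    if (d->sched_running)
        d->hw_calls += d->active;    /* stands in for destroy_mqd */
    d->active = 0;
}

/* Restore: mark every queue active again, load only while running. */
void toy_restore(struct toy_nohws *d)
{
    if (d->sched_running)
        d->hw_calls += d->queues - d->active;
    d->active = d->queues;
}

/* Debugfs-style dump: -1 means "device stopped, hardware not read". */
int toy_dump_hqds(const struct toy_nohws *d)
{
    return d->sched_running ? d->active : -1;
}

/* Evict while running, then restore and dump while stopped. */
void toy_nohws_demo(void)
{
    struct toy_nohws d = { .sched_running = true };

    toy_create_queue(&d);
    assert(d.hw_calls == 1);
    toy_evict(&d);
    assert(d.hw_calls == 2);
    d.sched_running = false;         /* device stopped */
    toy_restore(&d);                 /* bookkeeping only */
    assert(d.active == 1 && d.hw_calls == 2);
    assert(toy_dump_hqds(&d) == -1);
}
```

The hw_calls counter makes the guard visible: it advances only while sched_running is true, whereas queues and active are updated in either state.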

Backtrace of dqm lock deadlock:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
[Thu Oct 17 16:43:37 2019]       Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947 0x80000000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
[Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-25 16:50:10 -04:00
cik_event_interrupt.c drm/amdkfd: Eliminate get_atc_vmid_pasid_mapping_valid 2019-10-03 09:11:04 -05:00
cik_int.h drm/amdkfd: Clean up reference of radeon 2018-07-11 22:33:08 -04:00
cik_regs.h drm/amdkfd: Delete a duplicate statement in set_pasid_vmid_mapping() 2018-11-05 14:21:13 -05:00
cwsr_trap_handler_gfx8.asm drm/amdkfd: Remove dead code from gfx8/gfx9 trap handlers 2019-07-30 23:22:18 -05:00
cwsr_trap_handler_gfx9.asm drm/amdkfd: Remove dead code from gfx8/gfx9 trap handlers 2019-07-30 23:22:18 -05:00
cwsr_trap_handler_gfx10.asm drm/amdkfd: Fix race in gfx10 context restore handler 2019-10-03 09:11:04 -05:00
cwsr_trap_handler.h drm/amdkfd: Fix race in gfx10 context restore handler 2019-10-03 09:11:04 -05:00
Kconfig treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
kfd_chardev.c drm/amdkfd: Improve KFD IOCTL printing 2019-10-03 09:11:05 -05:00
kfd_crat.c drm/amdkfd: use navi12 specific family id for navi12 code path 2019-10-03 09:11:03 -05:00
kfd_crat.h drm/amdkfd: Adjust weight to represent num_hops info when report xgmi iolink 2019-05-24 12:20:48 -05:00
kfd_dbgdev.c drm/amdkfd: Eliminate get_atc_vmid_pasid_mapping_valid 2019-10-03 09:11:04 -05:00
kfd_dbgdev.h drm/amdkfd: Clean up reference of radeon 2018-07-11 22:33:08 -04:00
kfd_dbgmgr.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_dbgmgr.h drm/amdkfd: Clean up KFD style errors and warnings v2 2017-08-15 23:00:04 -04:00
kfd_debugfs.c amdkfd: no need to check return value of debugfs_create functions 2019-06-13 13:59:49 -05:00
kfd_device_queue_manager_cik.c drm/amdkfd: Introduce asic-specific mqd_manager_init function 2019-05-24 12:21:02 -05:00
kfd_device_queue_manager_v9.c drm/amdkfd: Consistently apply noretry setting 2019-07-16 13:02:55 -05:00
kfd_device_queue_manager_v10.c drm/amdkfd: Add navi10 support to amdkfd. (v3) 2019-06-21 18:59:24 -05:00
kfd_device_queue_manager_vi.c drm/amdkfd: Introduce asic-specific mqd_manager_init function 2019-05-24 12:21:02 -05:00
kfd_device_queue_manager.c drm/amdkfd: don't use dqm lock during device reset/suspend/resume 2019-10-25 16:50:10 -04:00
kfd_device_queue_manager.h drm/amdkfd: don't use dqm lock during device reset/suspend/resume 2019-10-25 16:50:10 -04:00
kfd_device.c drm/amdkfd: don't use dqm lock during device reset/suspend/resume 2019-10-25 16:50:10 -04:00
kfd_doorbell.c drm/amdkfd: Fix kernel queue 64 bit doorbell offset calculation 2018-07-11 22:33:01 -04:00
kfd_events.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_events.h drm/amdkfd: Implement GPU reset handlers in KFD 2018-07-11 22:32:56 -04:00
kfd_flat_memory.c drm/amdkfd: Check against device cgroup 2019-10-07 15:11:38 -05:00
kfd_int_process_v9.c drm/amdkfd: Query vmid pasid mapping through stored info for non HWS 2019-10-03 09:11:03 -05:00
kfd_interrupt.c drm/amdkfd: fix a potential NULL pointer dereference (v2) 2019-10-03 09:11:00 -05:00
kfd_iommu.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_iommu.h drm/amdkfd: Centralize IOMMUv2 code and make it conditional 2017-12-08 19:22:12 -05:00
kfd_kernel_queue_cik.c drm/amdkfd: Add 64-bit doorbell and wptr support to kernel queue 2018-04-08 22:03:51 -04:00
kfd_kernel_queue_v9.c drm/amdkfd: Support bigger gds size 2019-07-18 14:18:03 -05:00
kfd_kernel_queue_v10.c drm/amdkfd: Add navi10 support to amdkfd. (v3) 2019-06-21 18:59:24 -05:00
kfd_kernel_queue_vi.c drm/amdkfd: Delete alloc_format field from map_queue struct 2019-05-24 12:21:03 -05:00
kfd_kernel_queue.c drm/amdkfd: use navi12 specific family id for navi12 code path 2019-10-03 09:11:03 -05:00
kfd_kernel_queue.h drm/amdkfd: Add navi10 support to amdkfd. (v3) 2019-06-21 18:59:24 -05:00
kfd_module.c drm/amdkfd: add missing void argument to function kgd2kfd_init 2019-10-07 15:10:26 -05:00
kfd_mqd_manager_cik.c drm/amdkfd: Separate mqd allocation and initialization 2019-06-11 12:56:59 -05:00
kfd_mqd_manager_v9.c drm/amdkfd: Extend CU mask to 8 SEs (v3) 2019-08-02 10:19:11 -05:00
kfd_mqd_manager_v10.c drm/amdkfd: Move the control stack on GFX10 to userspace buffer 2019-10-03 09:11:03 -05:00
kfd_mqd_manager_vi.c drm/amdkfd: Separate mqd allocation and initialization 2019-06-11 12:56:59 -05:00
kfd_mqd_manager.c drm/amdkfd: Extend CU mask to 8 SEs (v3) 2019-08-02 10:19:11 -05:00
kfd_mqd_manager.h drm/amdkfd: Extend CU mask to 8 SEs (v3) 2019-08-02 10:19:11 -05:00
kfd_packet_manager.c drm/amdkfd: use navi12 specific family id for navi12 code path 2019-10-03 09:11:03 -05:00
kfd_pasid.c drm/amdkfd: Simplify kfd2kgd interface 2018-11-05 14:21:07 -05:00
kfd_pm4_headers_ai.h drm/amdkfd: Support bigger gds size 2019-07-18 14:18:03 -05:00
kfd_pm4_headers_diq.h drm/amdkfd: Add skeleton H/W debugger module support 2015-06-03 11:32:28 +03:00
kfd_pm4_headers_vi.h drm/amdkfd: Delete alloc_format field from map_queue struct 2019-05-24 12:21:03 -05:00
kfd_pm4_headers.h drm/amdkfd: Update PM4 packet headers 2017-08-15 23:00:15 -04:00
kfd_pm4_opcodes.h amdkfd: Add kernel queue module 2014-07-17 00:45:35 +03:00
kfd_priv.h drm/amdkfd: update for drmP.h removal 2019-10-09 12:04:48 -05:00
kfd_process_queue_manager.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_process.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_queue.c drm/amdkfd: use %px to print user space address instead of %p 2018-05-01 17:56:04 -04:00
kfd_topology.c drm/amdkfd: Check against device cgroup 2019-10-07 15:11:38 -05:00
kfd_topology.h drm/amdkfd: Store kfd_dev in iolink and cache properties 2019-10-03 09:11:03 -05:00
Makefile drm/amdkfd: Add navi10 support to amdkfd. (v3) 2019-06-21 18:59:24 -05:00
soc15_int.h drm/amdkfd: Add SOC15 interrupt processing support 2018-04-10 17:33:10 -04:00