mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2025-01-26 19:29:37 +07:00
2c99a547bc
If device reset/suspend/resume fails for some reason, the dqm lock is held forever and this causes a deadlock. Below is a kernel backtrace from when an application opens kfd after suspend/resume has failed.

Instead of holding the dqm lock in pre_reset and releasing it in post_reset, add a dqm->sched_running flag which is modified in dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection of its own because all writes and reads happen inside the dqm lock.

For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the sched_running flag before sending the updated runlist.

v2: For the no-HWS case, when the device is stopped, don't call load/destroy_mqd for eviction, restore and create queue, and avoid the debugfs dump of hqds.

Backtrace of dqm lock deadlock:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
[Thu Oct 17 16:43:37 2019] Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo D 0 3024 2947 0x80000000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019] schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019] kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 [amdgpu]
[Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
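
The flag-based approach described in the commit message can be illustrated with a small user-space sketch. This is not the driver code: the struct, the `sketch_*` function names, the `queue_count` and `runlists_sent` counters, and the use of a pthread mutex in place of the dqm lock are all hypothetical stand-ins chosen for illustration. The only ideas taken from the commit message are that `sched_running` is set in start and cleared in stop, that it is only touched while the lock is held, and that the map path checks it before sending a runlist.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a device queue manager. */
struct sketch_dqm {
    pthread_mutex_t lock;  /* plays the role of the dqm lock */
    bool sched_running;    /* read and written only with the lock held */
    int queue_count;       /* queues known to the manager */
    int runlists_sent;     /* counts simulated runlist submissions */
};

static void sketch_dqm_init(struct sketch_dqm *dqm)
{
    pthread_mutex_init(&dqm->lock, NULL);
    dqm->sched_running = false;
    dqm->queue_count = 0;
    dqm->runlists_sent = 0;
}

/* Caller must hold dqm->lock. Skips the hardware step while stopped. */
static int sketch_map_queues(struct sketch_dqm *dqm)
{
    if (!dqm->sched_running)
        return 0;          /* device stopped: bookkeeping only */
    if (dqm->queue_count == 0)
        return 0;          /* nothing to schedule */
    dqm->runlists_sent++;  /* stand-in for sending the updated runlist */
    return 0;
}

/* Stand-in for the start op: mark the scheduler as running, then map. */
static int sketch_start(struct sketch_dqm *dqm)
{
    int ret;

    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = true;
    ret = sketch_map_queues(dqm);
    pthread_mutex_unlock(&dqm->lock);
    return ret;
}

/* Stand-in for the stop op: the lock is released again on return,
 * so a failed resume cannot leave it held. */
static void sketch_stop(struct sketch_dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = false;
    pthread_mutex_unlock(&dqm->lock);
}

/* Queue creation still succeeds while stopped; it only updates state. */
static int sketch_create_queue(struct sketch_dqm *dqm)
{
    int ret;

    pthread_mutex_lock(&dqm->lock);
    dqm->queue_count++;
    ret = sketch_map_queues(dqm);
    pthread_mutex_unlock(&dqm->lock);
    return ret;
}
```

In this sketch a caller that arrives while the device is stopped takes the lock, updates the software state, and returns, rather than blocking on a lock that a failed reset or resume never released.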

cik_event_interrupt.c
cik_int.h
cik_regs.h
cwsr_trap_handler_gfx8.asm
cwsr_trap_handler_gfx9.asm
cwsr_trap_handler_gfx10.asm
cwsr_trap_handler.h
Kconfig
kfd_chardev.c
kfd_crat.c
kfd_crat.h
kfd_dbgdev.c
kfd_dbgdev.h
kfd_dbgmgr.c
kfd_dbgmgr.h
kfd_debugfs.c
kfd_device_queue_manager_cik.c
kfd_device_queue_manager_v9.c
kfd_device_queue_manager_v10.c
kfd_device_queue_manager_vi.c
kfd_device_queue_manager.c
kfd_device_queue_manager.h
kfd_device.c
kfd_doorbell.c
kfd_events.c
kfd_events.h
kfd_flat_memory.c
kfd_int_process_v9.c
kfd_interrupt.c
kfd_iommu.c
kfd_iommu.h
kfd_kernel_queue_cik.c
kfd_kernel_queue_v9.c
kfd_kernel_queue_v10.c
kfd_kernel_queue_vi.c
kfd_kernel_queue.c
kfd_kernel_queue.h
kfd_module.c
kfd_mqd_manager_cik.c
kfd_mqd_manager_v9.c
kfd_mqd_manager_v10.c
kfd_mqd_manager_vi.c
kfd_mqd_manager.c
kfd_mqd_manager.h
kfd_packet_manager.c
kfd_pasid.c
kfd_pm4_headers_ai.h
kfd_pm4_headers_diq.h
kfd_pm4_headers_vi.h
kfd_pm4_headers.h
kfd_pm4_opcodes.h
kfd_priv.h
kfd_process_queue_manager.c
kfd_process.c
kfd_queue.c
kfd_topology.c
kfd_topology.h
Makefile
soc15_int.h