Commit Graph

3416 Commits

Author SHA1 Message Date
Chunming Zhou
6f16b4fb60 drm/amdgpu: use dep_sync for CS dependency/syncobj
Otherwise, they could be optimized by scheduled fence.

Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:47:21 -05:00
Xiangliang.Yu
75737cb4eb drm/amdgpu/gfx8: Fix compute ring failure after resetting
Do ring clear before ring test, otherwise compute ring test will
fail after gpu resetting. Still can't find the root cause, just
workaround it.

Signed-off-by: Xiangliang.Yu <Xiangliang.Yu@amd.com>
Acked-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:47:20 -05:00
Pixel Ding
1daee8b472 drm/amdgpu: revise retry init to fully cleanup driver
Retry at drm_dev_register instead of amdgpu_device_init.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pixel Ding <Pixel.Ding@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:47:18 -05:00
Monk Liu
75bc6099bc drm/amdgpu:read VRAMLOST from gim
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:45 -05:00
pding
0c03b912d7 drm/amdgpu: bypass FB resizing for SRIOV VF
It introduces 900ms latency in exclusive mode which causes failure
of driver loading. Host can resize the BAR before guest staring,
so the resizing is not necessary here.

Signed-off-by: Pixel Ding <Pixel.Ding@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:45 -05:00
pding
c6332b97fa drm/amdgpu: release exclusive mode after hw_init
Signed-off-by: pding <Pixel.Ding@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:44 -05:00
pding
1884734a03 drm/amdkfd: initialise kfd inside amdgpu_device_init
Also finalize kfd inside amdgpu_device_fini. kfd device_init needs
SRIOV exclusive accessing. Try to gather exclusive accessing to
reduce time consuming.

Signed-off-by: pding <Pixel.Ding@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:44 -05:00
Christian König
40575732b6 drm/amdgpu: don't use ttm_bo_move_ttm in amdgpu_ttm_bind v2
Just allocate the GART space and fill it.

This prevents forcing the BO to be idle.

v2: don't unbind/bind at all, just fill the allocated GART space

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:44 -05:00
Christian König
c5835bbb11 drm/amdgpu: rename amdgpu_ttm_bind to amdgpu_ttm_alloc_gart
We actually don't bind here, but rather allocate GART space if necessary.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:43 -05:00
Hawking Zhang
b2b7e457ba drm/amdgpu: switch to use new SOC15 reg read/write macros for soc15 ih
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:43 -05:00
Christian König
d6895ad39f drm/amdgpu: resize VRAM BAR for CPU access v6
Try to resize BAR0 to let CPU access all of VRAM.

v2: rebased, style cleanups, disable mem decode before resize,
    handle gmc_v9 as well, round size up to power of two.
v3: handle gmc_v6 as well, release and reassign all BARs in the driver.
v4: rename new function to amdgpu_device_resize_fb_bar,
    reenable mem decoding only if all resources are assigned.
v5: reorder resource release, return -ENODEV instead of BUG_ON().
v6: squash in rebase fix

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:42 -05:00
Horace Chen
3c7388936a drm/amdgpu: refine SR-IOV firmware VRAM reservation to protect data
The previous solution will create a zero buffer on the system
domain and then move the zeroes to the VRAM. This will break the
original data on the VRAM.

Refine the code to create bo on VRAM domain directly and then remove
and re-create mem node to the exact position before bo_pin. This can
avoid breaking the data and will not cause eviction.

Signed-off-by: Horace Chen <horace.chen@amd.com>
Reviewed-by: monk liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:42 -05:00
pding
5ffa61c1bd drm/amdgpu: retry init if exclusive mode request is failed
This is caused of that hypervisor fails to handle request, one known
issue is MMIO unblocking timeout. In theory we can retry init here.

Signed-off-by: pding <Pixel.Ding@amd.com>
Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:42 -05:00
pding
f47110330c drm/amdgpu: return error when sriov access requests get timeout
Reported-by: Sun Gary <Gary.Sun@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:41 -05:00
Michel Dänzer
8fb0450c94 amdgpu: Remove AMDGPU_{HPD,CRTC_IRQ,PAGEFLIP_IRQ}_LAST
Not used anymore.

Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:40 -05:00
Michel Dänzer
d794b9f827 amdgpu/dce: Use actual number of CRTCs and HPDs in set_irq_funcs
Hardcoding the maximum numbers could result in spurious error messages
from the IRQ state callbacks, e.g. on Polaris 11/12:

[drm:dce_v11_0_set_pageflip_irq_state [amdgpu]] *ERROR* invalid pageflip crtc 5
[drm:amdgpu_irq_disable_all [amdgpu]] *ERROR* error disabling interrupt (-22)

Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:40 -05:00
Christian König
c1c7ce8f56 drm/amdgpu: move GART recovery into GTT manager v2
The GTT manager handles the GART address space anyway, so it is
completely pointless to keep the same information around twice.

v2: rebased

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:33 -05:00
Christian König
3da917b6c6 drm/amdgpu: nuke amdgpu_ttm_is_bound() v2
Rename amdgpu_gtt_mgr_is_allocated() to amdgpu_gtt_mgr_has_gart_addr() and use
that instead.

v2: rename the function as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:32 -05:00
Monk Liu
34a4d2bf06 drm/amdgpu:fix random missing of FLR NOTIFY
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:32 -05:00
Monk Liu
77a3c96b1b drm/amdgpu/sriov:fix memory leak in psp_load_fw
for SR-IOV when doing gpu reset this routine shouldn't do
resource allocating otherwise memory leak

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:31 -05:00
Monk Liu
503846e083 drm/amdgpu:cleanup ucode_init_bo
1,no sriov check since gpu recover is unified
2,need CPU_ACCESS_REQUIRED flag for VRAM if SRIOV
because otherwise after following PIN the first allocated
VRAM bo is wasted due to some TTM mgr reason.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:31 -05:00
Monk Liu
13a752e3a2 drm/amdgpu:cleanup in_sriov_reset and lock_reset
since now gpu reset is unified with gpu_recover
for both bare-metal and SR-IOV:

1)rename in_sriov_reset to in_gpu_reset
2)move lock_reset from adev->virt to adev

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:31 -05:00
Monk Liu
5740682e66 drm/amdgpu:implement new GPU recover(v3)
1,new imple names amdgpu_gpu_recover which gives more hint
on what it does compared with gpu_reset

2,gpu_recover unify bare-metal and SR-IOV, only the asic reset
part is implemented differently

3,gpu_recover will increase hang job karma and mark its entity/context
as guilty if exceeds limit

V2:

4,in scheduler main routine the job from guilty context  will be immedialy
fake signaled after it poped from queue and its fence be set with
"-ECANCELED" error

5,in scheduler recovery routine all jobs from the guilty entity would be
dropped

6,in run_job() routine the real IB submission would be skipped if @skip parameter
equales true or there was VRAM lost occured.

V3:

7,replace deprecated gpu reset, use new gpu recover

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:30 -05:00
Monk Liu
48f05f2955 amd/scheduler:imple job skip feature(v3)
jobs are skipped under two cases
1)when the entity behind this job marked guilty, the job
poped from this entity's queue will be dropped in sched_main loop.

2)in job_recovery(), skip the scheduling job if its karma detected
above limit, and also skipped as well for other jobs sharing the
same fence context. this approach is becuase job_recovery() cannot
access job->entity due to entity may already dead.

v2:
some logic fix

v3:
when entity detected guilty, don't drop the job in the poping
stage, instead set its fence error as -ECANCELED

in run_job(), skip the scheduling either:1) fence->error < 0
or 2) there was a VRAM LOST occurred on this job.
this way we can unify the job skipping logic.

with this feature we can introduce new gpu recover feature.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:30 -05:00
Christian König
3a393cf96a drm/amdgpu: fix indentation in amdgpu_display.h
That was somehow completely of.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:29 -05:00
Rex Zhu
433f1aa786 drm/amdgpu: delete duplicated code.
the variable ref_clock was assigned same
value twice in same function.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:19 -05:00
Rex Zhu
d668942bb8 drm/amdgpu: add new pp function point notify_smu_memory_info
Used to set up smu power logging.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:18 -05:00
Rex Zhu
c79563a316 drm/amdgpu: add header kgd_pp_interface.h
move powerplay and amdgpu shared structures
and definitions to kgd_pp_interface.h.  This
is the interface between the base driver
and powerplay.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:18 -05:00
Rex Zhu
11dc9364bd drm/amdgpu: move struct amd_powerplay to amdgpu.h
Clean up the interface.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:17 -05:00
Christian König
4ff23be3d5 drm/amdgpu: remove extra parameter from amdgpu_ttm_bind() v2
We always use the BO mem now.

v2: minor rebase

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:16 -05:00
Christian König
2a018f28a8 drm/amdgpu: don't wait interruptible while binding GART space
Display can't seem to handle this correctly.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:16 -05:00
Christian König
f5318959b5 drm/amdgpu: fix pin domain compatibility check
We need to test if any domain fits, not all of them.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:16 -05:00
Christian König
ead282a4f5 drm/amdgpu: always bind pinned BOs
We always need to bind pinned BOs, not just when the caller requested the
address.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:15 -05:00
Christian König
5e91fb57eb drm/amdgpu: use the actual placement for pin accounting
This allows us to specify multiple possible placements again.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:15 -05:00
pding
8840a3878d drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)
The exclusive mode has real-time limitation in reality, such like being
done in 300ms. It's easy observed if running many VF/VMs in single host
with heavy CPU workload.

If we find the init fails due to exclusive mode timeout, try it again.

v2:
 - rewrite the condition for readable value.

v3:
 - fix typo, add comments for sleep

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:14 -05:00
pding
b59142384e drm/amdgpu/virt: implement wait_reset callbacks for vi/ai
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:14 -05:00
Evan Quan
7413d2faef drm/amd/powerplay: describe the PCIE link speed in right GT/s
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:14 -05:00
pding
b636176efd drm/amdgpu/virt: add wait_reset virt ops
Driver can use this interface to check if there's a function level
reset done in hypervisor. It's helpful when IRQ handler for reset
is not ready, or special handling is required.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:13 -05:00
pding
a16f8f11c5 drm/amdgpu/virt: add function to check MMIO (v2)
MMIO space can be blocked on virtualised device. Add this
function to check if MMIO is blocked or not.

Todo: need a reliable method such like communation
with hypervisor.

v2:
 - add comments inline

Signed-off-by: pding <Pixel.Ding@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:13 -05:00
pding
1366b2d016 drm/amdgpu: avoid soft lockup when waiting for RLC serdes (v2)
Normally all waiting get timeout if there's one.
Release the lock and return immediately when timeout happens.

v2:
 - set the se_sh to broadcase before return

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:13 -05:00
pding
9953b72f9c drm/amdgpu: change redundant init logs to debug level
When this VF stays in exclusive mode for long, other VFs will be
impacted.

The redundant messages causes exclusive mode timeout when they're
redirected. That is a normal use case for cloud service to redirect
guest log to virtual serial port.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:12 -05:00
Monk Liu
bc1b1bf6e3 drm/amdgpu:implement ctx query2
this query will give flag bits to indicate what happend
on the given context

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:12 -05:00
Monk Liu
668ca1b44d drm/amdgpu:don't change ctx->reset_couner upon query
reset_counter marks the reset counter number once the context
is created, shouldn't be changed due to query.

To keep U/K interface on the ctx_query and keep ctx's reset_counter
logic compatible with GPU RESET feature, now use another var named
"reset_counter_query" to replace the original checked & updated in
amdgpu_ctx_query.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:11 -05:00
Andrey Grodzovsky
a4176cb484 drm/amdgpu: Remove job->s_entity to avoid keeping reference to stale pointer.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:11 -05:00
Monk Liu
a8a51a7041 drm/amdgpu:cleanup job reset routine(v2)
merge the setting guilty on context into this function
to avoid implement extra routine.

v2:
go through entity list and compare the fence_ctx
before operate on the entity, otherwise the entity
may be just a wild pointer

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:10 -05:00
Monk Liu
7716ea564f drm/amdgpu:skip job for guilty ctx in parser_init
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:10 -05:00
Monk Liu
1102900de0 drm/amdgpu:pass ctx->guilty address to entity init
this way the real interested guilty is connected to entity->guilty
pointer, and we can use entity->pointer later in gpu recovery procedure

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:09 -05:00
Monk Liu
b3eebe3d89 drm/amd/scheduler:introduce guilty pointer member
this member will be used later, it will points to
the real var inside of context and CS_SUBMIT & gpu schdduler
can decide if skip a job depends on context->guilty or *entity->guilty

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:09 -05:00
Monk Liu
95aa9b1d97 drm/amdgpu:add hang_limit for sched(v2)
since gpu_scheduler source domain cannot access amdgpu variable
so need create the hang_limit membewr for sched, and it can
refer it for the upcoming GPU RESET patches

v2:
make hang_limit a parameter of sched_init()

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:08 -05:00
Monk Liu
2f9d4084ca drm/amdgpu:cleanup force_completion
cleanups, now only operate on the given ring

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:08 -05:00