Always perform the requested reset, even if we believe the engine is
idle. Presumably there was a reason the caller wanted the reset, and in
the near future we lose the easy tracking for whether the engine is
idle.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190125132230.22221-5-chris@chris-wilson.co.uk
Trim the struct_mutex hold and exclude the call to i915_gem_set_wedged()
as a reminder that it must be callable without struct_mutex held.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190125132230.22221-4-chris@chris-wilson.co.uk
Now that the submission backends are controlled via their own spinlocks,
with a wave of a magic wand we can lift the struct_mutex requirement
around GPU reset. That is we allow the submission frontend (userspace)
to keep on submitting while we process the GPU reset as we can suspend
the backend independently.
The major change is around the backoff/handoff strategy for performing
the reset. With no mutex deadlock, we no longer have to coordinate with
any waiter, and just perform the reset immediately.
Testcase: igt/gem_mmap_gtt/hang # regresses
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190125132230.22221-3-chris@chris-wilson.co.uk
Track the temporary wakerefs used within the selftests so that leaks are
clear.
v2: Add a couple of coarse annotations for mock selftests as we now
loudly warn about the errors.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190114142129.24398-14-chris@chris-wilson.co.uk
The majority of runtime-pm operations are bounded and scoped within a
function; these are easy to verify that the wakeref are handled
correctly. We can employ the compiler to help us, and reduce the number
of wakerefs tracked when debugging, by passing around cookies provided
by the various rpm_get functions to their rpm_put counterpart. This
makes the pairing explicit, and given the required wakeref cookie the
compiler can verify that we pass an initialised value to the rpm_put
(quite handy for double checking error paths).
For regular builds, the compiler should be able to eliminate the unused
local variables and the program growth should be minimal. Fwiw, it came
out as a net improvement as gcc was able to refactor rpm_get and
rpm_get_if_in_use together,
v2: Just s/rpm_put/rpm_put_unchecked/ everywhere, leaving the manual
mark up for smaller more targeted patches.
v3: Mention the cookie in Returns
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190114142129.24398-2-chris@chris-wilson.co.uk
UAPI Changes:
Cross-subsystem Changes:
- Turn dma-buf fence sequence numbers into 64 bit numbers
Core Changes:
- Move to a common helper for the DP MST hotplug for radeon, i915 and
amdgpu
- i2c improvements for drm_dp_mst
- Removal of drm_syncobj_cb
- Introduction of an helper to create and attach the TV margin properties
Driver Changes:
- Improve cache flushes for v3d
- Reflection support for vc4
- HDMI overscan support for vc4
- Add implicit fencing support for rockchip and sun4i
- Switch to generic fbdev emulation for virtio
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCXDOTqAAKCRDj7w1vZxhR
xZ8QAQD4j8m9Ea3bzY5Rr8BYUx1k+Cjj6Y6abZmot2rSvdyOHwD+JzJFIFAPZjdd
uOKhLnDlubaaoa6OGPDQShjl9p3gyQE=
=WQGO
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2019-01-07-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for 5.1:
UAPI Changes:
Cross-subsystem Changes:
- Turn dma-buf fence sequence numbers into 64 bit numbers
Core Changes:
- Move to a common helper for the DP MST hotplug for radeon, i915 and
amdgpu
- i2c improvements for drm_dp_mst
- Removal of drm_syncobj_cb
- Introduction of an helper to create and attach the TV margin properties
Driver Changes:
- Improve cache flushes for v3d
- Reflection support for vc4
- HDMI overscan support for vc4
- Add implicit fencing support for rockchip and sun4i
- Switch to generic fbdev emulation for virtio
Signed-off-by: Dave Airlie <airlied@redhat.com>
[airlied: applied amdgpu merge fixup]
From: Maxime Ripard <maxime.ripard@bootlin.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190107180333.amklwycudbsub3s5@flea
We currently require that our per-engine reset can be called from any
context, even hardirq, and in the future wish to perform the device
reset without holding struct_mutex (which requires some lockless
shenanigans that demand the lowlevel intel_reset_gpu() be able to be
used in atomic context). Test that we meet the current requirements by
calling i915_reset_engine() from under various atomic contexts.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181213091522.2926-2-chris@chris-wilson.co.uk
After declaring a terminally wedged device, we allow ourselves to
recover on the next GPU reset (manually triggered), or resume. Check
that resetting a wedged device does work.
v2: Add rpm (taken explicitly in the subtest in case we remove the outer
wakeref) and early warning to i915_reset() for missed wakerefs
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181213091522.2926-1-chris@chris-wilson.co.uk
Many errs of the form:
drivers/gpu/drm/i915/selftests/intel_hangcheck.c: In function ‘__igt_reset_evict_vma’:
./include/linux/kern_levels.h:5:18: error: format ‘%x’ expects argument of type ‘unsigned int’, but argum
Fixes: b312d8ca3a ("dma-buf: make fence sequence numbers 64 bit v2")
Cc: Christian König <christian.koenig@amd.com>
Cc: Chunming Zhou <david1.zhou@amd.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20181207123428.16257-1-mika.kuoppala@linux.intel.com
Impose a restraint that we have all vma pinned for a request prior to
its allocation. This is to simplify request construction, and should
facilitate unravelling the lock interdependencies later.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181204141522.13640-3-chris@chris-wilson.co.uk
Two simple selftests which test that both GT and engine workarounds are
not lost after either a full GPU reset, or after the per-engine ones.
(Including checks that one engine reset is not affecting workarounds not
belonging to itself.)
v2:
* Rebase for series refactoring.
* Add spinner for actual engine reset!
* Add idle reset test as well. (Chris Wilson)
* Share existing global_reset_lock. (Chris Wilson)
v3:
* intel_engine_verify_workarounds can be static.
* API rename. (Chris Wilson)
* Move global reset lock out of the loop. (Chris Wilson)
v4:
* Add missing rpm puts. (Chris Wilson)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20181203125014.3219-5-tvrtko.ursulin@linux.intel.com
On older HW, gen2/3, fence registers are used for detiling GPU commands
and as such changing those registers requires serialisation with the
requests on the GPU. Anything running on the GPU is subject to a hang,
and so we must be able to recover cleanly in the middle of a stuck wait
on a fence register.
We can simulate using the fence on the GPU simply by marking the fence
as active on the request for this vma, the interface being common to all
gen, thus broadening the test.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180719194746.19111-2-chris@chris-wilson.co.uk
To test eviction from a ppgtt, we just want a ppgtt i.e. something other
than the Global GTT which is shared and used by the kernel for HW
features like fencing and scanout. However, we also need it to pass
!i915_is_ggtt() and the simplest way is to emulate a full user context
rather than the internal kernel context that is used for the GGTT.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180719194746.19111-1-chris@chris-wilson.co.uk
We must be able to reset the GPU while we are waiting on it to perform
an eviction (unbinding an active vma). So attach a spinning request to a
target vma and try and it evict it from a thread to see if that blocks
indefinitely.
v2: Add a wait for the thread to start just in case that takes more than
10ms...
v3: complete() not completion_done() to signal the completion.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180716134009.13143-1-chris@chris-wilson.co.uk
Handling such a late error in request construction is tricky, but to
accommodate future patches which may allocate here, we potentially could
err. To handle the error after already adjusting global state to track
the new request, we must finish and submit the request. But we don't
want to use the request as not everything is being tracked by it, so we
opt to cancel the commands inside the request.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180706103947.15919-3-chris@chris-wilson.co.uk
Replace the magic bit with the proper symbolic name for instructing
MI_STORE_DWORD_IMM to use a virtual address (on gen3) or the global GTT
address (still virtual!) on gen4+.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180706142323.25699-1-chris@chris-wilson.co.uk
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
If the GPU is irrecoverably wedged on startup, it means that it failed
on initialisation and we have already tried to reset it but failed. We
can ignore all further testing, as it is already dead. Failing early,
prevents us from slowly failing in our endeavours later and timing out.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180705150214.28316-1-chris@chris-wilson.co.uk
For symmetry, simplicity and ensuring the request is always truly idle
upon its completion, always emit the closing flush prior to emitting the
request breadcrumb. Previously, we would only emit the flush if we had
started a user batch, but this just leaves all the other paths open to
speculation (do they affect the GPU caches or not?) With mm switching, a
key requirement is that the GPU is flushed and invalidated before hand,
so for absolute safety, we want that closing flush be mandatory.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180612105135.4459-1-chris@chris-wilson.co.uk
In the near future, I want to subclass gen6_hw_ppgtt as it contains a
few specialised members and I wish to add more. To avoid the ugliness of
using ppgtt->base.base, rename the i915_hw_ppgtt base member
(i915_address_space) as vm, which is our common shorthand for an
i915_address_space local.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180605153758.18422-1-chris@chris-wilson.co.uk
When testing reset, we wait for 1s on the main thread for the hang to
start. Meanwhile, we continue submitting requests on all the background
threads, and we may have more threads than cores and so potentially
starve the waiter from being woken within the timeout. As the hang
timeout and the active timeouts are the same, it is hard to distinguish
which caused the timeout. Bump the active thread timeouts to 5s,
compared to the 1s timeout for the hang, so that we preferentially
report the hang timing out, while hopefully ensuring that we do at least
wake up the hang thread first before declaring the background active
timeout.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180517142442.16979-1-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
In the next patch, we want to store the intel_context pointer inside
i915_request, as it is frequently access via a convoluted dance when
submitting the request to hw. Having two context pointers inside
i915_request leads to confusion so first rename the existing
i915_gem_context pointer to i915_request.gem_context.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180517212633.24934-1-chris@chris-wilson.co.uk
Even though we weren't injecting guilty requests to be reset, we could
still fall over the issue of resetting the same request too fast -- where
the GPU refuses to start again. (Although it is interesting to note that
reloading the driver is sufficient, suggesting that we could recover if
we delayed the setup after reset?) Continue to paper over the problem by
adding a small delay by waiting for the engine to idle between tests,
and ensure that the engines are idle before starting the idle tests.
v2: Replace single instance of 50 with a magic macro.
References: 028666793a ("drm/i915/selftests: Avoid repeatedly harming the same innocent context")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180411120346.27618-1-chris@chris-wilson.co.uk
Today we only want to pass along the priority to engine->schedule(), but
in the future we want to have much more control over the various aspects
of the GPU during a context's execution, for example controlling the
frequency allowed. As we need an ever growing number of parameters for
scheduling, move those into a struct for convenience.
v2: Move the anonymous struct into its own function for legibility and
ye olde gcc.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180418184052.7129-3-chris@chris-wilson.co.uk
Currently, we rely on inspecting the hangcheck state from within the
i915_reset() routines to determine which engines were guilty of the
hang. This is problematic for cases where we want to run
i915_handle_error() and call i915_reset() independently of hangcheck.
Instead of relying on the indirect parameter passing, turn it into an
explicit parameter providing the set of stalled engines which then are
treated as guilty until proven innocent.
While we are removing the implicit stalled parameter, also make the
reason into an explicit parameter to i915_reset(). We still need a
back-channel for i915_handle_error() to hand over the task to the locked
waiter, but let's keep that its own channel rather than incriminate
another.
This leaves stalled/seqno as being private to hangcheck, with no more
nefarious snooping by reset, be it whole-device or per-engine. \o/
The only real issue now is that this makes it crystal clear that we
don't actually do any testing of hangcheck per se in
drv_selftest/live_hangcheck, merely of resets!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180406220354.18911-2-chris@chris-wilson.co.uk
If we are resetting just one engine, we know it has stalled. So we can
pass the stalled parameter directly to i915_gem_reset_engine(), which
alleviates the necessity to poke at the generic engine->hangcheck.stalled
magic variable, leaving that under control of hangcheck as its name
implies. Other than simplifying by removing the indirect parameter along
this path, this allows us to introduce new reset mechanisms that run
independently of hangcheck.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180406220354.18911-1-chris@chris-wilson.co.uk
Tvrtko mentioned that wait_for_hang() was confusing as it does not
actually wait for the aforementioned hang, just until the request is
running and we are *ready* to inject a hang. A quick
s/wait_for_hang/wait_until_running/ removes that confusion without
having to rethink the naming scheme, immediately at least.
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180406100950.19033-1-chris@chris-wilson.co.uk
We don't handle resetting the kernel context very well, or presumably any
context executing its breadcrumb commands in the ring as opposed to the
batchbuffer and flush. If we trigger a device reset twice in quick
succession while the kernel context is executing, we may end up skipping
the breadcrumb. This is really only a problem for the selftest as
normally there is a large interlude between resets (hangcheck), or we
focus on resetting just one engine and so avoid repeatedly resetting
innocents.
Something to try would be a preempt-to-idle to quiesce the engine before
reset, so that innocent contexts would be spared the reset.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
CC: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180330131801.18327-1-chris@chris-wilson.co.uk
Watch what happens if we try to reset with a queue of requests with
varying priorities -- that may need reordering or preemption across the
reset.
v2: Tweak priorities to avoid starving the hanging thread.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180322073533.5313-2-chris@chris-wilson.co.uk
Reviewed-by: Jeff McGee <jeff.mcgee@intel.com>
If we fail to reset the GPU in a timely fashion, dump the GEM trace so
that we can see what operations were in flight when the GPU got stuck.
v2: There's more than one timeout that deserves tracing!
v3: Silence checkpatch by not even using a product at all!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Jeff McGee <jeff.mcgee@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180322074908.10838-1-chris@chris-wilson.co.uk
Not all callers want the GPU error to handled in the same way, so expose
a control parameter. In the first instance, some callers do not want the
heavyweight error capture so add a bit to request the state to be
captured and saved.
v2: Pass msg down to i915_reset/i915_reset_engine so that we include the
reason for the reset in the dev_notice(), superseding the earlier option
to not print that notice.
v3: Stash the reason inside the i915->gpu_error to handover to the direct
reset from the blocking waiter.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180320100449.1360-2-chris@chris-wilson.co.uk
Mostly doc/print messages that were not updated after commit e61e0f51ba
("drm/i915: Rename drm_i915_gem_request to i915_request").
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180222172405.11386-1-michel.thierry@intel.com
We want to de-emphasize the link between the request (dependency,
execution and fence tracking) from GEM and so rename the struct from
drm_i915_gem_request to i915_request. That is we may implement the GEM
user interface on top of requests, but they are an abstraction for
tracking execution rather than an implementation detail of GEM. (Since
they are not tied to HW, we keep the i915 prefix as opposed to intel.)
In short, the spatch:
@@
@@
- struct drm_i915_gem_request
+ struct i915_request
A corollary to contracting the type name, we also harmonise on using
'rq' shorthand for local variables where space if of the essence and
repetition makes 'request' unwieldy. For globals and struct members,
'request' is still much preferred for its clarity.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180221095636.6649-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Avoid injecting hangs in to the i915->kernel_context in case the GPU
reset leaves corruption in the context image in its wake (leading to
continual failures and system hangs after the selftests are ostensibly
complete). Use a sacrificial kernel_context instead.
v2: Closing a context is tricky; export a function (for selftests) from
i915_gem_context.c to get it right.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-2-chris@chris-wilson.co.uk
When injecting rapid resets, we must be careful to at least wait for the
previous reset to have taken effect and the engine restarted. If we
perform a second reset before that has happened, we will notice that the
engine hasn't recovered and declare it lost, wedging the device and
failing. In practice, since we wait for each hanging batch to start
before injecting the reset, this too-fast-reset condition can only be
triggered when moving onto the next engine in the test, so we need only
wait for the existing reset to complete before switching engines.
v2: Wrap up the wait inside a safety net to bail out in case of angry hw.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-1-chris@chris-wilson.co.uk
Now that we skip a per-engine reset on an idle engine, we need to update
the selftest to take that into account. In the process, we find that we
were not stressing the per-engine reset very hard, so add those missing
active resets.
v2: Actually test i915_reset_engine() by loading it with requests.
Fixes: f6ba181ada ("drm/i915: Skip an engine reset if it recovered before our preparations")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104313
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171217132852.30642-3-chris@chris-wilson.co.uk
Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Pass in a format string (and args) to specify the header to be emitted
along with the engine state when pretty-printing. This allows the header
to be emitted inside the drm_printer stream, so sharing the same prefix
and output characteristics (e.g. debug level and filtering).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171208012303.25504-2-chris@chris-wilson.co.uk
During request construction, after pinning the context we know whether
or not we have to emit a context switch. So move this common operation
from every caller into i915_gem_request_alloc() itself.
v2: Always submit the request if we emitted some commands during request
construction, as typically it also involves changes in global state.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171120102002.22254-2-chris@chris-wilson.co.uk
As the request will, in the following patch, implicitly invoke a
context-switch on construction, we should precede that with a GPU TLB
invalidation. Also, even before using GGTT, we always want to invalidate
the TLBs for any updates (as well as the ppgtt invalidates that are
unconditionally applied by execbuf). Since we almost always require the
TLB invalidate, do it unconditionally on request allocation and so we can
remove it from all other paths.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171120102002.22254-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
The lowlevel reset functions expect the caller to be holding the rpm
wakeref for the device access across the reset. We were not explicitly
doing this in the sefltest, so for simplicity acquire the wakeref for
the duration of all subtests.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171009110301.21705-4-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
During hangcheck testing, we try to execute requests following the GPU
reset, and in particular want to try and debug when those fail.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171009110301.21705-2-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Currently, we are being fairly lazy and only using a wmb() following an
update to an active batch. Previously, we have found that to be
insufficient to ensure that a write from the CPU reaches memory in a
timely fashion, and in some caches we may need to flush a chipset cache.
To that end, we have i915_gem_chipset_flush() so use it.
Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170926153409.7928-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
If we see the seqno stop progressing, we abandon the test for fear that
the GPU died following the reset. However, during test teardown we still
wait for the GPU to idle before continuing, but we have already
confirmed that the GPU is dead. Furthermore, since we are inside a reset
test, we have disabled the hangchecker, and so there is no safety net and
we wait indefinitely. Detect the stuck GPU and declare it wedged as a
state of emergency so we can escape.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170915130929.18892-1-chris@chris-wilson.co.uk
Tested-by: Jari Tahvanainen <jari.tahvanainen@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
MI_STORE_DWORD_IMM just doesn't work on the video decode engine under
Sandybridge, so refrain from using it. Then switch the selftests over to
using the now common test prior to using MI_STORE_DWORD_IMM.
Fixes: 7dd4f6729f ("drm/i915: Async GPU relocation processing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.13-rc1+
Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>