/*
 * Copyright © 2016 Intel Corporation
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
 * IN THE SOFTWARE.
 *
 */

#include <linux/prime_numbers.h>

#include "gem/i915_gem_pm.h"
#include "gem/selftests/mock_context.h"

#include "gt/intel_gt.h"

#include "i915_random.h"
#include "i915_selftest.h"
#include "igt_live_test.h"
#include "lib_sw_fence.h"

#include "mock_drm.h"
#include "mock_gem_device.h"

static int igt_add_request(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct i915_request *request;

	/* Basic preliminary test to create a request and let it loose! */

	request = mock_request(i915->engine[RCS0]->kernel_context, HZ / 10);
	if (!request)
		return -ENOMEM;

	i915_request_add(request);

	return 0;
}

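/*
 * Check i915_request_wait() timeouts: every wait must time out before the
 * request is submitted, and succeed only once the mock delay has expired
 * and the request has actually completed.
 */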
static int igt_wait_request(void *arg)
{
	const long T = HZ / 4;
	struct drm_i915_private *i915 = arg;
	struct i915_request *request;
	int err = -EINVAL;

	/* Submit a request, then wait upon it */

	request = mock_request(i915->engine[RCS0]->kernel_context, T);
	if (!request)
		return -ENOMEM;

	i915_request_get(request);

	if (i915_request_wait(request, 0, 0) != -ETIME) {
		pr_err("request wait (busy query) succeeded (expected timeout before submit!)\n");
		goto out_request;
	}

	if (i915_request_wait(request, 0, T) != -ETIME) {
		pr_err("request wait succeeded (expected timeout before submit!)\n");
		goto out_request;
	}

	if (i915_request_completed(request)) {
		pr_err("request completed before submit!!\n");
		goto out_request;
	}

	i915_request_add(request);

	if (i915_request_wait(request, 0, 0) != -ETIME) {
		pr_err("request wait (busy query) succeeded (expected timeout after submit!)\n");
		goto out_request;
	}

	if (i915_request_completed(request)) {
		pr_err("request completed immediately!\n");
		goto out_request;
	}

	if (i915_request_wait(request, 0, T / 2) != -ETIME) {
		pr_err("request wait succeeded (expected timeout!)\n");
		goto out_request;
	}

	if (i915_request_wait(request, 0, T) == -ETIME) {
		pr_err("request wait timed out!\n");
		goto out_request;
	}

	if (!i915_request_completed(request)) {
		pr_err("request not complete after waiting!\n");
		goto out_request;
	}

	if (i915_request_wait(request, 0, T) == -ETIME) {
		pr_err("request wait timed out when already complete!\n");
		goto out_request;
	}

	err = 0;
out_request:
	i915_request_put(request);
	mock_device_flush(i915);
	return err;
}

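/*
 * The same timeout checks as igt_wait_request, but exercised through the
 * exported dma_fence interface (dma_fence_wait_timeout and
 * dma_fence_is_signaled) rather than i915_request_wait().
 */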
static int igt_fence_wait(void *arg)
{
	const long T = HZ / 4;
	struct drm_i915_private *i915 = arg;
	struct i915_request *request;
	int err = -EINVAL;

	/* Submit a request, treat it as a fence and wait upon it */

	request = mock_request(i915->engine[RCS0]->kernel_context, T);
	if (!request)
		return -ENOMEM;

	if (dma_fence_wait_timeout(&request->fence, false, T) != -ETIME) {
		pr_err("fence wait success before submit (expected timeout)!\n");
		goto out;
	}

	i915_request_add(request);

	if (dma_fence_is_signaled(&request->fence)) {
		pr_err("fence signaled immediately!\n");
		goto out;
	}

	if (dma_fence_wait_timeout(&request->fence, false, T / 2) != -ETIME) {
		pr_err("fence wait success after submit (expected timeout)!\n");
		goto out;
	}

	if (dma_fence_wait_timeout(&request->fence, false, T) <= 0) {
		pr_err("fence wait timed out (expected success)!\n");
		goto out;
	}

	if (!dma_fence_is_signaled(&request->fence)) {
		pr_err("fence unsignaled after waiting!\n");
		goto out;
	}

	if (dma_fence_wait_timeout(&request->fence, false, T) <= 0) {
		pr_err("fence wait timed out when complete (expected success)!\n");
		goto out;
	}

	err = 0;
out:
	mock_device_flush(i915);
	return err;
}

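/*
 * Queue a slow request on context A followed by a no-delay "vip" request
 * on context B, then manually reorder them to simulate preemption: the
 * vip request must complete while the original request is still pending.
 */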
static int igt_request_rewind(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct i915_request *request, *vip;
	struct i915_gem_context *ctx[2];
	struct intel_context *ce;
	int err = -EINVAL;

	ctx[0] = mock_context(i915, "A");

	ce = i915_gem_context_get_engine(ctx[0], RCS0);
	GEM_BUG_ON(IS_ERR(ce));
	request = mock_request(ce, 2 * HZ);
	intel_context_put(ce);
	if (!request) {
		err = -ENOMEM;
		goto err_context_0;
	}

	i915_request_get(request);
	i915_request_add(request);

	ctx[1] = mock_context(i915, "B");

	ce = i915_gem_context_get_engine(ctx[1], RCS0);
	GEM_BUG_ON(IS_ERR(ce));
	vip = mock_request(ce, 0);
	intel_context_put(ce);
	if (!vip) {
		err = -ENOMEM;
		goto err_context_1;
	}

	/* Simulate preemption by manual reordering */
	if (!mock_cancel_request(request)) {
		pr_err("failed to cancel request (already executed)!\n");
		i915_request_add(vip);
		goto err_context_1;
	}
	i915_request_get(vip);
	i915_request_add(vip);

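	/*
	 * Re-submit the cancelled request by hand. engine->submit_request()
	 * may be replaced with a nop handler under RCU when the device is
	 * wedged, so the call is bracketed by rcu_read_lock()/unlock().
	 */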
rcu_read_lock();
request->engine->submit_request(request);
rcu_read_unlock();

	if (i915_request_wait(vip, 0, HZ) == -ETIME) {
		pr_err("timed out waiting for high priority request\n");
		goto err;
	}

	if (i915_request_completed(request)) {
		pr_err("low priority request already completed\n");
		goto err;
	}

	err = 0;
err:
	i915_request_put(vip);
err_context_1:
	mock_context_close(ctx[1]);
	i915_request_put(request);
err_context_0:
	mock_context_close(ctx[0]);
	mock_device_flush(i915);
	return err;
}

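/*
 * Parameters and bookkeeping for one breadcrumbs smoketest thread: the
 * engine and contexts to submit against, the request allocator to use
 * (mock or live), and counters for the waits and fences exercised.
 */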
struct smoketest {
	struct intel_engine_cs *engine;
	struct i915_gem_context **contexts;
	atomic_long_t num_waits, num_fences;
	int ncontexts, max_batch;
	struct i915_request *(*request_alloc)(struct intel_context *ce);
};

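/* request_alloc callbacks for the smoketest: mock engine vs. live engine. */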
static struct i915_request *
__mock_request_alloc(struct intel_context *ce)
{
	return mock_request(ce, 0);
}

static struct i915_request *
__live_request_alloc(struct intel_context *ce)
{
	return intel_context_create_request(ce);
}

static int __igt_breadcrumbs_smoketest(void *arg)
{
	struct smoketest *t = arg;
	const unsigned int max_batch = min(t->ncontexts, t->max_batch) - 1;
	const unsigned int total = 4 * t->ncontexts + 1;
	unsigned int num_waits = 0, num_fences = 0;
	struct i915_request **requests;
	I915_RND_STATE(prng);
	unsigned int *order;
	int err = 0;

	/*
	 * A very simple test to catch the most egregious of list handling bugs.
	 *
	 * At its heart, we simply create oodles of requests running across
	 * multiple kthreads and enable signaling on them, for the sole purpose
	 * of stressing our breadcrumb handling. The only inspection we do is
	 * that the fences were marked as signaled.
	 */

	requests = kmalloc_array(total, sizeof(*requests), GFP_KERNEL);
	if (!requests)
		return -ENOMEM;

	order = i915_random_order(total, &prng);
	if (!order) {
		err = -ENOMEM;
		goto out_requests;
	}

	while (!kthread_should_stop()) {
		struct i915_sw_fence *submit, *wait;
		unsigned int n, count;

		submit = heap_fence_create(GFP_KERNEL);
		if (!submit) {
			err = -ENOMEM;
			break;
		}

		wait = heap_fence_create(GFP_KERNEL);
		if (!wait) {
			i915_sw_fence_commit(submit);
			heap_fence_put(submit);
			err = -ENOMEM;
			break;
		}

		i915_random_reorder(order, total, &prng);
		count = 1 + i915_prandom_u32_max_state(max_batch, &prng);

		for (n = 0; n < count; n++) {
			struct i915_gem_context *ctx =
				t->contexts[order[n] % t->ncontexts];
			struct i915_request *rq;
			struct intel_context *ce;

			ce = i915_gem_context_get_engine(ctx, t->engine->legacy_idx);
			GEM_BUG_ON(IS_ERR(ce));
			rq = t->request_alloc(ce);
			intel_context_put(ce);
			if (IS_ERR(rq)) {
				err = PTR_ERR(rq);
				count = n;
				break;
			}

			err = i915_sw_fence_await_sw_fence_gfp(&rq->submit,
							       submit,
							       GFP_KERNEL);

			requests[n] = i915_request_get(rq);
			i915_request_add(rq);

			if (err >= 0)
				err = i915_sw_fence_await_dma_fence(wait,
								    &rq->fence,
								    0,
								    GFP_KERNEL);

			if (err < 0) {
				i915_request_put(rq);
				count = n;
				break;
			}
		}

		i915_sw_fence_commit(submit);
		i915_sw_fence_commit(wait);

		if (!wait_event_timeout(wait->wait,
					i915_sw_fence_done(wait),
					5 * HZ)) {
			struct i915_request *rq = requests[count - 1];

			pr_err("waiting for %d/%d fences (last %llx:%lld) on %s timed out!\n",
			       atomic_read(&wait->pending), count,
			       rq->fence.context, rq->fence.seqno,
			       t->engine->name);
			GEM_TRACE_DUMP();

			intel_gt_set_wedged(t->engine->gt);
			GEM_BUG_ON(!i915_request_completed(rq));
			i915_sw_fence_wait(wait);
			err = -EIO;
		}

		for (n = 0; n < count; n++) {
			struct i915_request *rq = requests[n];

			if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
				      &rq->fence.flags)) {
				pr_err("%llu:%llu was not signaled!\n",
				       rq->fence.context, rq->fence.seqno);
				err = -EINVAL;
			}

			i915_request_put(rq);
		}

		heap_fence_put(wait);
		heap_fence_put(submit);

		if (err < 0)
			break;

		num_fences += count;
		num_waits++;

		cond_resched();
	}

	atomic_long_add(num_fences, &t->num_fences);
	atomic_long_add(num_waits, &t->num_waits);

	kfree(order);
out_requests:
	kfree(requests);
	return err;
}

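/*
 * Illustrative sketch only (hypothetical helper, not part of the selftests):
 * the submit/wait gating pattern used in __igt_breadcrumbs_smoketest() above,
 * reduced to a single request. The request is plugged on a 'submit' sw_fence
 * until we commit it, and a second 'wait' sw_fence fires once the request's
 * dma-fence has signaled. Assumes the same heap_fence helpers and an
 * intel_context supplied by the caller.
 */
static int __maybe_unused sketch_gated_request(struct intel_context *ce)
{
	struct i915_sw_fence *submit, *wait;
	struct i915_request *rq;
	int err = 0;

	submit = heap_fence_create(GFP_KERNEL);
	if (!submit)
		return -ENOMEM;

	wait = heap_fence_create(GFP_KERNEL);
	if (!wait) {
		i915_sw_fence_commit(submit);
		heap_fence_put(submit);
		return -ENOMEM;
	}

	rq = intel_context_create_request(ce);
	if (IS_ERR(rq)) {
		err = PTR_ERR(rq);
		goto out_commit;
	}

	/* Hold back submission until the plug fence is committed. */
	err = i915_sw_fence_await_sw_fence_gfp(&rq->submit, submit, GFP_KERNEL);

	i915_request_get(rq);
	i915_request_add(rq);

	/* Have 'wait' fire once the request's dma-fence has signaled. */
	if (err >= 0)
		err = i915_sw_fence_await_dma_fence(wait, &rq->fence, 0, GFP_KERNEL);

out_commit:
	i915_sw_fence_commit(submit);
	i915_sw_fence_commit(wait);

	if (!IS_ERR(rq)) {
		if (err >= 0 &&
		    !wait_event_timeout(wait->wait,
					i915_sw_fence_done(wait), 5 * HZ))
			err = -ETIME;
		i915_request_put(rq);
	}

	heap_fence_put(wait);
	heap_fence_put(submit);
	return err < 0 ? err : 0;
}
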
static int mock_breadcrumbs_smoketest(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct smoketest t = {
		.engine = i915->engine[RCS0],
		.ncontexts = 1024,
		.max_batch = 1024,
		.request_alloc = __mock_request_alloc
	};
	unsigned int ncpus = num_online_cpus();
	struct task_struct **threads;
	unsigned int n;
	int ret = 0;

	/*
	 * Smoketest our breadcrumb/signal handling for requests across multiple
	 * threads. A very simple test to only catch the most egregious of bugs.
	 * See __igt_breadcrumbs_smoketest();
	 */

	threads = kmalloc_array(ncpus, sizeof(*threads), GFP_KERNEL);
	if (!threads)
		return -ENOMEM;

	t.contexts =
		kmalloc_array(t.ncontexts, sizeof(*t.contexts), GFP_KERNEL);
	if (!t.contexts) {
		ret = -ENOMEM;
		goto out_threads;
	}

	for (n = 0; n < t.ncontexts; n++) {
		t.contexts[n] = mock_context(t.engine->i915, "mock");
		if (!t.contexts[n]) {
			ret = -ENOMEM;
			goto out_contexts;
		}
	}

	for (n = 0; n < ncpus; n++) {
		threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
					 &t, "igt/%d", n);
		if (IS_ERR(threads[n])) {
			ret = PTR_ERR(threads[n]);
			ncpus = n;
			break;
		}

		get_task_struct(threads[n]);
	}

	msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));

	for (n = 0; n < ncpus; n++) {
		int err;

		err = kthread_stop(threads[n]);
		if (err < 0 && !ret)
			ret = err;

		put_task_struct(threads[n]);
	}
	pr_info("Completed %lu waits for %lu fences across %d cpus\n",
		atomic_long_read(&t.num_waits),
		atomic_long_read(&t.num_fences),
		ncpus);

out_contexts:
	for (n = 0; n < t.ncontexts; n++) {
		if (!t.contexts[n])
			break;
		mock_context_close(t.contexts[n]);
	}
	kfree(t.contexts);
out_threads:
	kfree(threads);
	return ret;
}

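/*
 * Sketch only (hypothetical helper): the per-CPU kthread herd used by
 * mock_breadcrumbs_smoketest() above, shown in isolation. One kthread per
 * online CPU hammers a shared smoketest for the selftest timeout, then the
 * threads are reaped and the first error is reported. Assumes the caller has
 * already populated the smoketest's engine, contexts and request_alloc hook.
 */
static int __maybe_unused sketch_run_smoketest_herd(struct smoketest *t)
{
	unsigned int ncpus = num_online_cpus();
	struct task_struct **threads;
	unsigned int n;
	int ret = 0;

	threads = kmalloc_array(ncpus, sizeof(*threads), GFP_KERNEL);
	if (!threads)
		return -ENOMEM;

	for (n = 0; n < ncpus; n++) {
		threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
					 t, "igt/%d", n);
		if (IS_ERR(threads[n])) {
			ret = PTR_ERR(threads[n]);
			ncpus = n;
			break;
		}

		get_task_struct(threads[n]);
	}

	/* Let the herd run for the selftest timeout... */
	msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));

	/* ...then reap every thread, keeping the first error seen. */
	for (n = 0; n < ncpus; n++) {
		int err = kthread_stop(threads[n]);

		if (err < 0 && !ret)
			ret = err;
		put_task_struct(threads[n]);
	}

	kfree(threads);
	return ret;
}
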
int i915_request_mock_selftests(void)
{
	static const struct i915_subtest tests[] = {
		SUBTEST(igt_add_request),
		SUBTEST(igt_wait_request),
		SUBTEST(igt_fence_wait),
		SUBTEST(igt_request_rewind),
		SUBTEST(mock_breadcrumbs_smoketest),
	};
	struct drm_i915_private *i915;
	intel_wakeref_t wakeref;
	int err = 0;

	i915 = mock_gem_device();
	if (!i915)
		return -ENOMEM;

	with_intel_runtime_pm(&i915->runtime_pm, wakeref)
		err = i915_subtests(tests, i915);

	drm_dev_put(&i915->drm);

	return err;
}

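/*
 * Sketch only (hypothetical helper): roughly what the with_intel_runtime_pm()
 * scope above boils down to, assuming the intel_runtime_pm_get()/put()
 * helpers. The wakeref is held across the callback so the device cannot
 * runtime-suspend while the subtests run; the real macro additionally scopes
 * the wakeref to the statement that follows it.
 */
static int __maybe_unused
sketch_subtests_with_rpm(struct drm_i915_private *i915,
			 int (*run)(struct drm_i915_private *i915))
{
	intel_wakeref_t wakeref;
	int err;

	wakeref = intel_runtime_pm_get(&i915->runtime_pm);
	err = run(i915);
	intel_runtime_pm_put(&i915->runtime_pm, wakeref);

	return err;
}
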
static int live_nop_request(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct intel_engine_cs *engine;
	struct igt_live_test t;
	unsigned int id;
	int err = -ENODEV;

	/* Submit various sized batches of empty requests, to each engine
	 * (individually), and wait for the batch to complete. We can check
	 * the overhead of submitting requests to the hardware.
	 */

	for_each_engine(engine, i915, id) {
		unsigned long n, prime;
		IGT_TIMEOUT(end_time);
		ktime_t times[2] = {};

		err = igt_live_test_begin(&t, i915, __func__, engine->name);
		if (err)
			return err;

		for_each_prime_number_from(prime, 1, 8192) {
			struct i915_request *request = NULL;

			times[1] = ktime_get_raw();

			for (n = 0; n < prime; n++) {
				i915_request_put(request);
				request = i915_request_create(engine->kernel_context);
				if (IS_ERR(request))
					return PTR_ERR(request);

				/* This space is left intentionally blank.
				 *
				 * We do not actually want to perform any
				 * action with this request, we just want
				 * to measure the latency in allocation
				 * and submission of our breadcrumbs -
				 * ensuring that the bare request is sufficient
				 * for the system to work (i.e. proper HEAD
				 * tracking of the rings, interrupt handling,
				 * etc). It also gives us the lowest bounds
				 * for latency.
				 */

				i915_request_get(request);
				i915_request_add(request);
			}
			i915_request_wait(request, 0, MAX_SCHEDULE_TIMEOUT);
			i915_request_put(request);

			times[1] = ktime_sub(ktime_get_raw(), times[1]);
			if (prime == 1)
				times[0] = times[1];

			if (__igt_timeout(end_time, NULL))
				break;
		}

		err = igt_live_test_end(&t);
		if (err)
			return err;

		pr_info("Request latencies on %s: 1 = %lluns, %lu = %lluns\n",
			engine->name,
			ktime_to_ns(times[0]),
			prime, div64_u64(ktime_to_ns(times[1]), prime));
	}

	return err;
}

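/*
 * Sketch only (hypothetical helper): how the per-request figure printed by
 * live_nop_request() above is derived. times[0] is the latency of a single
 * nop request; times[1] is the wall time for the largest prime-sized batch,
 * so the amortised submission cost is the batch time divided by its size,
 * using the same div64_u64 rounding as the pr_info() above.
 */
static u64 __maybe_unused sketch_ns_per_request(ktime_t batch_time,
						unsigned long batch_size)
{
	/* Amortised cost of allocating + submitting one empty request. */
	return div64_u64(ktime_to_ns(batch_time), batch_size);
}
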
static struct i915_vma *empty_batch(struct drm_i915_private *i915)
{
	struct drm_i915_gem_object *obj;
	struct i915_vma *vma;
	u32 *cmd;
	int err;

	obj = i915_gem_object_create_internal(i915, PAGE_SIZE);
	if (IS_ERR(obj))
		return ERR_CAST(obj);

	cmd = i915_gem_object_pin_map(obj, I915_MAP_WB);
	if (IS_ERR(cmd)) {
		err = PTR_ERR(cmd);
		goto err;
	}

	*cmd = MI_BATCH_BUFFER_END;

	__i915_gem_object_flush_map(obj, 0, 64);
	i915_gem_object_unpin_map(obj);

	intel_gt_chipset_flush(&i915->gt);

	vma = i915_vma_instance(obj, &i915->ggtt.vm, NULL);
	if (IS_ERR(vma)) {
		err = PTR_ERR(vma);
		goto err;
	}

	err = i915_vma_pin(vma, 0, 0, PIN_USER | PIN_GLOBAL);
	if (err)
		goto err;

	/* Force the wait now to avoid including it in the benchmark */
	err = i915_vma_sync(vma);
	if (err)
		goto err_pin;

	return vma;

err_pin:
	i915_vma_unpin(vma);
err:
	i915_gem_object_put(obj);
	return ERR_PTR(err);
}

static struct i915_request *
empty_request(struct intel_engine_cs *engine,
	      struct i915_vma *batch)
{
	struct i915_request *request;
	int err;

	request = i915_request_create(engine->kernel_context);
	if (IS_ERR(request))
		return request;

	err = engine->emit_bb_start(request,
				    batch->node.start,
				    batch->node.size,
				    I915_DISPATCH_SECURE);
	if (err)
		goto out_request;

	i915_request_get(request);
out_request:
	i915_request_add(request);
	return err ? ERR_PTR(err) : request;
}

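/*
 * Usage sketch only (hypothetical helper, not called by the tests): time a
 * single round trip through empty_request(), mirroring the warmup step in
 * live_empty_request() below. Assumes a pinned batch returned by
 * empty_batch(); empty_request() hands back a request reference for us to
 * wait on and release.
 */
static int __maybe_unused
sketch_time_one_empty_request(struct intel_engine_cs *engine,
			      struct i915_vma *batch)
{
	struct i915_request *rq;
	ktime_t start = ktime_get_raw();
	long timeout;

	rq = empty_request(engine, batch);
	if (IS_ERR(rq))
		return PTR_ERR(rq);

	timeout = i915_request_wait(rq, 0, MAX_SCHEDULE_TIMEOUT);
	i915_request_put(rq);
	if (timeout < 0)
		return timeout;

	pr_info("%s: single empty batch completed in %lluns\n",
		engine->name,
		ktime_to_ns(ktime_sub(ktime_get_raw(), start)));
	return 0;
}
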
static int live_empty_request(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct intel_engine_cs *engine;
	struct igt_live_test t;
	struct i915_vma *batch;
	unsigned int id;
	int err = 0;

	/* Submit various sized batches of empty requests, to each engine
	 * (individually), and wait for the batch to complete. We can check
	 * the overhead of submitting requests to the hardware.
	 */

	batch = empty_batch(i915);
	if (IS_ERR(batch))
		return PTR_ERR(batch);

	for_each_engine(engine, i915, id) {
		IGT_TIMEOUT(end_time);
		struct i915_request *request;
		unsigned long n, prime;
		ktime_t times[2] = {};

		err = igt_live_test_begin(&t, i915, __func__, engine->name);
		if (err)
			goto out_batch;

		/* Warmup / preload */
		request = empty_request(engine, batch);
		if (IS_ERR(request)) {
			err = PTR_ERR(request);
			goto out_batch;
		}
		i915_request_wait(request, 0, MAX_SCHEDULE_TIMEOUT);

		for_each_prime_number_from(prime, 1, 8192) {
			times[1] = ktime_get_raw();

			for (n = 0; n < prime; n++) {
				i915_request_put(request);
				request = empty_request(engine, batch);
				if (IS_ERR(request)) {
					err = PTR_ERR(request);
					goto out_batch;
				}
			}
			i915_request_wait(request, 0, MAX_SCHEDULE_TIMEOUT);

			times[1] = ktime_sub(ktime_get_raw(), times[1]);
			if (prime == 1)
				times[0] = times[1];

			if (__igt_timeout(end_time, NULL))
				break;
		}
		i915_request_put(request);

		err = igt_live_test_end(&t);
		if (err)
			goto out_batch;

		pr_info("Batch latencies on %s: 1 = %lluns, %lu = %lluns\n",
			engine->name,
			ktime_to_ns(times[0]),
			prime, div64_u64(ktime_to_ns(times[1]), prime));
	}

out_batch:
	i915_vma_unpin(batch);
	i915_vma_put(batch);
	return err;
}

static struct i915_vma *recursive_batch(struct drm_i915_private *i915)
|
|
|
|
{
|
|
|
|
struct i915_gem_context *ctx = i915->kernel_context;
|
|
|
|
struct drm_i915_gem_object *obj;
|
|
|
|
const int gen = INTEL_GEN(i915);
|
2019-10-04 20:40:09 +07:00
|
|
|
struct i915_address_space *vm;
|
2017-02-14 00:15:25 +07:00
|
|
|
struct i915_vma *vma;
|
|
|
|
u32 *cmd;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
obj = i915_gem_object_create_internal(i915, PAGE_SIZE);
|
|
|
|
if (IS_ERR(obj))
|
|
|
|
return ERR_CAST(obj);
|
|
|
|
|
2019-10-04 20:40:09 +07:00
|
|
|
vm = i915_gem_context_get_vm_rcu(ctx);
|
2017-02-14 00:15:25 +07:00
|
|
|
vma = i915_vma_instance(obj, vm, NULL);
|
2019-10-04 20:40:09 +07:00
|
|
|
i915_vm_put(vm);
|
2017-02-14 00:15:25 +07:00
|
|
|
if (IS_ERR(vma)) {
|
|
|
|
err = PTR_ERR(vma);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = i915_vma_pin(vma, 0, 0, PIN_USER);
|
|
|
|
if (err)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
cmd = i915_gem_object_pin_map(obj, I915_MAP_WC);
|
|
|
|
if (IS_ERR(cmd)) {
|
|
|
|
err = PTR_ERR(cmd);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (gen >= 8) {
|
|
|
|
*cmd++ = MI_BATCH_BUFFER_START | 1 << 8 | 1;
|
|
|
|
*cmd++ = lower_32_bits(vma->node.start);
|
|
|
|
*cmd++ = upper_32_bits(vma->node.start);
|
|
|
|
} else if (gen >= 6) {
|
|
|
|
*cmd++ = MI_BATCH_BUFFER_START | 1 << 8;
|
|
|
|
*cmd++ = lower_32_bits(vma->node.start);
|
|
|
|
} else {
|
2018-07-05 22:47:56 +07:00
|
|
|
*cmd++ = MI_BATCH_BUFFER_START | MI_BATCH_GTT;
|
2017-02-14 00:15:25 +07:00
|
|
|
*cmd++ = lower_32_bits(vma->node.start);
|
|
|
|
}
|
|
|
|
*cmd++ = MI_BATCH_BUFFER_END; /* terminate early in case of error */
|
|
|
|
|
	__i915_gem_object_flush_map(obj, 0, 64);
	i915_gem_object_unpin_map(obj);

	intel_gt_chipset_flush(&i915->gt);

	return vma;

err:
	i915_gem_object_put(obj);
	return ERR_PTR(err);
}

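/*
 * Terminate a spinning batch created by recursive_batch(): rewrite its first
 * dword to MI_BATCH_BUFFER_END and flush so the engine sees the update.
 */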
static int recursive_batch_resolve(struct i915_vma *batch)
{
	u32 *cmd;

	cmd = i915_gem_object_pin_map(batch->obj, I915_MAP_WC);
	if (IS_ERR(cmd))
		return PTR_ERR(cmd);

	*cmd = MI_BATCH_BUFFER_END;
	intel_gt_chipset_flush(batch->vm->gt);

	i915_gem_object_unpin_map(batch->obj);

	return 0;
}

static int live_all_engines(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct intel_engine_cs *engine;
	struct i915_request *request[I915_NUM_ENGINES];
	struct igt_live_test t;
	struct i915_vma *batch;
	unsigned int id;
	int err;

	/* Check we can submit requests to all engines simultaneously. We
	 * send a recursive batch to each engine - checking that we don't
	 * block doing so, and that they don't complete too soon.
	 */

	err = igt_live_test_begin(&t, i915, __func__, "");
	if (err)
		return err;

	batch = recursive_batch(i915);
	if (IS_ERR(batch)) {
		err = PTR_ERR(batch);
		pr_err("%s: Unable to create batch, err=%d\n", __func__, err);
		return err;
	}

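	/*
	 * Queue one copy of the spinning batch to every engine; none of the
	 * submissions should block the submitter.
	 */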
	for_each_engine(engine, i915, id) {
		request[id] = i915_request_create(engine->kernel_context);
		if (IS_ERR(request[id])) {
			err = PTR_ERR(request[id]);
			pr_err("%s: Request allocation failed with err=%d\n",
			       __func__, err);
			goto out_request;
		}

		err = engine->emit_bb_start(request[id],
					    batch->node.start,
					    batch->node.size,
					    0);
		GEM_BUG_ON(err);
		request[id]->batch = batch;

		i915_vma_lock(batch);
		err = i915_request_await_object(request[id], batch->obj, 0);
		if (err == 0)
			err = i915_vma_move_to_active(batch, request[id], 0);
		i915_vma_unlock(batch);
		GEM_BUG_ON(err);

		i915_request_get(request[id]);
		i915_request_add(request[id]);
	}

	for_each_engine(engine, i915, id) {
		if (i915_request_completed(request[id])) {
			pr_err("%s(%s): request completed too early!\n",
			       __func__, engine->name);
			err = -EINVAL;
			goto out_request;
		}
	}

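	/* Stop the shared spinner so every request can run to completion. */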
	err = recursive_batch_resolve(batch);
	if (err) {
		pr_err("%s: failed to resolve batch, err=%d\n", __func__, err);
		goto out_request;
	}

	for_each_engine(engine, i915, id) {
		long timeout;

		timeout = i915_request_wait(request[id], 0,
					    MAX_SCHEDULE_TIMEOUT);
		if (timeout < 0) {
			err = timeout;
			pr_err("%s: error waiting for request on %s, err=%d\n",
			       __func__, engine->name, err);
			goto out_request;
		}

		GEM_BUG_ON(!i915_request_completed(request[id]));
		i915_request_put(request[id]);
		request[id] = NULL;
	}

	err = igt_live_test_end(&t);

out_request:
	for_each_engine(engine, i915, id)
		if (request[id])
			i915_request_put(request[id]);
	i915_vma_unpin(batch);
	i915_vma_put(batch);
	return err;
}

static int live_sequential_engines(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct i915_request *request[I915_NUM_ENGINES] = {};
	struct i915_request *prev = NULL;
	struct intel_engine_cs *engine;
	struct igt_live_test t;
	unsigned int id;
	int err;

	/* Check we can submit requests to all engines sequentially, such
	 * that each successive request waits for the earlier ones. This
	 * tests that we don't execute requests out of order, even though
	 * they are running on independent engines.
	 */

	err = igt_live_test_begin(&t, i915, __func__, "");
	if (err)
		return err;

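	/*
	 * Build a chain: each engine gets its own spinning batch, and each
	 * new request is made to wait on the previous request's fence.
	 */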
	for_each_engine(engine, i915, id) {
		struct i915_vma *batch;

		batch = recursive_batch(i915);
		if (IS_ERR(batch)) {
			err = PTR_ERR(batch);
			pr_err("%s: Unable to create batch for %s, err=%d\n",
			       __func__, engine->name, err);
			return err;
		}

		request[id] = i915_request_create(engine->kernel_context);
		if (IS_ERR(request[id])) {
			err = PTR_ERR(request[id]);
			pr_err("%s: Request allocation failed for %s with err=%d\n",
			       __func__, engine->name, err);
			goto out_request;
		}

		if (prev) {
			err = i915_request_await_dma_fence(request[id],
							   &prev->fence);
			if (err) {
				i915_request_add(request[id]);
				pr_err("%s: Request await failed for %s with err=%d\n",
				       __func__, engine->name, err);
				goto out_request;
			}
		}

		err = engine->emit_bb_start(request[id],
					    batch->node.start,
					    batch->node.size,
					    0);
		GEM_BUG_ON(err);
		request[id]->batch = batch;

		i915_vma_lock(batch);
		err = i915_request_await_object(request[id], batch->obj, false);
		if (err == 0)
			err = i915_vma_move_to_active(batch, request[id], 0);
		i915_vma_unlock(batch);
		GEM_BUG_ON(err);

		i915_request_get(request[id]);
		i915_request_add(request[id]);

		prev = request[id];
	}

	for_each_engine(engine, i915, id) {
		long timeout;

		if (i915_request_completed(request[id])) {
			pr_err("%s(%s): request completed too early!\n",
			       __func__, engine->name);
			err = -EINVAL;
			goto out_request;
		}

		err = recursive_batch_resolve(request[id]->batch);
		if (err) {
			pr_err("%s: failed to resolve batch, err=%d\n",
			       __func__, err);
			goto out_request;
		}

		timeout = i915_request_wait(request[id], 0,
					    MAX_SCHEDULE_TIMEOUT);
		if (timeout < 0) {
			err = timeout;
			pr_err("%s: error waiting for request on %s, err=%d\n",
			       __func__, engine->name, err);
			goto out_request;
		}

		GEM_BUG_ON(!i915_request_completed(request[id]));
	}

	err = igt_live_test_end(&t);

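	/*
	 * On error some batches may still be spinning; terminate them by
	 * hand before dropping the references.
	 */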
out_request:
	for_each_engine(engine, i915, id) {
		u32 *cmd;

		if (!request[id])
			break;

		cmd = i915_gem_object_pin_map(request[id]->batch->obj,
					      I915_MAP_WC);
		if (!IS_ERR(cmd)) {
			*cmd = MI_BATCH_BUFFER_END;
			intel_gt_chipset_flush(engine->gt);

			i915_gem_object_unpin_map(request[id]->batch->obj);
		}

		i915_vma_put(request[id]->batch);
		i915_request_put(request[id]);
	}
	return err;
}

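/*
 * Submit a single request and wait for it to complete before submitting the
 * next; report how many request + sync cycles fit within the timeout.
 */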
static int __live_parallel_engine1(void *arg)
{
	struct intel_engine_cs *engine = arg;
	IGT_TIMEOUT(end_time);
	unsigned long count;

	count = 0;
	do {
		struct i915_request *rq;
		int err;

		rq = i915_request_create(engine->kernel_context);
		if (IS_ERR(rq))
			return PTR_ERR(rq);

		i915_request_get(rq);
		i915_request_add(rq);

		err = 0;
		if (i915_request_wait(rq, 0, HZ / 5) < 0)
			err = -ETIME;
		i915_request_put(rq);
		if (err)
			return err;

		count++;
	} while (!__igt_timeout(end_time, NULL));

	pr_info("%s: %lu request + sync\n", engine->name, count);
	return 0;
}

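/*
 * Submit requests back-to-back without waiting for completion; report how
 * many could be queued within the timeout.
 */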
static int __live_parallel_engineN(void *arg)
{
	struct intel_engine_cs *engine = arg;
	IGT_TIMEOUT(end_time);
	unsigned long count;

	count = 0;
	do {
		struct i915_request *rq;

		rq = i915_request_create(engine->kernel_context);
		if (IS_ERR(rq))
			return PTR_ERR(rq);

		i915_request_add(rq);
		count++;
	} while (!__igt_timeout(end_time, NULL));

	pr_info("%s: %lu requests\n", engine->name, count);
	return 0;
}

static int live_parallel_engines(void *arg)
{
	struct drm_i915_private *i915 = arg;
	static int (* const func[])(void *arg) = {
		__live_parallel_engine1,
		__live_parallel_engineN,
		NULL,
	};
	struct intel_engine_cs *engine;
	enum intel_engine_id id;
	int (* const *fn)(void *arg);
	int err = 0;

	/*
	 * Check we can submit requests to all engines concurrently. This
	 * tests that we load up the system maximally.
	 */

	for (fn = func; !err && *fn; fn++) {
		struct task_struct *tsk[I915_NUM_ENGINES] = {};
		struct igt_live_test t;

		err = igt_live_test_begin(&t, i915, __func__, "");
		if (err)
			break;

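		/* Spawn one submitter kthread per engine, all running *fn. */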
		for_each_engine(engine, i915, id) {
			tsk[id] = kthread_run(*fn, engine,
					      "igt/parallel:%s",
					      engine->name);
			if (IS_ERR(tsk[id])) {
				err = PTR_ERR(tsk[id]);
				break;
			}
			get_task_struct(tsk[id]);
		}

		for_each_engine(engine, i915, id) {
			int status;

			if (IS_ERR_OR_NULL(tsk[id]))
				continue;

			status = kthread_stop(tsk[id]);
			if (status && !err)
				err = status;

			put_task_struct(tsk[id]);
		}

		if (igt_live_test_end(&t))
			err = -EIO;
	}

	return err;
}

static int
max_batches(struct i915_gem_context *ctx, struct intel_engine_cs *engine)
{
	struct i915_request *rq;
	int ret;

	/*
	 * Before execlists, all contexts share the same ringbuffer. With
	 * execlists, each context/engine has a separate ringbuffer and
	 * for the purposes of this test, inexhaustible.
	 *
	 * For the global ringbuffer though, we have to be very careful
	 * that we do not wrap while preventing the execution of requests
	 * with an unsignaled fence.
	 */
	if (HAS_EXECLISTS(ctx->i915))
		return INT_MAX;

	rq = igt_request_alloc(ctx, engine);
	if (IS_ERR(rq)) {
		ret = PTR_ERR(rq);
	} else {
		int sz;

		ret = rq->ring->size - rq->reserved_space;
		i915_request_add(rq);

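		/* Estimate the ring space consumed by a single empty request. */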
		sz = rq->ring->emit - rq->head;
		if (sz < 0)
			sz += rq->ring->size;
		ret /= sz;
		ret /= 2; /* leave half spare, in case of emergency! */
	}

	return ret;
}

static int live_breadcrumbs_smoketest(void *arg)
{
	struct drm_i915_private *i915 = arg;
	struct smoketest t[I915_NUM_ENGINES];
	unsigned int ncpus = num_online_cpus();
	unsigned long num_waits, num_fences;
	struct intel_engine_cs *engine;
	struct task_struct **threads;
	struct igt_live_test live;
	enum intel_engine_id id;
	intel_wakeref_t wakeref;
	struct drm_file *file;
	unsigned int n;
	int ret = 0;

	/*
	 * Smoketest our breadcrumb/signal handling for requests across multiple
	 * threads. A very simple test to catch only the most egregious of bugs.
	 * See __igt_breadcrumbs_smoketest();
	 *
	 * On real hardware this time.
	 */

	wakeref = intel_runtime_pm_get(&i915->runtime_pm);

	file = mock_file(i915);
	if (IS_ERR(file)) {
		ret = PTR_ERR(file);
		goto out_rpm;
	}

	threads = kcalloc(ncpus * I915_NUM_ENGINES,
			  sizeof(*threads),
			  GFP_KERNEL);
	if (!threads) {
		ret = -ENOMEM;
		goto out_file;
	}

	memset(&t[0], 0, sizeof(t[0]));
	t[0].request_alloc = __live_request_alloc;
	t[0].ncontexts = 64;
	t[0].contexts = kmalloc_array(t[0].ncontexts,
				      sizeof(*t[0].contexts),
				      GFP_KERNEL);
	if (!t[0].contexts) {
		ret = -ENOMEM;
		goto out_threads;
	}

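	/* Populate a pool of contexts shared by the smoketests on every engine. */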
	for (n = 0; n < t[0].ncontexts; n++) {
		t[0].contexts[n] = live_context(i915, file);
		if (!t[0].contexts[n]) {
			ret = -ENOMEM;
			goto out_contexts;
		}
	}

	ret = igt_live_test_begin(&live, i915, __func__, "");
	if (ret)
		goto out_contexts;

	for_each_engine(engine, i915, id) {
		t[id] = t[0];
		t[id].engine = engine;
		t[id].max_batch = max_batches(t[0].contexts[0], engine);
		if (t[id].max_batch < 0) {
			ret = t[id].max_batch;
			goto out_flush;
		}
		/* One ring interleaved between requests from all cpus */
		t[id].max_batch /= num_online_cpus() + 1;
		pr_debug("Limiting batches to %d requests on %s\n",
			 t[id].max_batch, engine->name);

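		/* Spawn one smoketest thread per online cpu for this engine. */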
		for (n = 0; n < ncpus; n++) {
			struct task_struct *tsk;

			tsk = kthread_run(__igt_breadcrumbs_smoketest,
					  &t[id], "igt/%d.%d", id, n);
			if (IS_ERR(tsk)) {
				ret = PTR_ERR(tsk);
				goto out_flush;
			}

			get_task_struct(tsk);
			threads[id * ncpus + n] = tsk;
		}
	}

	msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));

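	/* Reap every thread and accumulate the wait/fence statistics. */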
out_flush:
	num_waits = 0;
	num_fences = 0;
	for_each_engine(engine, i915, id) {
		for (n = 0; n < ncpus; n++) {
			struct task_struct *tsk = threads[id * ncpus + n];
			int err;

			if (!tsk)
				continue;

			err = kthread_stop(tsk);
			if (err < 0 && !ret)
				ret = err;

			put_task_struct(tsk);
		}

		num_waits += atomic_long_read(&t[id].num_waits);
		num_fences += atomic_long_read(&t[id].num_fences);
	}
	pr_info("Completed %lu waits for %lu fences across %d engines and %d cpus\n",
		num_waits, num_fences, RUNTIME_INFO(i915)->num_engines, ncpus);

	ret = igt_live_test_end(&live) ?: ret;
out_contexts:
	kfree(t[0].contexts);
out_threads:
	kfree(threads);
out_file:
	mock_file_free(i915, file);
out_rpm:
	intel_runtime_pm_put(&i915->runtime_pm, wakeref);

	return ret;
}

int i915_request_live_selftests(struct drm_i915_private *i915)
{
	static const struct i915_subtest tests[] = {
		SUBTEST(live_nop_request),
		SUBTEST(live_all_engines),
		SUBTEST(live_sequential_engines),
		SUBTEST(live_parallel_engines),
		SUBTEST(live_empty_request),
		SUBTEST(live_breadcrumbs_smoketest),
	};

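	/* Skip the live tests if the GPU is already terminally wedged. */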
	if (intel_gt_is_wedged(&i915->gt))
		return 0;

	return i915_subtests(tests, i915);
}