[v2,2/3] drm/i915: Limit the busy wait on requests to 10us not 10ms!

Submitted by Chris Wilson on Nov. 27, 2015, 2:26 p.m.

Details

Message ID 1448634382-8987-2-git-send-email-chris@chris-wilson.co.uk
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Chris Wilson Nov. 27, 2015, 2:26 p.m.
When waiting for high frequency requests, the finite amount of time
required to set up the irq and wait upon it limits the response rate. By
busywaiting on the request completion for a short while we can service
the high frequency waits as quick as possible. However, if it is a slow
request, we want to sleep as quickly as possible. The tradeoff between
waiting and sleeping is roughly the time it takes to sleep on a request,
on the order of a microsecond. Based on measurements of synchronous
workloads from across big core and little atom, I have set the limit for
busywaiting as 10 microseconds. In most of the synchronous cases, we can
reduce the limit down to as little as 2 miscroseconds, but that leaves
quite a few test cases regressing by factors of 3 and more.

The code currently uses the jiffie clock, but that is far too coarse (on
the order of 10 milliseconds) and results in poor interactivity as the
CPU ends up being hogged by slow requests. To get microsecond resolution
we need to use a high resolution timer. The cheapest of which is polling
local_clock(), but that is only valid on the same CPU. If we switch CPUs
because the task was preempted, we can also use that as an indicator that
 the system is too busy to waste cycles on spinning and we should sleep
instead.

__i915_spin_request was introduced in
commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 16:20:41 2015 +0100

     drm/i915: Optimistically spin for the request completion

v2: Drop full u64 for unsigned long - the timer is 32bit wraparound safe,
so we can use native register sizes on smaller architectures. Mention
the approximate microseconds units for elapsed time and add some extra
comments describing the reason for busywaiting.

v3: Raise the limit to 10us

Reported-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.org/lkml/2015/11/12/621
Cc; "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@kernel.vger.org
---
 drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 51b6e6f8a44e..c0941e8a70f7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,14 +1146,57 @@  static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
+static unsigned long local_clock_us(unsigned *cpu)
+{
+	unsigned long t;
+
+	/* Cheaply and approximately convert from nanoseconds to microseconds.
+	 * The result and subsequent calculations are also defined in the same
+	 * approximate microseconds units. The principal source of timing
+	 * error here is from the simple truncation.
+	 *
+	 * Note that local_clock() is only defined wrt to the current CPU;
+	 * the comparisons are no longer valid if we switch CPUs. Instead of
+	 * blocking preemption for the entire busywait, we can detect the CPU
+	 * switch and use that as indicator of system load and a reason to
+	 * stop busywaiting, see busywait_stop().
+	 */
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
+	put_cpu();
+
+	return t;
+}
+
+static bool busywait_stop(unsigned long timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after(local_clock_us(&this_cpu), timeout))
+		return true;
+
+	return this_cpu != cpu;
+}
+
 static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
 	unsigned long timeout;
+	unsigned cpu;
+
+	/* When waiting for high frequency requests, e.g. during synchronous
+	 * rendering split between the CPU and GPU, the finite amount of time
+	 * required to set up the irq and wait upon it limits the response
+	 * rate. By busywaiting on the request completion for a short while we
+	 * can service the high frequency waits as quick as possible. However,
+	 * if it is a slow request, we want to sleep as quickly as possible.
+	 * The tradeoff between waiting and sleeping is roughly the time it
+	 * takes to sleep on a request, on the order of a microsecond.
+	 */
 
 	if (i915_gem_request_get_ring(req)->irq_refcount)
 		return -EBUSY;
 
-	timeout = jiffies + 1;
+	timeout = local_clock_us(&cpu) + 10;
 	while (!need_resched()) {
 		if (i915_gem_request_completed(req, true))
 			return 0;
@@ -1161,7 +1204,7 @@  static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;
 
-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;
 
 		cpu_relax_lowlatency();

Comments

On 27/11/15 14:26, Chris Wilson wrote:
> When waiting for high frequency requests, the finite amount of time
> required to set up the irq and wait upon it limits the response rate. By
> busywaiting on the request completion for a short while we can service
> the high frequency waits as quick as possible. However, if it is a slow
> request, we want to sleep as quickly as possible. The tradeoff between
> waiting and sleeping is roughly the time it takes to sleep on a request,
> on the order of a microsecond. Based on measurements of synchronous
> workloads from across big core and little atom, I have set the limit for
> busywaiting as 10 microseconds. In most of the synchronous cases, we can
> reduce the limit down to as little as 2 miscroseconds, but that leaves
> quite a few test cases regressing by factors of 3 and more.
>
> The code currently uses the jiffie clock, but that is far too coarse (on
> the order of 10 milliseconds) and results in poor interactivity as the
> CPU ends up being hogged by slow requests. To get microsecond resolution
> we need to use a high resolution timer. The cheapest of which is polling
> local_clock(), but that is only valid on the same CPU. If we switch CPUs
> because the task was preempted, we can also use that as an indicator that
>   the system is too busy to waste cycles on spinning and we should sleep
> instead.
>
> __i915_spin_request was introduced in
> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Apr 7 16:20:41 2015 +0100
>
>       drm/i915: Optimistically spin for the request completion
>
> v2: Drop full u64 for unsigned long - the timer is 32bit wraparound safe,
> so we can use native register sizes on smaller architectures. Mention
> the approximate microseconds units for elapsed time and add some extra
> comments describing the reason for busywaiting.
>
> v3: Raise the limit to 10us

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko