Execlist based engine-reset

Submitted by Michel Thierry on Dec. 16, 2016, 8:20 p.m.

Details

Reviewer None
Submitted Dec. 16, 2016, 8:20 p.m.
Last Updated Jan. 12, 2017, 4:53 a.m.
Revision 2

Cover Letter(s)

Revision 1
These patches add the engine reset feature for Gen8 onwards, also referred
to as Timeout Detection and Recovery (TDR). It complements the full GPU
reset already available in i915 by resetting only a particular engine
instead of all engines, thus providing a lightweight engine reset and
recovery mechanism.

This implementation is for execlist-based submission only and is therefore
limited to Gen8 onwards. For GuC-based submission, additional changes can
be added later on.

Timeout detection relies on the existing hangcheck, which remains the same;
the main changes are to the recovery mechanism. Once we detect a hang on a
particular engine, we identify the request that caused the hang, skip that
request, and adjust the head pointers to allow execution to proceed
normally. After some cleanup, submissions are restarted to process the
remaining work queued to that engine.

If the engine reset fails to recover the engine correctly, we fall back to
a full GPU reset.
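
As an illustration only, the recovery steps described above (find the hung
request, skip it, move the head past it, restart submission) can be modelled
in a few lines of plain C. This is a simplified user-space sketch, not code
from the series: struct engine, struct request, find_hung_request(),
engine_recover() and engine_resume() are hypothetical names, and the "hung"
request is assumed to be the oldest one that has not completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; none of these names come from i915. */
#define QUEUE_LEN 8

struct request {
	unsigned int seqno;	/* submission order */
	bool completed;		/* finished by the engine */
	bool skipped;		/* blamed for the hang and skipped */
};

struct engine {
	struct request queue[QUEUE_LEN];
	size_t count;		/* number of queued requests */
	size_t head;		/* index of the next request to execute */
	unsigned int reset_count;
};

/* Assumption: the hung request is the oldest one still pending. */
static struct request *find_hung_request(struct engine *e)
{
	for (size_t i = 0; i < e->count; i++)
		if (!e->queue[i].completed && !e->queue[i].skipped)
			return &e->queue[i];
	return NULL;
}

/* Skip the hung request and move head past it; false if nothing is hung. */
static bool engine_recover(struct engine *e)
{
	struct request *hung = find_hung_request(e);

	if (!hung)
		return false;

	hung->skipped = true;
	e->head = (size_t)(hung - e->queue) + 1;
	e->reset_count++;
	return true;
}

/* Restart submission: run everything from head on, return how many ran. */
static size_t engine_resume(struct engine *e)
{
	size_t executed = 0;

	assert(e->head <= e->count);
	for (; e->head < e->count; e->head++) {
		e->queue[e->head].completed = true;
		executed++;
	}
	return executed;
}
```

With four queued requests where only the first has completed,
engine_recover() marks the second as skipped and engine_resume() then runs
the remaining two.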

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

Arun Siluvery (6):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export per-engine reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support

Michel Thierry (2):
  drm/i915: Keep i915_handle_error kerneldoc parameters together
  drm/i915: Add engine reset count in get-reset-stats ioctl

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c     | 18 +++++++
 drivers/gpu/drm/i915/i915_drv.c         | 74 +++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         | 15 ++++++
 drivers/gpu/drm/i915/i915_gem.c         |  2 +-
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++--
 drivers/gpu/drm/i915/i915_gpu_error.c   |  3 ++
 drivers/gpu/drm/i915/i915_irq.c         | 91 ++++++++++++++++++++++++---------
 drivers/gpu/drm/i915/i915_params.c      |  6 +--
 drivers/gpu/drm/i915/i915_params.h      |  2 +-
 drivers/gpu/drm/i915/i915_pci.c         |  5 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 12 +++++
 drivers/gpu/drm/i915/intel_lrc.h        |  1 +
 drivers/gpu/drm/i915/intel_uncore.c     | 61 +++++++++++++++++++---
 include/uapi/drm/i915_drm.h             |  3 +-
 14 files changed, 266 insertions(+), 41 deletions(-)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    
Revision 2
These patches add the engine reset feature for Gen8 onwards, also referred
to as Timeout Detection and Recovery (TDR). It complements the full GPU
reset already available in i915 by resetting only a particular engine
instead of all engines, thus providing a lightweight engine reset and
recovery mechanism.

This implementation is for execlist-based submission only and is therefore
limited to Gen8 onwards. For GuC-based submission, additional changes can
be added later on.

Timeout detection relies on the existing hangcheck, which remains the same;
the main changes are to the recovery mechanism. Once we detect a hang on a
particular engine, we identify the request that caused the hang, skip that
request, and adjust the head pointers to allow execution to proceed
normally. After some cleanup, submissions are restarted to process the
remaining work queued to that engine.

If the engine reset fails to recover the engine correctly, we fall back to
a full GPU reset.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

v4: Handle reset as a two-level reset, first trying an engine-only reset
and falling back to a full/chip reset as needed, i.e. reset_engine will
need the struct_mutex.
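
The two-level scheme in v4 can be pictured roughly as follows. This is a
hedged user-space sketch under stated assumptions, not the driver code:
struct gpu, handle_hang() and the engine_reset_fn callback are hypothetical
names, locking (struct_mutex) is left out entirely, and the per-engine and
full reset counters only mirror the idea of the reset counts the series
exports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a two-level reset; names are not from i915. */
#define NUM_ENGINES 4

struct gpu {
	unsigned int engine_reset_count[NUM_ENGINES];
	unsigned int full_reset_count;
};

enum reset_level { RESET_ENGINE, RESET_FULL };

/* Callback that attempts an engine-only reset; true means it recovered. */
typedef bool (*engine_reset_fn)(unsigned int engine_id);

/*
 * Level 1: try to reset only the hung engine.
 * Level 2: if that fails (or is unavailable), fall back to a full reset.
 */
static enum reset_level handle_hang(struct gpu *gpu, unsigned int engine_id,
				    engine_reset_fn try_engine_reset)
{
	assert(engine_id < NUM_ENGINES);

	if (try_engine_reset && try_engine_reset(engine_id)) {
		gpu->engine_reset_count[engine_id]++;
		return RESET_ENGINE;
	}

	gpu->full_reset_count++;
	return RESET_FULL;
}

/* Stand-in callbacks so both paths can be exercised. */
static bool reset_ok(unsigned int engine_id)
{
	(void)engine_id;
	return true;
}

static bool reset_fails(unsigned int engine_id)
{
	(void)engine_id;
	return false;
}
```

Calling handle_hang() with reset_ok bumps only the counter of the affected
engine, while reset_fails (or a NULL callback) takes the full-reset path.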

Arun Siluvery (6):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export per-engine reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support

Michel Thierry (3):
  drm/i915: Keep i915_handle_error kerneldoc parameters together
  drm/i915: Update i915_reset parameter for kerneldoc
  drm/i915: Add engine reset count in get-reset-stats ioctl

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c     |  21 ++++++
 drivers/gpu/drm/i915/i915_drv.c         | 118 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h         |  16 +++++
 drivers/gpu/drm/i915/i915_gem.c         |   2 +-
 drivers/gpu/drm/i915/i915_gem_context.c |  14 +++-
 drivers/gpu/drm/i915/i915_gpu_error.c   |   3 +
 drivers/gpu/drm/i915/i915_irq.c         |  34 ++++++---
 drivers/gpu/drm/i915/i915_params.c      |   6 +-
 drivers/gpu/drm/i915/i915_params.h      |   2 +-
 drivers/gpu/drm/i915/i915_pci.c         |   5 +-
 drivers/gpu/drm/i915/intel_uncore.c     |  61 +++++++++++++++--
 include/uapi/drm/i915_drm.h             |   3 +-
 12 files changed, 258 insertions(+), 27 deletions(-)
    

Tests

Series 16936v1 Execlist based engine-reset
https://patchwork.freedesktop.org/api/1.0/series/16936/revisions/1/mbox/

Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                pass       -> SKIP       (fi-bxt-j4205)

fi-bdw-5557u     total:247  pass:233  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:247  pass:208  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:247  pass:222  dwarn:0   dfail:0   fail:0   skip:25 
fi-bxt-t5700     total:247  pass:220  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-j1900     total:247  pass:220  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-ilk-650       total:247  pass:195  dwarn:0   dfail:0   fail:0   skip:52 
fi-ivb-3520m     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6260u     total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:247  pass:227  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:247  pass:224  dwarn:3   dfail:0   fail:0   skip:20 
fi-skl-6770hq    total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:247  pass:215  dwarn:0   dfail:0   fail:0   skip:32 

705f1e8fef81d504f0032df8e21bdc2e74850b3a drm-tip: 2016y-12m-16d-15h-40m-02s UTC integration manifest
9a591cb drm/i915: Add engine reset count in get-reset-stats ioctl
709af8f drm/i915/tdr: Enable Engine reset and recovery support
f37aa42 drm/i915/tdr: Export per-engine reset count info to debugfs
4c1e046 drm/i915/tdr: Add engine reset count to error state
41e96eb drm/i915: Skip reset request if there is one already
d3b2dc8f4 drm/i915/tdr: Add support for per engine reset recovery
d654f09 drm/i915/tdr: Modify error handler for per engine hang recovery
db8e495 drm/i915: Update i915.reset to handle engine resets
b94f5b6 drm/i915: Keep i915_handle_error kerneldoc parameters together

Tests

Series 16936v2 Execlist based engine-reset
https://patchwork.freedesktop.org/api/1.0/series/16936/revisions/2/mbox/


fi-bdw-5557u     total:246  pass:232  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:246  pass:207  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:246  pass:224  dwarn:0   dfail:0   fail:0   skip:22 
fi-bxt-t5700     total:82   pass:69   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:246  pass:219  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:246  pass:215  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:246  pass:227  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:246  pass:227  dwarn:0   dfail:0   fail:0   skip:19 
fi-ivb-3520m     total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6260u     total:246  pass:233  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:246  pass:226  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:246  pass:222  dwarn:3   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:246  pass:233  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:246  pass:215  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:246  pass:214  dwarn:0   dfail:0   fail:0   skip:32 

60f8884d35facd41e1b085a19444205ec13a5da0 drm-tip: 2017y-01m-11d-20h-53m-23s UTC integration manifest
791b801 drm/i915: Add engine reset count in get-reset-stats ioctl
6259d0a drm/i915/tdr: Enable Engine reset and recovery support
d4c2b71 drm/i915/tdr: Export per-engine reset count info to debugfs
80387dc drm/i915/tdr: Add engine reset count to error state
a301db6 drm/i915: Skip reset request if there is one already
2c9f879 drm/i915/tdr: Add support for per engine reset recovery
951cafa drm/i915/tdr: Modify error handler for per engine hang recovery
10cc65e drm/i915: Update i915.reset to handle engine resets
4651e88 drm/i915: Update i915_reset parameter for kerneldoc
83fb80c drm/i915: Keep i915_handle_error kerneldoc parameters together