[17/19] drm/i915: Use SSE4.1 movntdqa based memcpy for sampling GuC log buffer

Submitted by Akash Goel on Aug. 19, 2016, 8:43 a.m.

Details

Message ID 1471596198-30748-18-git-send-email-akash.goel@intel.com
State New
Headers show
Series "Support for sustained capturing of GuC firmware logs" ( rev: 9 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Akash Goel Aug. 19, 2016, 8:43 a.m.
From: Akash Goel <akash.goel@intel.com>

In order to have fast reads from the GuC log buffer, used SSE4.1 movntdqa
based memcpy function i915_memcpy_from_wc.
GuC log buffer has a WC type vmalloc mapping and copying using movntqda
from WC type memory is almost as fast as reading from WB memory.
This will further reduce the log buffer sampling time, so is needed dearly
to deal with the flush interrupt storm when GuC is generating logs at a
very high rate.
Ideally SSE 4.1 should be present on all chipsets supporting GuC based
submisssions, but if not then logging will not be enabled.

v2: Rebase.

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Akash Goel <akash.goel@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 7c889e1..c214a7c 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1128,13 +1128,13 @@  static void guc_read_update_log_buffer(struct intel_guc *guc)
 
 		/* Just copy the newly written data */
 		if (read_offset > write_offset) {
-			memcpy(dst_data, src_data, write_offset);
+			i915_memcpy_from_wc(dst_data, src_data, write_offset);
 			bytes_to_copy = buffer_size - read_offset;
 		} else {
 			bytes_to_copy = write_offset - read_offset;
 		}
-		memcpy(dst_data + read_offset,
-		       src_data + read_offset, bytes_to_copy);
+		i915_memcpy_from_wc(dst_data + read_offset,
+				    src_data + read_offset, bytes_to_copy);
 
 		src_data += buffer_size;
 		dst_data += buffer_size;
@@ -1247,6 +1247,16 @@  static void guc_create_log(struct intel_guc *guc)
 
 	vma = guc->log.vma;
 	if (!vma) {
+		/* We require SSE 4.1 for fast reads from the GuC log buffer and
+		 * it should be present on the chipsets supporting GuC based
+		 * submisssions.
+		 */
+		if (WARN_ON(!i915_memcpy_from_wc(NULL, NULL, 0))) {
+			/* logging will not be enabled */
+			i915.guc_log_level = -1;
+			return;
+		}
+
 		vma = guc_allocate_vma(guc, size);
 		if (IS_ERR(vma)) {
 			/* logging will be off */

Comments

On Fri, Aug 19, 2016 at 02:13:16PM +0530, akash.goel@intel.com wrote:
> From: Akash Goel <akash.goel@intel.com>
> 
> In order to have fast reads from the GuC log buffer, used SSE4.1 movntdqa
> based memcpy function i915_memcpy_from_wc.
> GuC log buffer has a WC type vmalloc mapping and copying using movntqda
> from WC type memory is almost as fast as reading from WB memory.
> This will further reduce the log buffer sampling time, so is needed dearly
> to deal with the flush interrupt storm when GuC is generating logs at a
> very high rate.
> Ideally SSE 4.1 should be present on all chipsets supporting GuC based
> submisssions, but if not then logging will not be enabled.
> 
> v2: Rebase.
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Akash Goel <akash.goel@intel.com>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Should be squashed with patch 16 (use MAP_WC).
-Chris
On 8/19/2016 11:49 PM, Chris Wilson wrote:
> On Fri, Aug 19, 2016 at 02:13:16PM +0530, akash.goel@intel.com wrote:
>> From: Akash Goel <akash.goel@intel.com>
>>
>> In order to have fast reads from the GuC log buffer, used SSE4.1 movntdqa
>> based memcpy function i915_memcpy_from_wc.
>> GuC log buffer has a WC type vmalloc mapping and copying using movntqda
>> from WC type memory is almost as fast as reading from WB memory.
>> This will further reduce the log buffer sampling time, so is needed dearly
>> to deal with the flush interrupt storm when GuC is generating logs at a
>> very high rate.
>> Ideally SSE 4.1 should be present on all chipsets supporting GuC based
>> submisssions, but if not then logging will not be enabled.
>>
>> v2: Rebase.
>>
>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Akash Goel <akash.goel@intel.com>
>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Should be squashed with patch 16 (use MAP_WC).
Fine will squash, but please could you tell that what issue could be 
there with 2 patches being separate. Either both will be merged or none 
of them will be merged.

Best regards
Akash

> -Chris
>
On Sat, Aug 20, 2016 at 12:14:46PM +0530, Goel, Akash wrote:
> 
> 
> On 8/19/2016 11:49 PM, Chris Wilson wrote:
> >On Fri, Aug 19, 2016 at 02:13:16PM +0530, akash.goel@intel.com wrote:
> >>From: Akash Goel <akash.goel@intel.com>
> >>
> >>In order to have fast reads from the GuC log buffer, used SSE4.1 movntdqa
> >>based memcpy function i915_memcpy_from_wc.
> >>GuC log buffer has a WC type vmalloc mapping and copying using movntqda
> >>from WC type memory is almost as fast as reading from WB memory.
> >>This will further reduce the log buffer sampling time, so is needed dearly
> >>to deal with the flush interrupt storm when GuC is generating logs at a
> >>very high rate.
> >>Ideally SSE 4.1 should be present on all chipsets supporting GuC based
> >>submisssions, but if not then logging will not be enabled.
> >>
> >>v2: Rebase.
> >>
> >>Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>Signed-off-by: Akash Goel <akash.goel@intel.com>
> >>Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> >Should be squashed with patch 16 (use MAP_WC).
> Fine will squash, but please could you tell that what issue could be
> there with 2 patches being separate. Either both will be merged or
> none of them will be merged.

Further reflection is that the initial patch to map the log should be
with MAP_WC. Correctness first, then the performance patch.

As it stands, the story being told is:

1. Enable readback.
2. Oops, that isn't correct, better map it WC
3. Oops, that is too slow, better use movntqa.

A better story is

1. Enable readback, enforcing coherency.
2. Accelerate readback for !llc.