[Mesa-dev,7/9] i965/vec4: avoid dependency control around Align1 instructions

Submitted by Iago Toral Quiroga on Nov. 19, 2015, 10:05 a.m.

Details

Message ID 1447927518-30927-8-git-send-email-itoral@igalia.com
State New
Headers show
Series "Early fp64 fixes" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga Nov. 19, 2015, 10:05 a.m.
From: Connor Abbott <connor.w.abbott@intel.com>

It appears that not only math instructions, but also MOV_BYTES or
any instruction that uses Align1 mode cannot be in the middle
of a dependency control sequence or the GPU will hang (at least on my
BDW). This fixes GPU hangs in some fp64 tests.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
---
 src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index 3bcd5cb..bc0a33b 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -838,6 +838,17 @@  vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
    }
 
    /*
+    * Instructions that use Align1 mode cause the GPU to hang when inserted
+    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
+    */
+
+   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
+       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
+       inst->is_math())
+      return true;
+
+
+   /*
     * mlen:
     * In the presence of send messages, totally interrupt dependency
     * control. They're long enough that the chance of dependency
@@ -851,12 +862,8 @@  vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
     * enable of the last instruction, the optimization must be avoided. This is
     * to avoid instructions being shot down the pipeline when no writes are
     * required.
-    *
-    * math:
-    * Dependency control does not work well over math instructions.
-    * NB: Discovered empirically
     */
-   return (inst->mlen || inst->predicate || inst->is_math());
+   return (inst->mlen || inst->predicate);
 }
 
 /**

Comments

On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
> From: Connor Abbott <connor.w.abbott@intel.com>
>
> It appears that not only math instructions, but also MOV_BYTES or
> any instruction that uses Align1 mode cannot be in the middle
> of a dependency control sequence or the GPU will hang (at least on my
> BDW). This fixes GPU hangs in some fp64 tests.

I'm pretty surprised by this assessment. Doubtful even.

> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
> ---
>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index 3bcd5cb..bc0a33b 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>     }
>
>     /*
> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
> +    */
> +
> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||

PACK_BYTES sets depctrl itself in the generator, and at the time I
added it I made a test that did

  vec4 foo = vec4(packUnorm4x8(...),
                  packUnorm4x8(...),
                  packUnorm4x8(...),
                  packUnorm4x8(...))

and confirmed that it set depctrl properly on the whole sequence.
There could of course be bugs somewhere, but the "hardware doesn't
work if you mix align1 and align16 depctrl" seems unlikely.

Do you know of a test that this affects?
On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>> From: Connor Abbott <connor.w.abbott@intel.com>
>>
>> It appears that not only math instructions, but also MOV_BYTES or
>> any instruction that uses Align1 mode cannot be in the middle
>> of a dependency control sequence or the GPU will hang (at least on my
>> BDW). This fixes GPU hangs in some fp64 tests.
>
> I'm pretty surprised by this assessment. Doubtful even.
>
>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> index 3bcd5cb..bc0a33b 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>     }
>>
>>     /*
>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>> +    */
>> +
>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>
> PACK_BYTES sets depctrl itself in the generator, and at the time I
> added it I made a test that did
>
>   vec4 foo = vec4(packUnorm4x8(...),
>                   packUnorm4x8(...),
>                   packUnorm4x8(...),
>                   packUnorm4x8(...))
>
> and confirmed that it set depctrl properly on the whole sequence.
> There could of course be bugs somewhere, but the "hardware doesn't
> work if you mix align1 and align16 depctrl" seems unlikely.
>
> Do you know of a test that this affects?

This only affects FP64 tests, since there we use an align1 mov to do
double-to-float and float-to-double. However, I tried commenting out
emit_nir_code() and just doing essentially:

emit(MOV(...))->force_writemask_all = true;
emit(VEC4_OPCODE_PACK_BYTES, ...);
emit(MOV(...))->force_writemask_all = true;

and on my BDW it hanged. In case it's not clear: this isn't about
setting depctrl on the instruction itself, it just can't be inside of
a depctrl sequence (which we were already disallowing for math
instructions anyways).
On Thu, Nov 19, 2015 at 10:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>
>>> It appears that not only math instructions, but also MOV_BYTES or
>>> any instruction that uses Align1 mode cannot be in the middle
>>> of a dependency control sequence or the GPU will hang (at least on my
>>> BDW). This fixes GPU hangs in some fp64 tests.
>>
>> I'm pretty surprised by this assessment. Doubtful even.
>>
>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> index 3bcd5cb..bc0a33b 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>     }
>>>
>>>     /*
>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>> +    */
>>> +
>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>
>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>> added it I made a test that did
>>
>>   vec4 foo = vec4(packUnorm4x8(...),
>>                   packUnorm4x8(...),
>>                   packUnorm4x8(...),
>>                   packUnorm4x8(...))
>>
>> and confirmed that it set depctrl properly on the whole sequence.
>> There could of course be bugs somewhere, but the "hardware doesn't
>> work if you mix align1 and align16 depctrl" seems unlikely.
>>
>> Do you know of a test that this affects?
>
> This only affects FP64 tests, since there we use an align1 mov to do
> double-to-float and float-to-double. However, I tried commenting out
> emit_nir_code() and just doing essentially:
>
> emit(MOV(...))->force_writemask_all = true;
> emit(VEC4_OPCODE_PACK_BYTES, ...);
> emit(MOV(...))->force_writemask_all = true;

Err, I meant:

emit(MOV(...))->no_dd_clear = true;
emit(VEC4_OPCODE_PACK_BYTES, ...);
emit(MOV(...))->no_dd_check = true;

(this also hangs with my DOUBLE_TO_FLOAT and FLOAT_TO_DOUBLE opcodes,
which don't use the data dependency bits at all).

>
> and on my BDW it hanged. In case it's not clear: this isn't about
> setting depctrl on the instruction itself, it just can't be inside of
> a depctrl sequence (which we were already disallowing for math
> instructions anyways).
On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>
>>> It appears that not only math instructions, but also MOV_BYTES or
>>> any instruction that uses Align1 mode cannot be in the middle
>>> of a dependency control sequence or the GPU will hang (at least on my
>>> BDW). This fixes GPU hangs in some fp64 tests.
>>
>> I'm pretty surprised by this assessment. Doubtful even.
>>
>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> index 3bcd5cb..bc0a33b 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>     }
>>>
>>>     /*
>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>> +    */
>>> +
>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>
>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>> added it I made a test that did
>>
>>   vec4 foo = vec4(packUnorm4x8(...),
>>                   packUnorm4x8(...),
>>                   packUnorm4x8(...),
>>                   packUnorm4x8(...))
>>
>> and confirmed that it set depctrl properly on the whole sequence.
>> There could of course be bugs somewhere, but the "hardware doesn't
>> work if you mix align1 and align16 depctrl" seems unlikely.
>>
>> Do you know of a test that this affects?
>
> This only affects FP64 tests, since there we use an align1 mov to do
> double-to-float and float-to-double. However, I tried commenting out
> emit_nir_code() and just doing essentially:
>
> emit(MOV(...))->force_writemask_all = true;
> emit(VEC4_OPCODE_PACK_BYTES, ...);
> emit(MOV(...))->force_writemask_all = true;
>
> and on my BDW it hanged. In case it's not clear: this isn't about
> setting depctrl on the instruction itself, it just can't be inside of
> a depctrl sequence (which we were already disallowing for math
> instructions anyways).

Very weird. I'll take a look. So I understand, are the MOV
instructions writing different channels of the same register? And
VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
as the MOVs? (I saw your fixup reply)

By the way, the math code is too heavy handed as far as I know. The
BDW+ docs say that the MATH instruction itself cannot take dependency
control hints (and empirically earlier platforms seem to have problems
with this as well, see
tests/shaders/dependency-hints/exp2.shader_test) -- nothing about a
math instruction being in the middle of a NoDDC* block. The person who
implemented the math did the minimal amount of work to fix the
problem.

The PRM also says:

"""
Instructions other than send, may use this control as long as
operations that have different pipeline latencies are not mixed. The
operations that have longer latencies are:

Opcodes pln, lrp, dp*.
Operations involving double precision computation.
Integer DW multiplication where both source operands are DWs.
"""

I would say that mixing a double-precision operation and something
else might cause problems, but that seems like we have a different
problem thus far.
On Thu, Nov 19, 2015 at 1:54 PM, Matt Turner <mattst88@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>>
>>>> It appears that not only math instructions, but also MOV_BYTES or
>>>> any instruction that uses Align1 mode cannot be in the middle
>>>> of a dependency control sequence or the GPU will hang (at least on my
>>>> BDW). This fixes GPU hangs in some fp64 tests.
>>>
>>> I'm pretty surprised by this assessment. Doubtful even.
>>>
>>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> index 3bcd5cb..bc0a33b 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>>     }
>>>>
>>>>     /*
>>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>>> +    */
>>>> +
>>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>>
>>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>>> added it I made a test that did
>>>
>>>   vec4 foo = vec4(packUnorm4x8(...),
>>>                   packUnorm4x8(...),
>>>                   packUnorm4x8(...),
>>>                   packUnorm4x8(...))
>>>
>>> and confirmed that it set depctrl properly on the whole sequence.
>>> There could of course be bugs somewhere, but the "hardware doesn't
>>> work if you mix align1 and align16 depctrl" seems unlikely.
>>>
>>> Do you know of a test that this affects?
>>
>> This only affects FP64 tests, since there we use an align1 mov to do
>> double-to-float and float-to-double. However, I tried commenting out
>> emit_nir_code() and just doing essentially:
>>
>> emit(MOV(...))->force_writemask_all = true;
>> emit(VEC4_OPCODE_PACK_BYTES, ...);
>> emit(MOV(...))->force_writemask_all = true;
>>
>> and on my BDW it hanged. In case it's not clear: this isn't about
>> setting depctrl on the instruction itself, it just can't be inside of
>> a depctrl sequence (which we were already disallowing for math
>> instructions anyways).
>
> Very weird. I'll take a look. So I understand, are the MOV
> instructions writing different channels of the same register? And
> VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
> as the MOVs? (I saw your fixup reply)

Actually, I had them writing the same thing so the second overwrote
the first one. The PACK_BYTES/MOV_BYTES/F2D/D2F (I think I tested all
of them, not sure) were operating on completely different registers,
and in the FP64 test that actually hung the GPU they were as well.
Using d2f since it's simpler and I remember what the operands are
(it's just an align1 mov with a dest stride of 2), the test code was
something like:

mov g50, g51 { no_dd_clear }
d2f g52, g54
mov g50, g53 { no_dd_check }

and changing the d2f to a normal align16 mov or commenting it out
prevented the hang. It would be interesting to see if a math
instruction instead of d2f also hangs.

>
> By the way, the math code is too heavy handed as far as I know. The
> BDW+ docs say that the MATH instruction itself cannot take dependency
> control hints (and empirically earlier platforms seem to have problems
> with this as well, see
> tests/shaders/dependency-hints/exp2.shader_test) -- nothing about a
> math instruction being in the middle of a NoDDC* block. The person who
> implemented the math did the minimal amount of work to fix the
> problem.
>
> The PRM also says:
>
> """
> Instructions other than send, may use this control as long as
> operations that have different pipeline latencies are not mixed. The
> operations that have longer latencies are:
>
> Opcodes pln, lrp, dp*.
> Operations involving double precision computation.
> Integer DW multiplication where both source operands are DWs.
> """
>
> I would say that mixing a double-precision operation and something
> else might cause problems, but that seems like we have a different
> problem thus far.

Yeah, these are all just mov's so I would expect that section to
apply. It still seems like we're not taking it into account, though...
On Thu, Nov 19, 2015 at 2:07 PM, Connor Abbott <cwabbott0@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 1:54 PM, Matt Turner <mattst88@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
>>> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>>>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>>>
>>>>> It appears that not only math instructions, but also MOV_BYTES or
>>>>> any instruction that uses Align1 mode cannot be in the middle
>>>>> of a dependency control sequence or the GPU will hang (at least on my
>>>>> BDW). This fixes GPU hangs in some fp64 tests.
>>>>
>>>> I'm pretty surprised by this assessment. Doubtful even.
>>>>
>>>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> index 3bcd5cb..bc0a33b 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>>>     }
>>>>>
>>>>>     /*
>>>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>>>> +    */
>>>>> +
>>>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>>>
>>>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>>>> added it I made a test that did
>>>>
>>>>   vec4 foo = vec4(packUnorm4x8(...),
>>>>                   packUnorm4x8(...),
>>>>                   packUnorm4x8(...),
>>>>                   packUnorm4x8(...))
>>>>
>>>> and confirmed that it set depctrl properly on the whole sequence.
>>>> There could of course be bugs somewhere, but the "hardware doesn't
>>>> work if you mix align1 and align16 depctrl" seems unlikely.
>>>>
>>>> Do you know of a test that this affects?
>>>
>>> This only affects FP64 tests, since there we use an align1 mov to do
>>> double-to-float and float-to-double. However, I tried commenting out
>>> emit_nir_code() and just doing essentially:
>>>
>>> emit(MOV(...))->force_writemask_all = true;
>>> emit(VEC4_OPCODE_PACK_BYTES, ...);
>>> emit(MOV(...))->force_writemask_all = true;
>>>
>>> and on my BDW it hanged. In case it's not clear: this isn't about
>>> setting depctrl on the instruction itself, it just can't be inside of
>>> a depctrl sequence (which we were already disallowing for math
>>> instructions anyways).
>>
>> Very weird. I'll take a look. So I understand, are the MOV
>> instructions writing different channels of the same register? And
>> VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
>> as the MOVs? (I saw your fixup reply)
>
> Actually, I had them writing the same thing so the second overwrote
> the first one.

Err, just to be clear... in the actual test that led to me discovering
this, the instructions (not mov's but a bfi and a mov IIRC) were
writing different components of the same register. In my hacked-up
test to see if align1 in between a depctrl sequence was the problem, I
just had them both write to the same thing since it was easier. In any
case, I don't think it should matter too much.

> The PACK_BYTES/MOV_BYTES/F2D/D2F (I think I tested all
> of them, not sure) were operating on completely different registers,
> and in the FP64 test that actually hung the GPU they were as well.
> Using d2f since it's simpler and I remember what the operands are
> (it's just an align1 mov with a dest stride of 2), the test code was
> something like:
>
> mov g50, g51 { no_dd_clear }
> d2f g52, g54
> mov g50, g53 { no_dd_check }
>
> and changing the d2f to a normal align16 mov or commenting it out
> prevented the hang. It would be interesting to see if a math
> instruction instead of d2f also hangs.
>
>>
>> By the way, the math code is too heavy handed as far as I know. The
>> BDW+ docs say that the MATH instruction itself cannot take dependency
>> control hints (and empirically earlier platforms seem to have problems
>> with this as well, see
>> tests/shaders/dependency-hints/exp2.shader_test) -- nothing about a
>> math instruction being in the middle of a NoDDC* block. The person who
>> implemented the math did the minimal amount of work to fix the
>> problem.
>>
>> The PRM also says:
>>
>> """
>> Instructions other than send, may use this control as long as
>> operations that have different pipeline latencies are not mixed. The
>> operations that have longer latencies are:
>>
>> Opcodes pln, lrp, dp*.
>> Operations involving double precision computation.
>> Integer DW multiplication where both source operands are DWs.
>> """
>>
>> I would say that mixing a double-precision operation and something
>> else might cause problems, but that seems like we have a different
>> problem thus far.
>
> Yeah, these are all just mov's so I would expect that section to
> apply. It still seems like we're not taking it into account, though...
On Thu, Nov 19, 2015 at 11:35 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 2:07 PM, Connor Abbott <cwabbott0@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 1:54 PM, Matt Turner <mattst88@gmail.com> wrote:
>>> On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
>>>> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>>>>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>>>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>>>>
>>>>>> It appears that not only math instructions, but also MOV_BYTES or
>>>>>> any instruction that uses Align1 mode cannot be in the middle
>>>>>> of a dependency control sequence or the GPU will hang (at least on my
>>>>>> BDW). This fixes GPU hangs in some fp64 tests.
>>>>>
>>>>> I'm pretty surprised by this assessment. Doubtful even.
>>>>>
>>>>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>>>>> ---
>>>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> index 3bcd5cb..bc0a33b 100644
>>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>>>>     }
>>>>>>
>>>>>>     /*
>>>>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>>>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>>>>> +    */
>>>>>> +
>>>>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>>>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>>>>
>>>>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>>>>> added it I made a test that did
>>>>>
>>>>>   vec4 foo = vec4(packUnorm4x8(...),
>>>>>                   packUnorm4x8(...),
>>>>>                   packUnorm4x8(...),
>>>>>                   packUnorm4x8(...))
>>>>>
>>>>> and confirmed that it set depctrl properly on the whole sequence.
>>>>> There could of course be bugs somewhere, but the "hardware doesn't
>>>>> work if you mix align1 and align16 depctrl" seems unlikely.
>>>>>
>>>>> Do you know of a test that this affects?
>>>>
>>>> This only affects FP64 tests, since there we use an align1 mov to do
>>>> double-to-float and float-to-double. However, I tried commenting out
>>>> emit_nir_code() and just doing essentially:
>>>>
>>>> emit(MOV(...))->force_writemask_all = true;
>>>> emit(VEC4_OPCODE_PACK_BYTES, ...);
>>>> emit(MOV(...))->force_writemask_all = true;
>>>>
>>>> and on my BDW it hanged. In case it's not clear: this isn't about
>>>> setting depctrl on the instruction itself, it just can't be inside of
>>>> a depctrl sequence (which we were already disallowing for math
>>>> instructions anyways).
>>>
>>> Very weird. I'll take a look. So I understand, are the MOV
>>> instructions writing different channels of the same register? And
>>> VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
>>> as the MOVs? (I saw your fixup reply)
>>
>> Actually, I had them writing the same thing so the second overwrote
>> the first one.
>
> Err, just to be clear... in the actual test that led to me discovering
> this, the instructions (not mov's but a bfi and a mov IIRC) were
> writing different components of the same register. In my hacked-up
> test to see if align1 in between a depctrl sequence was the problem, I
> just had them both write to the same thing since it was easier. In any
> case, I don't think it should matter too much.

Interesting. If it was a BFI and a MOV writing different components of
the same register, I can see that failing. Though I haven't measured,
3-src instructions usually have 2+ extra cycles of latency and with a
2-cycle issue time, a BFI followed immediately by a MOV would mean
that both of them retired in the same cycle. That would seem bad if
one of them was supposed to signal the scoreboard.
×
On Thu, Nov 19, 2015 at 3:11 PM, Matt Turner <mattst88@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 11:35 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 2:07 PM, Connor Abbott <cwabbott0@gmail.com> wrote:
>>> On Thu, Nov 19, 2015 at 1:54 PM, Matt Turner <mattst88@gmail.com> wrote:
>>>> On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0@gmail.com> wrote:
>>>>> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88@gmail.com> wrote:
>>>>>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral@igalia.com> wrote:
>>>>>>> From: Connor Abbott <connor.w.abbott@intel.com>
>>>>>>>
>>>>>>> It appears that not only math instructions, but also MOV_BYTES or
>>>>>>> any instruction that uses Align1 mode cannot be in the middle
>>>>>>> of a dependency control sequence or the GPU will hang (at least on my
>>>>>>> BDW). This fixes GPU hangs in some fp64 tests.
>>>>>>
>>>>>> I'm pretty surprised by this assessment. Doubtful even.
>>>>>>
>>>>>>> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
>>>>>>> ---
>>>>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>>>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>>> index 3bcd5cb..bc0a33b 100644
>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>>>>>     }
>>>>>>>
>>>>>>>     /*
>>>>>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>>>>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>>>>>> +    */
>>>>>>> +
>>>>>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>>>>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>>>>>
>>>>>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>>>>>> added it I made a test that did
>>>>>>
>>>>>>   vec4 foo = vec4(packUnorm4x8(...),
>>>>>>                   packUnorm4x8(...),
>>>>>>                   packUnorm4x8(...),
>>>>>>                   packUnorm4x8(...))
>>>>>>
>>>>>> and confirmed that it set depctrl properly on the whole sequence.
>>>>>> There could of course be bugs somewhere, but the "hardware doesn't
>>>>>> work if you mix align1 and align16 depctrl" seems unlikely.
>>>>>>
>>>>>> Do you know of a test that this affects?
>>>>>
>>>>> This only affects FP64 tests, since there we use an align1 mov to do
>>>>> double-to-float and float-to-double. However, I tried commenting out
>>>>> emit_nir_code() and just doing essentially:
>>>>>
>>>>> emit(MOV(...))->force_writemask_all = true;
>>>>> emit(VEC4_OPCODE_PACK_BYTES, ...);
>>>>> emit(MOV(...))->force_writemask_all = true;
>>>>>
>>>>> and on my BDW it hanged. In case it's not clear: this isn't about
>>>>> setting depctrl on the instruction itself, it just can't be inside of
>>>>> a depctrl sequence (which we were already disallowing for math
>>>>> instructions anyways).
>>>>
>>>> Very weird. I'll take a look. So I understand, are the MOV
>>>> instructions writing different channels of the same register? And
>>>> VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
>>>> as the MOVs? (I saw your fixup reply)
>>>
>>> Actually, I had them writing the same thing so the second overwrote
>>> the first one.
>>
>> Err, just to be clear... in the actual test that led to me discovering
>> this, the instructions (not mov's but a bfi and a mov IIRC) were
>> writing different components of the same register. In my hacked-up
>> test to see if align1 in between a depctrl sequence was the problem, I
>> just had them both write to the same thing since it was easier. In any
>> case, I don't think it should matter too much.
>
> Interesting. If it was a BFI and a MOV writing different components of
> the same register, I can see that failing. Though I haven't measured,
> 3-src instructions usually have 2+ extra cycles of latency and with a
> 2-cycle issue time, a BFI followed immediately by a MOV would mean
> that both of them retired in the same cycle. That would seem bad if
> one of them was supposed to signal the scoreboard.
> ×

They weren't right next to each other, though -- there was a sequence
of 5-6 instructions between them, one of which happens to be an align1
mov which (according to my testing) hangs the GPU in between any
depctrl sequence.