[v4,20/40] intel/compiler: workaround for SIMD8 half-float MAD in gen8

Submitted by Iago Toral Quiroga on Feb. 12, 2019, 11:55 a.m.

Details

Message ID 20190212115607.21467-21-itoral@igalia.com
State New
Headers show
Series "intel: VK_KHR_shader_float16_int8 implementation" ( rev: 16 15 14 13 12 11 10 9 8 7 6 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga Feb. 12, 2019, 11:55 a.m.
Empirical testing shows that gen8 has a bug where MAD instructions with
a half-float source starting at a non-zero offset fail to execute
properly.

This scenario usually happened in SIMD8 executions, where we used to
pack vector components Y and W in the second half of SIMD registers
(therefore, with a 16B offset). It looks like we are not currently doing
this any more but this would handle the situation properly if we ever
happen to produce code like this again.

v2 (Jason):
 - Move this workaround to the lower_regioning pass as an additional case
   to has_invalid_src_region()
 - Do not apply the workaround if the stride of the source operand is 0,
   testing suggests the problem doesn't exist in that case.

Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
---
 src/intel/compiler/brw_fs_lower_regioning.cpp | 39 +++++++++++++------
 1 file changed, 28 insertions(+), 11 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp b/src/intel/compiler/brw_fs_lower_regioning.cpp
index df50993dee6..7c70cfab535 100644
--- a/src/intel/compiler/brw_fs_lower_regioning.cpp
+++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
@@ -109,20 +109,37 @@  namespace {
    has_invalid_src_region(const gen_device_info *devinfo, const fs_inst *inst,
                           unsigned i)
    {
-      if (is_unordered(inst)) {
+      if (is_unordered(inst))
          return false;
-      } else {
-         const unsigned dst_byte_stride = inst->dst.stride * type_sz(inst->dst.type);
-         const unsigned src_byte_stride = inst->src[i].stride *
-            type_sz(inst->src[i].type);
-         const unsigned dst_byte_offset = reg_offset(inst->dst) % REG_SIZE;
-         const unsigned src_byte_offset = reg_offset(inst->src[i]) % REG_SIZE;
 
-         return has_dst_aligned_region_restriction(devinfo, inst) &&
-                !is_uniform(inst->src[i]) &&
-                (src_byte_stride != dst_byte_stride ||
-                 src_byte_offset != dst_byte_offset);
+      /* Empirical testing shows that Broadwell has a bug affecting half-float
+       * MAD instructions when any of its sources has a non-zero offset, such
+       * as:
+       *
+       * mad(8) g18<1>HF -g17<4,4,1>HF g14.8<4,4,1>HF g11<4,4,1>HF { align16 1Q };
+       *
+       * We used to generate code like this for SIMD8 executions where we
+       * used to pack components Y and W of a vector at offset 16B of a SIMD
+       * register. The problem doesn't occur if the stride of the source is 0.
+       */
+      if (devinfo->gen == 8 &&
+          inst->opcode == BRW_OPCODE_MAD &&
+          inst->src[i].type == BRW_REGISTER_TYPE_HF &&
+          inst->src[i].offset > 0 &&
+          inst->src[i].stride != 0) {
+         return true;
       }
+
+      const unsigned dst_byte_stride = inst->dst.stride * type_sz(inst->dst.type);
+      const unsigned src_byte_stride = inst->src[i].stride *
+         type_sz(inst->src[i].type);
+      const unsigned dst_byte_offset = reg_offset(inst->dst) % REG_SIZE;
+      const unsigned src_byte_offset = reg_offset(inst->src[i]) % REG_SIZE;
+
+      return has_dst_aligned_region_restriction(devinfo, inst) &&
+             !is_uniform(inst->src[i]) &&
+             (src_byte_stride != dst_byte_stride ||
+              src_byte_offset != dst_byte_offset);
    }
 
    /*

Comments

On Tue, Feb 12, 2019 at 5:56 AM Iago Toral Quiroga <itoral@igalia.com>
wrote:

> Empirical testing shows that gen8 has a bug where MAD instructions with
> a half-float source starting at a non-zero offset fail to execute
> properly.
>
> This scenario usually happened in SIMD8 executions, where we used to
> pack vector components Y and W in the second half of SIMD registers
> (therefore, with a 16B offset). It looks like we are not currently doing
> this any more but this would handle the situation properly if we ever
> happen to produce code like this again.
>
> v2 (Jason):
>  - Move this workaround to the lower_regioning pass as an additional case
>    to has_invalid_src_region()
>  - Do not apply the workaround if the stride of the source operand is 0,
>    testing suggests the problem doesn't exist in that case.
>
> Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
> ---
>  src/intel/compiler/brw_fs_lower_regioning.cpp | 39 +++++++++++++------
>  1 file changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> index df50993dee6..7c70cfab535 100644
> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> @@ -109,20 +109,37 @@ namespace {
>     has_invalid_src_region(const gen_device_info *devinfo, const fs_inst
> *inst,
>                            unsigned i)
>     {
> -      if (is_unordered(inst)) {
> +      if (is_unordered(inst))
>           return false;
> -      } else {
> -         const unsigned dst_byte_stride = inst->dst.stride *
> type_sz(inst->dst.type);
> -         const unsigned src_byte_stride = inst->src[i].stride *
> -            type_sz(inst->src[i].type);
> -         const unsigned dst_byte_offset = reg_offset(inst->dst) %
> REG_SIZE;
> -         const unsigned src_byte_offset = reg_offset(inst->src[i]) %
> REG_SIZE;
>
> -         return has_dst_aligned_region_restriction(devinfo, inst) &&
> -                !is_uniform(inst->src[i]) &&
> -                (src_byte_stride != dst_byte_stride ||
> -                 src_byte_offset != dst_byte_offset);
> +      /* Empirical testing shows that Broadwell has a bug affecting
> half-float
> +       * MAD instructions when any of its sources has a non-zero offset,
> such
> +       * as:
> +       *
> +       * mad(8) g18<1>HF -g17<4,4,1>HF g14.8<4,4,1>HF g11<4,4,1>HF {
> align16 1Q };
> +       *
> +       * We used to generate code like this for SIMD8 executions where we
> +       * used to pack components Y and W of a vector at offset 16B of a
> SIMD
> +       * register. The problem doesn't occur if the stride of the source
> is 0.
> +       */
> +      if (devinfo->gen == 8 &&
> +          inst->opcode == BRW_OPCODE_MAD &&
> +          inst->src[i].type == BRW_REGISTER_TYPE_HF &&
> +          inst->src[i].offset > 0 &&
> +          inst->src[i].stride != 0) {
>

The above assumes the register is a GRF.  Perhaps we should make this
assumption explicit?  Or you can use some of curro's helpers and add
another one to get the subreg offset.  Also, the real problem here isn't
offset > 0, it's offset % REG_SIZE > 0.  If we have an array of 4 things,
they'll be at offsets 0, 16, 32, and 48.  We don't want an offset of 32
triggering it.


> +         return true;
>        }
> +
> +      const unsigned dst_byte_stride = inst->dst.stride *
> type_sz(inst->dst.type);
> +      const unsigned src_byte_stride = inst->src[i].stride *
> +         type_sz(inst->src[i].type);
> +      const unsigned dst_byte_offset = reg_offset(inst->dst) % REG_SIZE;
> +      const unsigned src_byte_offset = reg_offset(inst->src[i]) %
> REG_SIZE;
> +
> +      return has_dst_aligned_region_restriction(devinfo, inst) &&
> +             !is_uniform(inst->src[i]) &&
> +             (src_byte_stride != dst_byte_stride ||
> +              src_byte_offset != dst_byte_offset);
>     }
>
>     /*
> --
> 2.17.1
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
On Sat, 2019-02-16 at 09:02 -0600, Jason Ekstrand wrote:
> On Tue, Feb 12, 2019 at 5:56 AM Iago Toral Quiroga <itoral@igalia.com
> > wrote:
> > Empirical testing shows that gen8 has a bug where MAD instructions
> > with
> > 
> > a half-float source starting at a non-zero offset fail to execute
> > 
> > properly.
> > 
> > 
> > 
> > This scenario usually happened in SIMD8 executions, where we used
> > to
> > 
> > pack vector components Y and W in the second half of SIMD registers
> > 
> > (therefore, with a 16B offset). It looks like we are not currently
> > doing
> > 
> > this any more but this would handle the situation properly if we
> > ever
> > 
> > happen to produce code like this again.
> > 
> > 
> > 
> > v2 (Jason):
> > 
> >  - Move this workaround to the lower_regioning pass as an
> > additional case
> > 
> >    to has_invalid_src_region()
> > 
> >  - Do not apply the workaround if the stride of the source operand
> > is 0,
> > 
> >    testing suggests the problem doesn't exist in that case.
> > 
> > 
> > 
> > Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
> > 
> > ---
> > 
> >  src/intel/compiler/brw_fs_lower_regioning.cpp | 39 +++++++++++++
> > ------
> > 
> >  1 file changed, 28 insertions(+), 11 deletions(-)
> > 
> > 
> > 
> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> > b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > 
> > index df50993dee6..7c70cfab535 100644
> > 
> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> > 
> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > 
> > @@ -109,20 +109,37 @@ namespace {
> > 
> >     has_invalid_src_region(const gen_device_info *devinfo, const
> > fs_inst *inst,
> > 
> >                            unsigned i)
> > 
> >     {
> > 
> > -      if (is_unordered(inst)) {
> > 
> > +      if (is_unordered(inst))
> > 
> >           return false;
> > 
> > -      } else {
> > 
> > -         const unsigned dst_byte_stride = inst->dst.stride *
> > type_sz(inst->dst.type);
> > 
> > -         const unsigned src_byte_stride = inst->src[i].stride *
> > 
> > -            type_sz(inst->src[i].type);
> > 
> > -         const unsigned dst_byte_offset = reg_offset(inst->dst) %
> > REG_SIZE;
> > 
> > -         const unsigned src_byte_offset = reg_offset(inst->src[i]) 
> > % REG_SIZE;
> > 
> > 
> > 
> > -         return has_dst_aligned_region_restriction(devinfo, inst)
> > &&
> > 
> > -                !is_uniform(inst->src[i]) &&
> > 
> > -                (src_byte_stride != dst_byte_stride ||
> > 
> > -                 src_byte_offset != dst_byte_offset);
> > 
> > +      /* Empirical testing shows that Broadwell has a bug
> > affecting half-float
> > 
> > +       * MAD instructions when any of its sources has a non-zero
> > offset, such
> > 
> > +       * as:
> > 
> > +       *
> > 
> > +       * mad(8) g18<1>HF -g17<4,4,1>HF g14.8<4,4,1>HF g11<4,4,1>HF
> > { align16 1Q };
> > 
> > +       *
> > 
> > +       * We used to generate code like this for SIMD8 executions
> > where we
> > 
> > +       * used to pack components Y and W of a vector at offset 16B
> > of a SIMD
> > 
> > +       * register. The problem doesn't occur if the stride of the
> > source is 0.
> > 
> > +       */
> > 
> > +      if (devinfo->gen == 8 &&
> > 
> > +          inst->opcode == BRW_OPCODE_MAD &&
> > 
> > +          inst->src[i].type == BRW_REGISTER_TYPE_HF &&
> > 
> > +          inst->src[i].offset > 0 &&
> > 
> > +          inst->src[i].stride != 0) {
> 
> The above assumes the register is a GRF.  Perhaps we should make this
> assumption explicit?  Or you can use some of curro's helpers and add
> another one to get the subreg offset.  Also, the real problem here
> isn't offset > 0, it's offset % REG_SIZE > 0.  If we have an array of
> 4 things, they'll be at offsets 0, 16, 32, and 48.  We don't want an
> offset of 32 triggering it.
> 

You're right. We already have a helper available that does what we
want, reg_offset() in brw_ir_fs.h. I have this now:
      if (devinfo->gen == 8 &&          inst->opcode == BRW_OPCODE_MAD
&&          inst->src[i].type == BRW_REGISTER_TYPE_HF
&&          reg_offset(inst->src[i]) % REG_SIZE > 0 &&          inst-
>src[i].stride != 0) {         return true;      }
> > +         return true;
> > 
> >        }
> > 
> > +
> > 
> > +      const unsigned dst_byte_stride = inst->dst.stride *
> > type_sz(inst->dst.type);
> > 
> > +      const unsigned src_byte_stride = inst->src[i].stride *
> > 
> > +         type_sz(inst->src[i].type);
> > 
> > +      const unsigned dst_byte_offset = reg_offset(inst->dst) %
> > REG_SIZE;
> > 
> > +      const unsigned src_byte_offset = reg_offset(inst->src[i]) %
> > REG_SIZE;
> > 
> > +
> > 
> > +      return has_dst_aligned_region_restriction(devinfo, inst) &&
> > 
> > +             !is_uniform(inst->src[i]) &&
> > 
> > +             (src_byte_stride != dst_byte_stride ||
> > 
> > +              src_byte_offset != dst_byte_offset);
> > 
> >     }
> > 
> > 
> > 
> >     /*
> >