[v3,01/42] intel/compiler: handle conversions between int and half-float on atom

Submitted by Iago Toral Quiroga on Jan. 15, 2019, 1:53 p.m.

Details

Message ID 20190115135414.2313-2-itoral@igalia.com
State New
Headers show
Series "intel: VK_KHR_shader_float16_int8 implementation" ( rev: 4 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga Jan. 15, 2019, 1:53 p.m.
v2: adapted to work with the new regioning lowering pass

Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
---
 src/intel/compiler/brw_ir_fs.h | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
index 3c23fb375e4..ba4d6a95720 100644
--- a/src/intel/compiler/brw_ir_fs.h
+++ b/src/intel/compiler/brw_ir_fs.h
@@ -497,9 +497,10 @@  is_unordered(const fs_inst *inst)
 }
 
 /**
- * Return whether the following regioning restriction applies to the specified
- * instruction.  From the Cherryview PRM Vol 7. "Register Region
- * Restrictions":
+ * Return whether one of the the following regioning restrictions apply to the
+ * specified instruction.
+ *
+ * From the Cherryview PRM Vol 7. "Register Region Restrictions":
  *
  * "When source or destination datatype is 64b or operation is integer DWord
  *  multiply, regioning in Align1 must follow these rules:
@@ -508,6 +509,14 @@  is_unordered(const fs_inst *inst)
  *  2. Regioning must ensure Src.Vstride = Src.Width * Src.Hstride.
  *  3. Source and Destination offset must be the same, except the case of
  *     scalar source."
+ *
+ * From the Cherryview PRM Vol 7. "Register Region Restrictions":
+ *
+ *    "Conversion between Integer and HF (Half Float) must be DWord
+ *     aligned and strided by a DWord on the destination."
+ *
+ *    The same restriction is listed for other hardware platforms, however,
+ *    empirical testing suggests that only atom platforms are affected.
  */
 static inline bool
 has_dst_aligned_region_restriction(const gen_device_info *devinfo,
@@ -518,10 +527,20 @@  has_dst_aligned_region_restriction(const gen_device_info *devinfo,
          (inst->opcode == BRW_OPCODE_MUL || inst->opcode == BRW_OPCODE_MAD);
 
    if (type_sz(inst->dst.type) > 4 || type_sz(exec_type) > 4 ||
-       (type_sz(exec_type) == 4 && is_int_multiply))
-      return devinfo->is_cherryview || gen_device_info_is_9lp(devinfo);
-   else
-      return false;
+       (type_sz(exec_type) == 4 && is_int_multiply)) {
+      if (devinfo->is_cherryview || gen_device_info_is_9lp(devinfo))
+         return true;
+   }
+
+   const bool dst_type_is_hf = inst->dst.type == BRW_REGISTER_TYPE_HF;
+   const bool exec_type_is_hf = exec_type == BRW_REGISTER_TYPE_HF;
+   if ((dst_type_is_hf && !brw_reg_type_is_floating_point(exec_type)) ||
+       (exec_type_is_hf && !brw_reg_type_is_floating_point(inst->dst.type))) {
+      if (devinfo->is_cherryview || gen_device_info_is_9lp(devinfo))
+         return true;
+   }
+
+   return false;
 }
 
 #endif

Comments

Iago Toral Quiroga <itoral@igalia.com> writes:

> v2: adapted to work with the new regioning lowering pass
>
> Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
> ---
>  src/intel/compiler/brw_ir_fs.h | 33 ++++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
> index 3c23fb375e4..ba4d6a95720 100644
> --- a/src/intel/compiler/brw_ir_fs.h
> +++ b/src/intel/compiler/brw_ir_fs.h
> @@ -497,9 +497,10 @@ is_unordered(const fs_inst *inst)
>  }
>  
>  /**
> - * Return whether the following regioning restriction applies to the specified
> - * instruction.  From the Cherryview PRM Vol 7. "Register Region
> - * Restrictions":
> + * Return whether one of the the following regioning restrictions apply to the
> + * specified instruction.
> + *
> + * From the Cherryview PRM Vol 7. "Register Region Restrictions":
>   *
>   * "When source or destination datatype is 64b or operation is integer DWord
>   *  multiply, regioning in Align1 must follow these rules:
> @@ -508,6 +509,14 @@ is_unordered(const fs_inst *inst)
>   *  2. Regioning must ensure Src.Vstride = Src.Width * Src.Hstride.
>   *  3. Source and Destination offset must be the same, except the case of
>   *     scalar source."
> + *
> + * From the Cherryview PRM Vol 7. "Register Region Restrictions":
> + *
> + *    "Conversion between Integer and HF (Half Float) must be DWord
> + *     aligned and strided by a DWord on the destination."
> + *
> + *    The same restriction is listed for other hardware platforms, however,
> + *    empirical testing suggests that only atom platforms are affected.
>   */
>  static inline bool
>  has_dst_aligned_region_restriction(const gen_device_info *devinfo,
> @@ -518,10 +527,20 @@ has_dst_aligned_region_restriction(const gen_device_info *devinfo,
>           (inst->opcode == BRW_OPCODE_MUL || inst->opcode == BRW_OPCODE_MAD);
>  
>     if (type_sz(inst->dst.type) > 4 || type_sz(exec_type) > 4 ||
> -       (type_sz(exec_type) == 4 && is_int_multiply))
> -      return devinfo->is_cherryview || gen_device_info_is_9lp(devinfo);
> -   else
> -      return false;
> +       (type_sz(exec_type) == 4 && is_int_multiply)) {
> +      if (devinfo->is_cherryview || gen_device_info_is_9lp(devinfo))
> +         return true;
> +   }
> +
> +   const bool dst_type_is_hf = inst->dst.type == BRW_REGISTER_TYPE_HF;
> +   const bool exec_type_is_hf = exec_type == BRW_REGISTER_TYPE_HF;
> +   if ((dst_type_is_hf && !brw_reg_type_is_floating_point(exec_type)) ||
> +       (exec_type_is_hf && !brw_reg_type_is_floating_point(inst->dst.type))) {
> +      if (devinfo->is_cherryview || gen_device_info_is_9lp(devinfo))
> +         return true;
> +   }

While looking into this closely, I'm seeing substantial divergence
between the behavior of the simulator, the hardware docs, and the
restriction this is implementing...  The docs are certainly inconsistent
about how and where this should be handled.

I'm suspecting that this restriction is more similar in nature to the
one referred to in the regioning lowering pass as
"is_narrowing_conversion", rather than the one handled by
has_dst_aligned_region_restriction().  Probably we don't need to change
this function nor the regioning pass for it to be honored, because that
restriction is already implemented.  I have a feeling that the reason
for this may be that the 16-bit pipeline lacks the ability to handle
conversions from or to half-float, so the execution type is implicitly
promoted to the matching (integer or floating-point) 32-bit type where
any HF conversion would be needed.  And on those the usual alignment
restriction of the destination to a larger execution type applies.  From
the hardware docs for CHV *only*:

| When single precision and half precision floats are mixed between
| source operands or between source and destination operand. In such
| cases, single precision float is the execution datatype.

This would mean that an "add dst:f, src:hf, src:hf" is really computed
with single precision (!).

The restriction you're quoting seems to be the following:

| BDW+
|
| Conversion between Integer and HF (Half Float) must be DWord-aligned
| and strided by a DWord on the destination.
|
| // Example:
| add (8) r10.0<2>:hf r11.0<8;8,1>:w r12.0<8;8,1>:w
| // Destination stride must be 2.
| mov (8) r10.0<2>:w r11.0<8;8,1>:hf
| // Destination stride must be 2.

However that restriction is apparently overriden on *most* projects
except for BDW (where you aren't applying any restriction at all) by the
following:

| Project:  CHV, SKL+
|
| There is a relaxed alignment rule for word destinations. When the
| destination type is word (UW, W, HF), destination data types can be
| aligned to either the lowest word or the second lowest word of the
| execution channel. This means the destination data words can be either
| all in the even word locations or all in the odd word locations.
| 
| // Example:
| add (8)  r10.0<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
| add (8)  r10.1<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
| 
| // Note: The destination offset may be .0 or .1 although the
| // destination subregister is required to be aligned to execution
| // datatype.

Which seems somewhat nonsensical to me, because interpreted strictly it
would rule out the possibility for any packed 16-bit operations...

To make the matter even crazier, the Gen8 simulator enforces the
restrictions on BDW *only*, in fact any time that there is a conversion
between half float and any other type, which is *almost* equivalent to
the rule you'd quoted above except for more stringent requirements on
the destination sub-register alignment of FP32 -> FP16 conversions that
apply to BDW only.

Whatever of the above is going on, this patch doesn't give us what we
want.  I think we want to drop it and fix get_exec_type() to promote the
execution type correctly.  I'll reply with a patch to do that in a
minute.

> +
> +   return false;
>  }
>  
>  #endif
> -- 
> 2.17.1
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Hi Curro, thanks for looking into this, I have a few comments inline:

On Tue, 2019-01-15 at 14:34 -0800, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral@igalia.com> writes:
> 
> > v2: adapted to work with the new regioning lowering pass
> > 
> > Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
> > ---
> >  src/intel/compiler/brw_ir_fs.h | 33 ++++++++++++++++++++++++++--
> > -----
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/src/intel/compiler/brw_ir_fs.h
> > b/src/intel/compiler/brw_ir_fs.h
> > index 3c23fb375e4..ba4d6a95720 100644
> > --- a/src/intel/compiler/brw_ir_fs.h
> > +++ b/src/intel/compiler/brw_ir_fs.h
> > @@ -497,9 +497,10 @@ is_unordered(const fs_inst *inst)
> >  }
> >  
> >  /**
> > - * Return whether the following regioning restriction applies to
> > the specified
> > - * instruction.  From the Cherryview PRM Vol 7. "Register Region
> > - * Restrictions":
> > + * Return whether one of the the following regioning restrictions
> > apply to the
> > + * specified instruction.
> > + *
> > + * From the Cherryview PRM Vol 7. "Register Region Restrictions":
> >   *
> >   * "When source or destination datatype is 64b or operation is
> > integer DWord
> >   *  multiply, regioning in Align1 must follow these rules:
> > @@ -508,6 +509,14 @@ is_unordered(const fs_inst *inst)
> >   *  2. Regioning must ensure Src.Vstride = Src.Width *
> > Src.Hstride.
> >   *  3. Source and Destination offset must be the same, except the
> > case of
> >   *     scalar source."
> > + *
> > + * From the Cherryview PRM Vol 7. "Register Region Restrictions":
> > + *
> > + *    "Conversion between Integer and HF (Half Float) must be
> > DWord
> > + *     aligned and strided by a DWord on the destination."
> > + *
> > + *    The same restriction is listed for other hardware platforms,
> > however,
> > + *    empirical testing suggests that only atom platforms are
> > affected.
> >   */
> >  static inline bool
> >  has_dst_aligned_region_restriction(const gen_device_info *devinfo,
> > @@ -518,10 +527,20 @@ has_dst_aligned_region_restriction(const
> > gen_device_info *devinfo,
> >           (inst->opcode == BRW_OPCODE_MUL || inst->opcode ==
> > BRW_OPCODE_MAD);
> >  
> >     if (type_sz(inst->dst.type) > 4 || type_sz(exec_type) > 4 ||
> > -       (type_sz(exec_type) == 4 && is_int_multiply))
> > -      return devinfo->is_cherryview ||
> > gen_device_info_is_9lp(devinfo);
> > -   else
> > -      return false;
> > +       (type_sz(exec_type) == 4 && is_int_multiply)) {
> > +      if (devinfo->is_cherryview ||
> > gen_device_info_is_9lp(devinfo))
> > +         return true;
> > +   }
> > +
> > +   const bool dst_type_is_hf = inst->dst.type ==
> > BRW_REGISTER_TYPE_HF;
> > +   const bool exec_type_is_hf = exec_type == BRW_REGISTER_TYPE_HF;
> > +   if ((dst_type_is_hf &&
> > !brw_reg_type_is_floating_point(exec_type)) ||
> > +       (exec_type_is_hf && !brw_reg_type_is_floating_point(inst-
> > >dst.type))) {
> > +      if (devinfo->is_cherryview ||
> > gen_device_info_is_9lp(devinfo))
> > +         return true;
> > +   }
> 
> While looking into this closely, I'm seeing substantial divergence
> between the behavior of the simulator, the hardware docs, and the
> restriction this is implementing...  The docs are certainly
> inconsistent
> about how and where this should be handled.
> 
> I'm suspecting that this restriction is more similar in nature to the
> one referred to in the regioning lowering pass as
> "is_narrowing_conversion", rather than the one handled by
> has_dst_aligned_region_restriction().  Probably we don't need to
> change
> this function nor the regioning pass for it to be honored, because
> that
> restriction is already implemented.  I have a feeling that the reason
> for this may be that the 16-bit pipeline lacks the ability to handle
> conversions from or to half-float, so the execution type is
> implicitly
> promoted to the matching (integer or floating-point) 32-bit type
> where
> any HF conversion would be needed.  And on those the usual alignment
> restriction of the destination to a larger execution type
> applies.  From
> the hardware docs for CHV *only*:
> 
> > When single precision and half precision floats are mixed between
> > source operands or between source and destination operand. In such
> > cases, single precision float is the execution datatype.
> 
> This would mean that an "add dst:f, src:hf, src:hf" is really
> computed
> with single precision (!).
> 
> The restriction you're quoting seems to be the following:
> 
> > BDW+
> > 
> > Conversion between Integer and HF (Half Float) must be DWord-
> > aligned
> > and strided by a DWord on the destination.
> > 
> > // Example:
> > add (8) r10.0<2>:hf r11.0<8;8,1>:w r12.0<8;8,1>:w
> > // Destination stride must be 2.
> > mov (8) r10.0<2>:w r11.0<8;8,1>:hf
> > // Destination stride must be 2.
> 
> However that restriction is apparently overriden on *most* projects
> except for BDW (where you aren't applying any restriction at all) by
> the
> following:

That's interesting and doesn't seem to match with my empirical
experience. Going by the results of the available CTS tests, it is
clear that this is required for atom (at least Braswell), we have
conversion tests between integer and half-float types that fail if we
don't honor this restriction. On the other hand, we haven't seen any
test failures on any other platform that we have tested (BDW, SKL,
KBL).

> > Project:  CHV, SKL+
> > 
> > There is a relaxed alignment rule for word destinations. When the
> > destination type is word (UW, W, HF), destination data types can be
> > aligned to either the lowest word or the second lowest word of the
> > execution channel. This means the destination data words can be
> > either
> > all in the even word locations or all in the odd word locations.
> > 
> > // Example:
> > add (8)  r10.0<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
> > add (8)  r10.1<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
> > 
> > // Note: The destination offset may be .0 or .1 although the
> > // destination subregister is required to be aligned to execution
> > // datatype.
> 
> Which seems somewhat nonsensical to me, because interpreted strictly
> it
> would rule out the possibility for any packed 16-bit operations...

I think this only applies in the context of mixed float and half-float
operations. We do emit packed half-float operations and we have not
observed any problems with that on any platform. Which makes me think
that when only half-float is involved, the execution type may not be
promoted to 32-bit.

> To make the matter even crazier, the Gen8 simulator enforces the
> restrictions on BDW *only*,

Yeah, this is really odd, because this is definitely needed on Braswell
too.

>  in fact any time that there is a conversion
> between half float and any other type, which is *almost* equivalent
> to
> the rule you'd quoted above except for more stringent requirements on
> the destination sub-register alignment of FP32 -> FP16 conversions
> that
> apply to BDW only.
> 
> Whatever of the above is going on, this patch doesn't give us what we
> want.  I think we want to drop it and fix get_exec_type() to promote
> the
> execution type correctly.  I'll reply with a patch to do that in a
> minute.

I understand that you mean promoting the execution type only when there
are mixed types involved and not when the instruction is exclusively
half-float on all its operands and destinations, right?

> > +
> > +   return false;
> >  }
> >  
> >  #endif
> > -- 
> > 2.17.1
> > 
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev