[5/6] intel/fs: Handle surface opcode sample masks via predication.

Submitted by Francisco Jerez on Feb. 27, 2018, 9:38 p.m.

Details

Message ID 20180227213828.9889-5-currojerez@riseup.net
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Feb. 27, 2018, 9:38 p.m.
The main motivation is to enable HDC surface opcodes on ICL which no
longer allows the sample mask to be provided in a message header, but
this is enabled all the way back to IVB when possible because it
decreases the instruction count of some shaders using HDC messages
significantly, e.g. one of the SynMark2 CSDof compute shaders
decreases instruction count by about 40% due to the removal of header
setup boilerplate which in turn makes a number of send message
payloads more easily CSE-able.  Shader-db results on SKL:

 total instructions in shared programs: 15325319 -> 15314384 (-0.07%)
 instructions in affected programs: 311532 -> 300597 (-3.51%)
 helped: 491
 HURT: 1

Shader-db results on BDW where the optimization needs to be disabled
in some cases due to hardware restrictions:

 total instructions in shared programs: 15604794 -> 15598028 (-0.04%)
 instructions in affected programs: 220863 -> 214097 (-3.06%)
 helped: 351
 HURT: 0

The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL
laptop with this change.
---
 src/intel/compiler/brw_fs.cpp | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 0b87d8ab14e..639432b4f49 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -4432,6 +4432,8 @@  static void
 lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
                            const fs_reg &sample_mask)
 {
+   const gen_device_info *devinfo = bld.shader->devinfo;
+
    /* Get the logical send arguments. */
    const fs_reg &addr = inst->src[0];
    const fs_reg &src = inst->src[1];
@@ -4442,7 +4444,20 @@  lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
    /* Calculate the total number of components of the payload. */
    const unsigned addr_sz = inst->components_read(0);
    const unsigned src_sz = inst->components_read(1);
-   const unsigned header_sz = (sample_mask.file == BAD_FILE ? 0 : 1);
+   /* From the BDW PRM Volume 7, page 147:
+    *
+    *  "For the Data Cache Data Port*, the header must be present for the
+    *   following message types: [...] Typed read/write/atomics"
+    *
+    * Earlier generations have a similar wording.  Because of this restriction
+    * we don't attempt to implement sample masks via predication for such
+    * messages prior to Gen9, since we have to provide a header anyway.  On
+    * Gen11+ the header has been removed so we can only use predication.
+    */
+   const unsigned header_sz = devinfo->gen < 9 &&
+                              (op == SHADER_OPCODE_TYPED_SURFACE_READ ||
+                               op == SHADER_OPCODE_TYPED_SURFACE_WRITE ||
+                               op == SHADER_OPCODE_TYPED_ATOMIC) ? 1 : 0;
    const unsigned sz = header_sz + addr_sz + src_sz;
 
    /* Allocate space for the payload. */
@@ -4462,6 +4477,31 @@  lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
 
    bld.LOAD_PAYLOAD(payload, components, sz, header_sz);
 
+   /* Predicate the instruction on the sample mask if no header is
+    * provided.
+    */
+   if (!header_sz && sample_mask.file != BAD_FILE &&
+       sample_mask.file != IMM) {
+      const fs_builder ubld = bld.group(1, 0).exec_all();
+      if (inst->predicate) {
+         assert(inst->predicate == BRW_PREDICATE_NORMAL);
+         assert(!inst->predicate_inverse);
+         assert(inst->flag_subreg < 2);
+         /* Combine the sample mask with the existing predicate by using a
+          * vertical predication mode.
+           */
+         inst->predicate = BRW_PREDICATE_ALIGN1_ALLV;
+         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg + 2),
+                         sample_mask.type),
+                  sample_mask);
+      } else {
+         inst->flag_subreg = 2;
+         inst->predicate = BRW_PREDICATE_NORMAL;
+         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg), sample_mask.type),
+                  sample_mask);
+      }
+   }
+
    /* Update the original instruction. */
    inst->opcode = op;
    inst->mlen = header_sz + (addr_sz + src_sz) * inst->exec_size / 8;

Comments

On Tuesday, February 27, 2018 1:38:27 PM PST Francisco Jerez wrote:
> The main motivation is to enable HDC surface opcodes on ICL which no
> longer allows the sample mask to be provided in a message header, but
> this is enabled all the way back to IVB when possible because it
> decreases the instruction count of some shaders using HDC messages
> significantly, e.g. one of the SynMark2 CSDof compute shaders
> decreases instruction count by about 40% due to the removal of header
> setup boilerplate which in turn makes a number of send message
> payloads more easily CSE-able.  Shader-db results on SKL:
> 
>  total instructions in shared programs: 15325319 -> 15314384 (-0.07%)
>  instructions in affected programs: 311532 -> 300597 (-3.51%)
>  helped: 491
>  HURT: 1
> 
> Shader-db results on BDW where the optimization needs to be disabled
> in some cases due to hardware restrictions:
> 
>  total instructions in shared programs: 15604794 -> 15598028 (-0.04%)
>  instructions in affected programs: 220863 -> 214097 (-3.06%)
>  helped: 351
>  HURT: 0
> 
> The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL
> laptop with this change.
> ---
>  src/intel/compiler/brw_fs.cpp | 42 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 0b87d8ab14e..639432b4f49 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -4432,6 +4432,8 @@ static void
>  lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>                             const fs_reg &sample_mask)
>  {
> +   const gen_device_info *devinfo = bld.shader->devinfo;
> +
>     /* Get the logical send arguments. */
>     const fs_reg &addr = inst->src[0];
>     const fs_reg &src = inst->src[1];
> @@ -4442,7 +4444,20 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>     /* Calculate the total number of components of the payload. */
>     const unsigned addr_sz = inst->components_read(0);
>     const unsigned src_sz = inst->components_read(1);
> -   const unsigned header_sz = (sample_mask.file == BAD_FILE ? 0 : 1);
> +   /* From the BDW PRM Volume 7, page 147:
> +    *
> +    *  "For the Data Cache Data Port*, the header must be present for the
> +    *   following message types: [...] Typed read/write/atomics"
> +    *
> +    * Earlier generations have a similar wording.  Because of this restriction
> +    * we don't attempt to implement sample masks via predication for such
> +    * messages prior to Gen9, since we have to provide a header anyway.  On
> +    * Gen11+ the header has been removed so we can only use predication.
> +    */
> +   const unsigned header_sz = devinfo->gen < 9 &&
> +                              (op == SHADER_OPCODE_TYPED_SURFACE_READ ||
> +                               op == SHADER_OPCODE_TYPED_SURFACE_WRITE ||
> +                               op == SHADER_OPCODE_TYPED_ATOMIC) ? 1 : 0;
>     const unsigned sz = header_sz + addr_sz + src_sz;
>  
>     /* Allocate space for the payload. */
> @@ -4462,6 +4477,31 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>  
>     bld.LOAD_PAYLOAD(payload, components, sz, header_sz);
>  
> +   /* Predicate the instruction on the sample mask if no header is
> +    * provided.
> +    */
> +   if (!header_sz && sample_mask.file != BAD_FILE &&
> +       sample_mask.file != IMM) {
> +      const fs_builder ubld = bld.group(1, 0).exec_all();
> +      if (inst->predicate) {
> +         assert(inst->predicate == BRW_PREDICATE_NORMAL);
> +         assert(!inst->predicate_inverse);
> +         assert(inst->flag_subreg < 2);
> +         /* Combine the sample mask with the existing predicate by using a
> +          * vertical predication mode.
> +           */
> +         inst->predicate = BRW_PREDICATE_ALIGN1_ALLV;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg + 2),
> +                         sample_mask.type),
> +                  sample_mask);

I was surprised to see flag_subreg remain unchanged here, but then I
re-read how allv works, and it does f0.0 & f1.0, or f0.1 & f1.1.  So,
we can leave it as 0 or 1 and it'll implicitly use 2 or 3 as well.

Series is:
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

> +      } else {
> +         inst->flag_subreg = 2;
> +         inst->predicate = BRW_PREDICATE_NORMAL;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg), sample_mask.type),
> +                  sample_mask);
> +      }
> +   }
> +
>     /* Update the original instruction. */
>     inst->opcode = op;
>     inst->mlen = header_sz + (addr_sz + src_sz) * inst->exec_size / 8;
>
Hi,

On 27.02.2018 23:38, Francisco Jerez wrote:
> The main motivation is to enable HDC surface opcodes on ICL which no
> longer allows the sample mask to be provided in a message header, but
> this is enabled all the way back to IVB when possible because it
> decreases the instruction count of some shaders using HDC messages
> significantly, e.g. one of the SynMark2 CSDof compute shaders
> decreases instruction count by about 40% due to the removal of header
> setup boilerplate which in turn makes a number of send message
> payloads more easily CSE-able.  Shader-db results on SKL:
> 
>   total instructions in shared programs: 15325319 -> 15314384 (-0.07%)
>   instructions in affected programs: 311532 -> 300597 (-3.51%)
>   helped: 491
>   HURT: 1
> 
> Shader-db results on BDW where the optimization needs to be disabled
> in some cases due to hardware restrictions:
> 
>   total instructions in shared programs: 15604794 -> 15598028 (-0.04%)
>   instructions in affected programs: 220863 -> 214097 (-3.06%)
>   helped: 351
>   HURT: 0
> 
> The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL
> laptop with this change.

I tested the series with our full benchmark set, on BYT, HSW GT2, BXT 
and SKL GT2.

CSDof improved by:
* 9% on BYT
* 7-8% on BXT J4205, and on SKL GT2 desktop

(Variance on HSW was too large in this test, to conclude anything.)

Changes in all the other tests were within daily variance.


Tested-By: Eero Tamminen <eero.t.tamminen@intel.com>


	- Eero
> ---
>   src/intel/compiler/brw_fs.cpp | 42 +++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 0b87d8ab14e..639432b4f49 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -4432,6 +4432,8 @@ static void
>   lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>                              const fs_reg &sample_mask)
>   {
> +   const gen_device_info *devinfo = bld.shader->devinfo;
> +
>      /* Get the logical send arguments. */
>      const fs_reg &addr = inst->src[0];
>      const fs_reg &src = inst->src[1];
> @@ -4442,7 +4444,20 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>      /* Calculate the total number of components of the payload. */
>      const unsigned addr_sz = inst->components_read(0);
>      const unsigned src_sz = inst->components_read(1);
> -   const unsigned header_sz = (sample_mask.file == BAD_FILE ? 0 : 1);
> +   /* From the BDW PRM Volume 7, page 147:
> +    *
> +    *  "For the Data Cache Data Port*, the header must be present for the
> +    *   following message types: [...] Typed read/write/atomics"
> +    *
> +    * Earlier generations have a similar wording.  Because of this restriction
> +    * we don't attempt to implement sample masks via predication for such
> +    * messages prior to Gen9, since we have to provide a header anyway.  On
> +    * Gen11+ the header has been removed so we can only use predication.
> +    */
> +   const unsigned header_sz = devinfo->gen < 9 &&
> +                              (op == SHADER_OPCODE_TYPED_SURFACE_READ ||
> +                               op == SHADER_OPCODE_TYPED_SURFACE_WRITE ||
> +                               op == SHADER_OPCODE_TYPED_ATOMIC) ? 1 : 0;
>      const unsigned sz = header_sz + addr_sz + src_sz;
>   
>      /* Allocate space for the payload. */
> @@ -4462,6 +4477,31 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>   
>      bld.LOAD_PAYLOAD(payload, components, sz, header_sz);
>   
> +   /* Predicate the instruction on the sample mask if no header is
> +    * provided.
> +    */
> +   if (!header_sz && sample_mask.file != BAD_FILE &&
> +       sample_mask.file != IMM) {
> +      const fs_builder ubld = bld.group(1, 0).exec_all();
> +      if (inst->predicate) {
> +         assert(inst->predicate == BRW_PREDICATE_NORMAL);
> +         assert(!inst->predicate_inverse);
> +         assert(inst->flag_subreg < 2);
> +         /* Combine the sample mask with the existing predicate by using a
> +          * vertical predication mode.
> +           */
> +         inst->predicate = BRW_PREDICATE_ALIGN1_ALLV;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg + 2),
> +                         sample_mask.type),
> +                  sample_mask);
> +      } else {
> +         inst->flag_subreg = 2;
> +         inst->predicate = BRW_PREDICATE_NORMAL;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg), sample_mask.type),
> +                  sample_mask);
> +      }
> +   }
> +
>      /* Update the original instruction. */
>      inst->opcode = op;
>      inst->mlen = header_sz + (addr_sz + src_sz) * inst->exec_size / 8;
>