[Mesa-dev,3/5] i965/ir: Skip eliminate_find_live_channel() for stages with sparse thread dispatch.

Submitted by Francisco Jerez on Sept. 16, 2016, 10:03 p.m.

Details

Message ID 20160916220312.12988-3-currojerez@riseup.net
State New
Headers show
Series "Series without cover letter" ( rev: 2 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Sept. 16, 2016, 10:03 p.m.
The eliminate_find_live_channel optimization eliminates
FIND_LIVE_CHANNEL instructions in cases where control flow is known to
be uniform, and replaces them with 'MOV 0', which in turn unblocks
subsequent elimination of the BROADCAST instruction frequently used on
the result of FIND_LIVE_CHANNEL.  This is however not correct in
per-sample fragment shader dispatch because the PSD can dispatch a
fully unlit sample under certain conditions.  Disable the optimization
in that case.
---
 src/mesa/drivers/dri/i965/brw_compiler.h | 41 ++++++++++++++++++++++++++++++++
 src/mesa/drivers/dri/i965/brw_fs.cpp     |  8 +++++++
 src/mesa/drivers/dri/i965/brw_vec4.cpp   |  8 +++++++
 3 files changed, 57 insertions(+)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h b/src/mesa/drivers/dri/i965/brw_compiler.h
index 84d3dde..1429875 100644
--- a/src/mesa/drivers/dri/i965/brw_compiler.h
+++ b/src/mesa/drivers/dri/i965/brw_compiler.h
@@ -868,6 +868,47 @@  encode_slm_size(unsigned gen, uint32_t bytes)
    return slm_size;
 }
 
+/**
+ * Return true if the given shader stage is dispatched contiguously by the
+ * relevant fixed function starting from channel 0 of the SIMD thread, which
+ * implies that the dispatch mask of a thread can be assumed to have the form
+ * '2^n - 1' for some n.
+ */
+static inline bool
+brw_stage_has_packed_dispatch(gl_shader_stage stage,
+                              const struct brw_stage_prog_data *prog_data)
+{
+   switch (stage) {
+   case MESA_SHADER_FRAGMENT: {
+      /* The PSD discards subspans coming in with no lit samples, which in the
+       * per-pixel shading case implies that each subspan will either be fully
+       * lit (due to the VMask being used to allow derivative computations),
+       * or not dispatched at all.  In per-sample dispatch mode individual
+       * samples from the same subspan have a fixed relative location within
+       * the SIMD thread, so dispatch of unlit samples cannot be avoided in
+       * general and we should return false.
+       */
+      const struct brw_wm_prog_data *wm_prog_data =
+         (const struct brw_wm_prog_data *)prog_data;
+      return !wm_prog_data->persample_dispatch;
+   }
+   case MESA_SHADER_COMPUTE:
+      /* Compute shaders will be spawned with either a fully enabled dispatch
+       * mask or with whatever bottom/right execution mask was given to the
+       * GPGPU walker command to be used along the workgroup edges -- In both
+       * cases the dispatch mask is required to be tightly packed for our
+       * invocation index calculations to work.
+       */
+      return true;
+   default:
+      /* Most remaining fixed functions are limited to use a packed dispatch
+       * mask due to the hardware representation of the dispatch mask as a
+       * single counter representing the number of enabled channels.
+       */
+      return true;
+   }
+}
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
index bb65077..32f7ae2 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -2835,6 +2835,14 @@  fs_visitor::eliminate_find_live_channel()
    bool progress = false;
    unsigned depth = 0;
 
+   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
+      /* The optimization below assumes that channel zero is live on thread
+       * dispatch, which may not be the case if the fixed function dispatches
+       * threads sparsely.
+       */
+      return progress;
+   }
+
    foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
       switch (inst->opcode) {
       case BRW_OPCODE_IF:
diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index 58c8a8a..d5bb82b 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -1291,6 +1291,14 @@  vec4_visitor::eliminate_find_live_channel()
    bool progress = false;
    unsigned depth = 0;
 
+   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
+      /* The optimization below assumes that channel zero is live on thread
+       * dispatch, which may not be the case if the fixed function dispatches
+       * threads sparsely.
+       */
+      return progress;
+   }
+
    foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
       switch (inst->opcode) {
       case BRW_OPCODE_IF:

Comments

On Fri, Sep 16, 2016 at 3:03 PM, Francisco Jerez <currojerez@riseup.net>
wrote:

> The eliminate_find_live_channel optimization eliminates
> FIND_LIVE_CHANNEL instructions in cases where control flow is known to
> be uniform, and replaces them with 'MOV 0', which in turn unblocks
> subsequent elimination of the BROADCAST instruction frequently used on
> the result of FIND_LIVE_CHANNEL.  This is however not correct in
> per-sample fragment shader dispatch because the PSD can dispatch a
> fully unlit sample under certain conditions.  Disable the optimization
> in that case.
> ---
>  src/mesa/drivers/dri/i965/brw_compiler.h | 41
> ++++++++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_fs.cpp     |  8 +++++++
>  src/mesa/drivers/dri/i965/brw_vec4.cpp   |  8 +++++++
>  3 files changed, 57 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h
> b/src/mesa/drivers/dri/i965/brw_compiler.h
> index 84d3dde..1429875 100644
> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
> @@ -868,6 +868,47 @@ encode_slm_size(unsigned gen, uint32_t bytes)
>     return slm_size;
>  }
>
> +/**
> + * Return true if the given shader stage is dispatched contiguously by the
> + * relevant fixed function starting from channel 0 of the SIMD thread,
> which
> + * implies that the dispatch mask of a thread can be assumed to have the
> form
> + * '2^n - 1' for some n.
> + */
> +static inline bool
> +brw_stage_has_packed_dispatch(gl_shader_stage stage,
> +                              const struct brw_stage_prog_data *prog_data)
>

Thank you, thank you, thank you for making this a well-documented helper
function!


> +{
> +   switch (stage) {
> +   case MESA_SHADER_FRAGMENT: {
> +      /* The PSD discards subspans coming in with no lit samples, which
> in the
> +       * per-pixel shading case implies that each subspan will either be
> fully
> +       * lit (due to the VMask being used to allow derivative
> computations),
> +       * or not dispatched at all.  In per-sample dispatch mode individual
> +       * samples from the same subspan have a fixed relative location
> within
> +       * the SIMD thread, so dispatch of unlit samples cannot be avoided
> in
> +       * general and we should return false.
> +       */
> +      const struct brw_wm_prog_data *wm_prog_data =
> +         (const struct brw_wm_prog_data *)prog_data;
> +      return !wm_prog_data->persample_dispatch;
> +   }
> +   case MESA_SHADER_COMPUTE:
> +      /* Compute shaders will be spawned with either a fully enabled
> dispatch
> +       * mask or with whatever bottom/right execution mask was given to
> the
> +       * GPGPU walker command to be used along the workgroup edges -- In
> both
> +       * cases the dispatch mask is required to be tightly packed for our
> +       * invocation index calculations to work.
> +       */
> +      return true;
> +   default:
> +      /* Most remaining fixed functions are limited to use a packed
> dispatch
> +       * mask due to the hardware representation of the dispatch mask as a
> +       * single counter representing the number of enabled channels.
> +       */
> +      return true;
> +   }
> +}
> +
>  #ifdef __cplusplus
>  } /* extern "C" */
>  #endif
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index bb65077..32f7ae2 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -2835,6 +2835,14 @@ fs_visitor::eliminate_find_live_channel()
>     bool progress = false;
>     unsigned depth = 0;
>
> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
> +      /* The optimization below assumes that channel zero is live on
> thread
> +       * dispatch, which may not be the case if the fixed function
> dispatches
> +       * threads sparsely.
> +       */
> +      return progress;
>

Maybe just return false?


> +   }
> +
>     foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>        switch (inst->opcode) {
>        case BRW_OPCODE_IF:
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index 58c8a8a..d5bb82b 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -1291,6 +1291,14 @@ vec4_visitor::eliminate_find_live_channel()
>     bool progress = false;
>     unsigned depth = 0;
>
> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
> +      /* The optimization below assumes that channel zero is live on
> thread
> +       * dispatch, which may not be the case if the fixed function
> dispatches
> +       * threads sparsely.
> +       */
> +      return progress;
>

Same here.


> +   }
> +
>     foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
>        switch (inst->opcode) {
>        case BRW_OPCODE_IF:
> --
> 2.9.0
>
>
Jason Ekstrand <jason@jlekstrand.net> writes:

> On Fri, Sep 16, 2016 at 3:03 PM, Francisco Jerez <currojerez@riseup.net>
> wrote:
>
>> The eliminate_find_live_channel optimization eliminates
>> FIND_LIVE_CHANNEL instructions in cases where control flow is known to
>> be uniform, and replaces them with 'MOV 0', which in turn unblocks
>> subsequent elimination of the BROADCAST instruction frequently used on
>> the result of FIND_LIVE_CHANNEL.  This is however not correct in
>> per-sample fragment shader dispatch because the PSD can dispatch a
>> fully unlit sample under certain conditions.  Disable the optimization
>> in that case.
>> ---
>>  src/mesa/drivers/dri/i965/brw_compiler.h | 41
>> ++++++++++++++++++++++++++++++++
>>  src/mesa/drivers/dri/i965/brw_fs.cpp     |  8 +++++++
>>  src/mesa/drivers/dri/i965/brw_vec4.cpp   |  8 +++++++
>>  3 files changed, 57 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h
>> b/src/mesa/drivers/dri/i965/brw_compiler.h
>> index 84d3dde..1429875 100644
>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>> @@ -868,6 +868,47 @@ encode_slm_size(unsigned gen, uint32_t bytes)
>>     return slm_size;
>>  }
>>
>> +/**
>> + * Return true if the given shader stage is dispatched contiguously by the
>> + * relevant fixed function starting from channel 0 of the SIMD thread,
>> which
>> + * implies that the dispatch mask of a thread can be assumed to have the
>> form
>> + * '2^n - 1' for some n.
>> + */
>> +static inline bool
>> +brw_stage_has_packed_dispatch(gl_shader_stage stage,
>> +                              const struct brw_stage_prog_data *prog_data)
>>
>
> Thank you, thank you, thank you for making this a well-documented helper
> function!
>

Given the amount of hardware documentation about this, any documentation
is too little documentation. ;)

>
>> +{
>> +   switch (stage) {
>> +   case MESA_SHADER_FRAGMENT: {
>> +      /* The PSD discards subspans coming in with no lit samples, which
>> in the
>> +       * per-pixel shading case implies that each subspan will either be
>> fully
>> +       * lit (due to the VMask being used to allow derivative
>> computations),
>> +       * or not dispatched at all.  In per-sample dispatch mode individual
>> +       * samples from the same subspan have a fixed relative location
>> within
>> +       * the SIMD thread, so dispatch of unlit samples cannot be avoided
>> in
>> +       * general and we should return false.
>> +       */
>> +      const struct brw_wm_prog_data *wm_prog_data =
>> +         (const struct brw_wm_prog_data *)prog_data;
>> +      return !wm_prog_data->persample_dispatch;
>> +   }
>> +   case MESA_SHADER_COMPUTE:
>> +      /* Compute shaders will be spawned with either a fully enabled
>> dispatch
>> +       * mask or with whatever bottom/right execution mask was given to
>> the
>> +       * GPGPU walker command to be used along the workgroup edges -- In
>> both
>> +       * cases the dispatch mask is required to be tightly packed for our
>> +       * invocation index calculations to work.
>> +       */
>> +      return true;
>> +   default:
>> +      /* Most remaining fixed functions are limited to use a packed
>> dispatch
>> +       * mask due to the hardware representation of the dispatch mask as a
>> +       * single counter representing the number of enabled channels.
>> +       */
>> +      return true;
>> +   }
>> +}
>> +
>>  #ifdef __cplusplus
>>  } /* extern "C" */
>>  #endif
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index bb65077..32f7ae2 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -2835,6 +2835,14 @@ fs_visitor::eliminate_find_live_channel()
>>     bool progress = false;
>>     unsigned depth = 0;
>>
>> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
>> +      /* The optimization below assumes that channel zero is live on
>> thread
>> +       * dispatch, which may not be the case if the fixed function
>> dispatches
>> +       * threads sparsely.
>> +       */
>> +      return progress;
>>
>
> Maybe just return false?
>

Sure, don't have a strong preference, changed locally.

>
>> +   }
>> +
>>     foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>        switch (inst->opcode) {
>>        case BRW_OPCODE_IF:
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> index 58c8a8a..d5bb82b 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> @@ -1291,6 +1291,14 @@ vec4_visitor::eliminate_find_live_channel()
>>     bool progress = false;
>>     unsigned depth = 0;
>>
>> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
>> +      /* The optimization below assumes that channel zero is live on
>> thread
>> +       * dispatch, which may not be the case if the fixed function
>> dispatches
>> +       * threads sparsely.
>> +       */
>> +      return progress;
>>
>
> Same here.
>
>
>> +   }
>> +
>>     foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
>>        switch (inst->opcode) {
>>        case BRW_OPCODE_IF:
>> --
>> 2.9.0
>>
>>