[3/5] intel/fs: Cap dst-aligned region stride to maximum representable hstride value.

Submitted by Francisco Jerez on Jan. 19, 2019, 12:09 a.m.

Details

Message ID 20190119000902.11597-3-currojerez@riseup.net
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Jan. 19, 2019, 12:09 a.m.
This is required in combination with the following commit, because
otherwise if a source region with an extended 8+ stride is present in
the instruction (which we're about to declare legal) we'll end up
emitting code that attempts to write to such a region, even though
strides greater than four are still illegal for the destination.
---
 src/intel/compiler/brw_fs_lower_regioning.cpp | 20 ++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp b/src/intel/compiler/brw_fs_lower_regioning.cpp
index 6a3c23892b4..b86e95ed9eb 100644
--- a/src/intel/compiler/brw_fs_lower_regioning.cpp
+++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
@@ -71,15 +71,25 @@  namespace {
           !is_byte_raw_mov(inst)) {
          return get_exec_type_size(inst);
       } else {
-         unsigned stride = inst->dst.stride * type_sz(inst->dst.type);
+         /* Calculate the maximum byte stride and the minimum type size across
+          * all source and destination operands.
+          */
+         unsigned max_stride = inst->dst.stride * type_sz(inst->dst.type);
+         unsigned min_size = type_sz(inst->dst.type);
 
          for (unsigned i = 0; i < inst->sources; i++) {
-            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
-               stride = MAX2(stride, inst->src[i].stride *
-                             type_sz(inst->src[i].type));
+            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i)) {
+               max_stride = MAX2(max_stride, inst->src[i].stride *
+                                 type_sz(inst->src[i].type));
+               min_size = MIN2(min_size, type_sz(inst->src[i].type));
+            }
          }
 
-         return stride;
+         /* Attempt to use the largest byte stride among all present operands,
+          * but never exceed a stride of 4 since that would lead to illegal
+          * destination regions during lowering.
+          */
+         return MIN2(max_stride, 4 * min_size);
       }
    }
 

Comments

On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net>
wrote:

> This is required in combination with the following commit, because
> otherwise if a source region with an extended 8+ stride is present in
> the instruction (which we're about to declare legal) we'll end up
> emitting code that attempts to write to such a region, even though
> strides greater than four are still illegal for the destination.
> ---
>  src/intel/compiler/brw_fs_lower_regioning.cpp | 20 ++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> index 6a3c23892b4..b86e95ed9eb 100644
> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> @@ -71,15 +71,25 @@ namespace {
>            !is_byte_raw_mov(inst)) {
>           return get_exec_type_size(inst);
>        } else {
> -         unsigned stride = inst->dst.stride * type_sz(inst->dst.type);
> +         /* Calculate the maximum byte stride and the minimum type size
> across
> +          * all source and destination operands.
> +          */
> +         unsigned max_stride = inst->dst.stride * type_sz(inst->dst.type);
> +         unsigned min_size = type_sz(inst->dst.type);
>
>           for (unsigned i = 0; i < inst->sources; i++) {
> -            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
> -               stride = MAX2(stride, inst->src[i].stride *
> -                             type_sz(inst->src[i].type));
> +            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
> {
> +               max_stride = MAX2(max_stride, inst->src[i].stride *
> +                                 type_sz(inst->src[i].type));
> +               min_size = MIN2(min_size, type_sz(inst->src[i].type));
> +            }
>           }
>
> -         return stride;
> +         /* Attempt to use the largest byte stride among all present
> operands,
> +          * but never exceed a stride of 4 since that would lead to
> illegal
> +          * destination regions during lowering.
> +          */
> +         return MIN2(max_stride, 4 * min_size);
>

Why not just fall back to tightly packed in this case?  I think I can
answer my own question:  Because using something that's equal to one of the
strides reduces the liklihood that we'll need a temporary.  If that's the
correct answer, then maybe what we want is the maximum of all strides with
stride_in_bytes <= 4 * type_sz?

--Jason


>        }
>     }
>
> --
> 2.19.2
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
Jason Ekstrand <jason@jlekstrand.net> writes:

> On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net>
> wrote:
>
>> This is required in combination with the following commit, because
>> otherwise if a source region with an extended 8+ stride is present in
>> the instruction (which we're about to declare legal) we'll end up
>> emitting code that attempts to write to such a region, even though
>> strides greater than four are still illegal for the destination.
>> ---
>>  src/intel/compiler/brw_fs_lower_regioning.cpp | 20 ++++++++++++++-----
>>  1 file changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
>> b/src/intel/compiler/brw_fs_lower_regioning.cpp
>> index 6a3c23892b4..b86e95ed9eb 100644
>> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
>> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
>> @@ -71,15 +71,25 @@ namespace {
>>            !is_byte_raw_mov(inst)) {
>>           return get_exec_type_size(inst);
>>        } else {
>> -         unsigned stride = inst->dst.stride * type_sz(inst->dst.type);
>> +         /* Calculate the maximum byte stride and the minimum type size
>> across
>> +          * all source and destination operands.
>> +          */
>> +         unsigned max_stride = inst->dst.stride * type_sz(inst->dst.type);
>> +         unsigned min_size = type_sz(inst->dst.type);
>>
>>           for (unsigned i = 0; i < inst->sources; i++) {
>> -            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
>> -               stride = MAX2(stride, inst->src[i].stride *
>> -                             type_sz(inst->src[i].type));
>> +            if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
>> {
>> +               max_stride = MAX2(max_stride, inst->src[i].stride *
>> +                                 type_sz(inst->src[i].type));
>> +               min_size = MIN2(min_size, type_sz(inst->src[i].type));
>> +            }
>>           }
>>
>> -         return stride;
>> +         /* Attempt to use the largest byte stride among all present
>> operands,
>> +          * but never exceed a stride of 4 since that would lead to
>> illegal
>> +          * destination regions during lowering.
>> +          */
>> +         return MIN2(max_stride, 4 * min_size);
>>
>
> Why not just fall back to tightly packed in this case?  I think I can
> answer my own question:  Because using something that's equal to one of the
> strides reduces the liklihood that we'll need a temporary.  If that's the
> correct answer, then maybe what we want is the maximum of all strides with
> stride_in_bytes <= 4 * type_sz?
>

We also want the result to be greater than or equal to the size of the
largest non-uniform, non-control source type, since packing a vector of
such a type into a temporary of lower byte stride than its size is
impossible.  This patch guarantees that as long as max_size <= 4 *
min_size, which is necessary for the lowering code that calls this
function to work at all.

It would be possible to preserve this guarantee while attempting to pick
one of the strides of the pre-existing sources as you say -- I would be
happy to review that change as a follow-up micro-optimization patch, but
there are some corner cases to consider I don't necessarily want to
bother with in the patch doing the functional change, for the sake of
bisectability.

It may make sense to add an assert here that max_size <= 4 * min_size
for the case such an instruction doesn't blow up already at the EU
validator (it doesn't look like the validator is currently enforcing the
lack of conversions between 8 and 64 bit types?), it will just involve
calculating max_size in addition to max_stride and min_size above.

> --Jason
>
>
>>        }
>>     }
>>
>> --
>> 2.19.2
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
On Thu, Feb 14, 2019 at 4:53 PM Francisco Jerez <currojerez@riseup.net>
wrote:

> Jason Ekstrand <jason@jlekstrand.net> writes:
>
> > On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net>
> > wrote:
> >
> >> This is required in combination with the following commit, because
> >> otherwise if a source region with an extended 8+ stride is present in
> >> the instruction (which we're about to declare legal) we'll end up
> >> emitting code that attempts to write to such a region, even though
> >> strides greater than four are still illegal for the destination.
> >> ---
> >>  src/intel/compiler/brw_fs_lower_regioning.cpp | 20 ++++++++++++++-----
> >>  1 file changed, 15 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> >> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> >> index 6a3c23892b4..b86e95ed9eb 100644
> >> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> >> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> >> @@ -71,15 +71,25 @@ namespace {
> >>            !is_byte_raw_mov(inst)) {
> >>           return get_exec_type_size(inst);
> >>        } else {
> >> -         unsigned stride = inst->dst.stride * type_sz(inst->dst.type);
> >> +         /* Calculate the maximum byte stride and the minimum type size
> >> across
> >> +          * all source and destination operands.
> >> +          */
> >> +         unsigned max_stride = inst->dst.stride *
> type_sz(inst->dst.type);
> >> +         unsigned min_size = type_sz(inst->dst.type);
> >>
> >>           for (unsigned i = 0; i < inst->sources; i++) {
> >> -            if (!is_uniform(inst->src[i]) &&
> !inst->is_control_source(i))
> >> -               stride = MAX2(stride, inst->src[i].stride *
> >> -                             type_sz(inst->src[i].type));
> >> +            if (!is_uniform(inst->src[i]) &&
> !inst->is_control_source(i))
> >> {
> >> +               max_stride = MAX2(max_stride, inst->src[i].stride *
> >> +                                 type_sz(inst->src[i].type));
> >> +               min_size = MIN2(min_size, type_sz(inst->src[i].type));
> >> +            }
> >>           }
> >>
> >> -         return stride;
> >> +         /* Attempt to use the largest byte stride among all present
> >> operands,
> >> +          * but never exceed a stride of 4 since that would lead to
> >> illegal
> >> +          * destination regions during lowering.
> >> +          */
> >> +         return MIN2(max_stride, 4 * min_size);
> >>
> >
> > Why not just fall back to tightly packed in this case?  I think I can
> > answer my own question:  Because using something that's equal to one of
> the
> > strides reduces the liklihood that we'll need a temporary.  If that's the
> > correct answer, then maybe what we want is the maximum of all strides
> with
> > stride_in_bytes <= 4 * type_sz?
> >
>
> We also want the result to be greater than or equal to the size of the
> largest non-uniform, non-control source type, since packing a vector of
> such a type into a temporary of lower byte stride than its size is
> impossible.  This patch guarantees that as long as max_size <= 4 *
> min_size, which is necessary for the lowering code that calls this
> function to work at all.
>
> It would be possible to preserve this guarantee while attempting to pick
> one of the strides of the pre-existing sources as you say -- I would be
> happy to review that change as a follow-up micro-optimization patch, but
> there are some corner cases to consider I don't necessarily want to
> bother with in the patch doing the functional change, for the sake of
> bisectability.
>

That's fine with me.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>


> It may make sense to add an assert here that max_size <= 4 * min_size
> for the case such an instruction doesn't blow up already at the EU
> validator (it doesn't look like the validator is currently enforcing the
> lack of conversions between 8 and 64 bit types?), it will just involve
> calculating max_size in addition to max_stride and min_size above.
>
> > --Jason
> >
> >
> >>        }
> >>     }
> >>
> >> --
> >> 2.19.2
> >>
> >> _______________________________________________
> >> mesa-dev mailing list
> >> mesa-dev@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>
>