[4/5] intel/fs: Implement extended strides greater than 4 for IR source regions.

Submitted by Francisco Jerez on Jan. 19, 2019, 12:09 a.m.

Details

Message ID 20190119000902.11597-4-currojerez@riseup.net
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Jan. 19, 2019, 12:09 a.m.
Strides up to 32B can be implemented for the source regions of most
instructions by leveraging either the vertical or the horizontal
stride of the hardware Align1 region.  The main motivation for this is
that currently the lower_integer_multiplication() pass will happily
double the stride of one of the 32-bit sources, which can blow up if
the stride of the original source was already the maximum value
allowed by the hardware.

An alternative would be to use the regioning legalization pass in
order to lower such strides into the composition of multiple legal
strides, but that would be somewhat less efficient.

This showed up as a regression from my commit cbea91eb57a501bebb1ca2
in Vulkan 1.1 CTS tests on CHV/BXT platforms, however it was really a
pre-existing problem that had affected conformance on other platforms
without native support for integer multiplication.  CHV/BXT were
getting around it because the code I removed in that commit had the
"fortunate" side effect of emitting narrower regions that didn't hit
the hardware stride limit after lowering.  Beyond fixing the
regression this fixes ~90 additional Vulkan 1.1 subgroup CTS tests on
ICL (that's why this patch is marked for inclusion in mesa-stable even
though the original regressing patch was not).

Cc: mesa-stable@lists.freedesktop.org
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109328
Reported-by: Mark Janes <mark.a.janes@intel.com>
---
 src/intel/compiler/brw_fs_generator.cpp | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_fs_generator.cpp b/src/intel/compiler/brw_fs_generator.cpp
index 5fc6cf5f8cc..b169eacf15b 100644
--- a/src/intel/compiler/brw_fs_generator.cpp
+++ b/src/intel/compiler/brw_fs_generator.cpp
@@ -90,9 +90,17 @@  brw_reg_from_fs_reg(const struct gen_device_info *devinfo, fs_inst *inst,
           *       different execution size when the number of components
           *       written to each destination GRF is not the same.
           */
-         const unsigned width = MIN2(reg_width, phys_width);
-         brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg), reg->nr, 0);
-         brw_reg = stride(brw_reg, width * reg->stride, width, reg->stride);
+         if (reg->stride > 4) {
+            assert(reg != &inst->dst);
+            assert(reg->stride * type_sz(reg->type) <= REG_SIZE);
+            brw_reg = brw_vecn_reg(1, brw_file_from_reg(reg), reg->nr, 0);
+            brw_reg = stride(brw_reg, reg->stride, 1, 0);
+
+         } else {
+            const unsigned width = MIN2(reg_width, phys_width);
+            brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg), reg->nr, 0);
+            brw_reg = stride(brw_reg, width * reg->stride, width, reg->stride);
+         }
 
          if (devinfo->gen == 7 && !devinfo->is_haswell) {
             /* From the IvyBridge PRM (EU Changes by Processor Generation, page 13):

Comments

On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net>
wrote:

> Strides up to 32B can be implemented for the source regions of most
> instructions by leveraging either the vertical or the horizontal
> stride of the hardware Align1 region.  The main motivation for this is
> that currently the lower_integer_multiplication() pass will happily
> double the stride of one of the 32-bit sources, which can blow up if
> the stride of the original source was already the maximum value
> allowed by the hardware.
>

I thought this looked familiar so I did some digging...

On Nov 2 of 2017, I wrote almost exactly this same patch which was
committed on Nov 7 as e8c9e65185de3e821e1
On Nov 14, Matt reverted it in a31d0382084c8aa8 because it wasn't needed
anymore and he wasn't sure of its correctness.

And here we are again....

I still believe it to be correct so it is

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

My one major request is that you include some of the history of this change
in the commit message.  As far as the patch itself goes, it's identical to
mine except for the unneeded whitespace change and one additional assert
which I believe to be a good addition.

I've also CC'd matt in case he wants to throw in his $.02

--Jason

An alternative would be to use the regioning legalization pass in
> order to lower such strides into the composition of multiple legal
> strides, but that would be somewhat less efficient.
>
> This showed up as a regression from my commit cbea91eb57a501bebb1ca2
> in Vulkan 1.1 CTS tests on CHV/BXT platforms, however it was really a
> pre-existing problem that had affected conformance on other platforms
> without native support for integer multiplication.  CHV/BXT were
> getting around it because the code I removed in that commit had the
> "fortunate" side effect of emitting narrower regions that didn't hit
> the hardware stride limit after lowering.  Beyond fixing the
> regression this fixes ~90 additional Vulkan 1.1 subgroup CTS tests on
> ICL (that's why this patch is marked for inclusion in mesa-stable even
> though the original regressing patch was not).
>
> Cc: mesa-stable@lists.freedesktop.org
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109328
> Reported-by
> <https://bugs.freedesktop.org/show_bug.cgi?id=109328Reported-by>: Mark
> Janes <mark.a.janes@intel.com>
> ---
>  src/intel/compiler/brw_fs_generator.cpp | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_generator.cpp
> b/src/intel/compiler/brw_fs_generator.cpp
> index 5fc6cf5f8cc..b169eacf15b 100644
> --- a/src/intel/compiler/brw_fs_generator.cpp
> +++ b/src/intel/compiler/brw_fs_generator.cpp
> @@ -90,9 +90,17 @@ brw_reg_from_fs_reg(const struct gen_device_info
> *devinfo, fs_inst *inst,
>            *       different execution size when the number of components
>            *       written to each destination GRF is not the same.
>            */
> -         const unsigned width = MIN2(reg_width, phys_width);
> -         brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg), reg->nr,
> 0);
> -         brw_reg = stride(brw_reg, width * reg->stride, width,
> reg->stride);
> +         if (reg->stride > 4) {
> +            assert(reg != &inst->dst);
> +            assert(reg->stride * type_sz(reg->type) <= REG_SIZE);
> +            brw_reg = brw_vecn_reg(1, brw_file_from_reg(reg), reg->nr, 0);
> +            brw_reg = stride(brw_reg, reg->stride, 1, 0);
> +
>

Extra whitespace?


> +         } else {
> +            const unsigned width = MIN2(reg_width, phys_width);
> +            brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg),
> reg->nr, 0);
> +            brw_reg = stride(brw_reg, width * reg->stride, width,
> reg->stride);
> +         }
>
>           if (devinfo->gen == 7 && !devinfo->is_haswell) {
>              /* From the IvyBridge PRM (EU Changes by Processor
> Generation, page 13):
> --
> 2.19.2
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
Jason Ekstrand <jason@jlekstrand.net> writes:

> On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net>
> wrote:
>
>> Strides up to 32B can be implemented for the source regions of most
>> instructions by leveraging either the vertical or the horizontal
>> stride of the hardware Align1 region.  The main motivation for this is
>> that currently the lower_integer_multiplication() pass will happily
>> double the stride of one of the 32-bit sources, which can blow up if
>> the stride of the original source was already the maximum value
>> allowed by the hardware.
>>
>
> I thought this looked familiar so I did some digging...
>
> On Nov 2 of 2017, I wrote almost exactly this same patch which was
> committed on Nov 7 as e8c9e65185de3e821e1
> On Nov 14, Matt reverted it in a31d0382084c8aa8 because it wasn't needed
> anymore and he wasn't sure of its correctness.
>

That's funny, I wasn't aware of e8c9e65185de3e821e1 nor of its revert.
Change was certainly still needed on Nov 14 due to
lower_integer_multiplication().

> And here we are again....
>
> I still believe it to be correct so it is
>
> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
>
> My one major request is that you include some of the history of this change
> in the commit message.  As far as the patch itself goes, it's identical to
> mine except for the unneeded whitespace change and one additional assert
> which I believe to be a good addition.
>

Added a comment locally about your previous attempt to do the same thing.

> I've also CC'd matt in case he wants to throw in his $.02
>
> --Jason
>
> An alternative would be to use the regioning legalization pass in
>> order to lower such strides into the composition of multiple legal
>> strides, but that would be somewhat less efficient.
>>
>> This showed up as a regression from my commit cbea91eb57a501bebb1ca2
>> in Vulkan 1.1 CTS tests on CHV/BXT platforms, however it was really a
>> pre-existing problem that had affected conformance on other platforms
>> without native support for integer multiplication.  CHV/BXT were
>> getting around it because the code I removed in that commit had the
>> "fortunate" side effect of emitting narrower regions that didn't hit
>> the hardware stride limit after lowering.  Beyond fixing the
>> regression this fixes ~90 additional Vulkan 1.1 subgroup CTS tests on
>> ICL (that's why this patch is marked for inclusion in mesa-stable even
>> though the original regressing patch was not).
>>
>> Cc: mesa-stable@lists.freedesktop.org
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109328
>> Reported-by
>> <https://bugs.freedesktop.org/show_bug.cgi?id=109328Reported-by>: Mark
>> Janes <mark.a.janes@intel.com>
>> ---
>>  src/intel/compiler/brw_fs_generator.cpp | 14 +++++++++++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_generator.cpp
>> b/src/intel/compiler/brw_fs_generator.cpp
>> index 5fc6cf5f8cc..b169eacf15b 100644
>> --- a/src/intel/compiler/brw_fs_generator.cpp
>> +++ b/src/intel/compiler/brw_fs_generator.cpp
>> @@ -90,9 +90,17 @@ brw_reg_from_fs_reg(const struct gen_device_info
>> *devinfo, fs_inst *inst,
>>            *       different execution size when the number of components
>>            *       written to each destination GRF is not the same.
>>            */
>> -         const unsigned width = MIN2(reg_width, phys_width);
>> -         brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg), reg->nr,
>> 0);
>> -         brw_reg = stride(brw_reg, width * reg->stride, width,
>> reg->stride);
>> +         if (reg->stride > 4) {
>> +            assert(reg != &inst->dst);
>> +            assert(reg->stride * type_sz(reg->type) <= REG_SIZE);
>> +            brw_reg = brw_vecn_reg(1, brw_file_from_reg(reg), reg->nr, 0);
>> +            brw_reg = stride(brw_reg, reg->stride, 1, 0);
>> +
>>
>
> Extra whitespace?
>
>
>> +         } else {
>> +            const unsigned width = MIN2(reg_width, phys_width);
>> +            brw_reg = brw_vecn_reg(width, brw_file_from_reg(reg),
>> reg->nr, 0);
>> +            brw_reg = stride(brw_reg, width * reg->stride, width,
>> reg->stride);
>> +         }
>>
>>           if (devinfo->gen == 7 && !devinfo->is_haswell) {
>>              /* From the IvyBridge PRM (EU Changes by Processor
>> Generation, page 13):
>> --
>> 2.19.2
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
On Thu, Feb 14, 2019 at 1:30 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Fri, Jan 18, 2019 at 6:09 PM Francisco Jerez <currojerez@riseup.net> wrote:
>>
>> Strides up to 32B can be implemented for the source regions of most
>> instructions by leveraging either the vertical or the horizontal
>> stride of the hardware Align1 region.  The main motivation for this is
>> that currently the lower_integer_multiplication() pass will happily
>> double the stride of one of the 32-bit sources, which can blow up if
>> the stride of the original source was already the maximum value
>> allowed by the hardware.
>
>
> I thought this looked familiar so I did some digging...
>
> On Nov 2 of 2017, I wrote almost exactly this same patch which was committed on Nov 7 as e8c9e65185de3e821e1
> On Nov 14, Matt reverted it in a31d0382084c8aa8 because it wasn't needed anymore and he wasn't sure of its correctness.
>
> And here we are again....
>
> I still believe it to be correct so it is
>
> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
>
> My one major request is that you include some of the history of this change in the commit message.  As far as the patch itself goes, it's identical to mine except for the unneeded whitespace change and one additional assert which I believe to be a good addition.
>
> I've also CC'd matt in case he wants to throw in his $.02

I don't think I have time to give this a thoughtful review, but as far
as I can remember (with the commit messages to help) the patch I
reverted was the wrong fix at the time. It left some tests failing,
which I fixed in commit 6ac2d1690192, at which point the other commit
was reverted since it was dead code.

Perhaps you've found another case that needs it, but I think it was
the right decision at the time.