intel/fs: Optimize and simplify the copy propagation dataflow logic.

Submitted by Francisco Jerez on Dec. 20, 2017, 2:19 a.m.

Details

Message ID 20171220021957.19121-1-currojerez@riseup.net
State New
Headers show
Series "intel/fs: Optimize and simplify the copy propagation dataflow logic." ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Dec. 20, 2017, 2:19 a.m.
Previously the dataflow propagation algorithm would calculate the ACP
live-in and -out sets in a two-pass fixed-point algorithm.  The first
pass would update the live-out sets of all basic blocks of the program
based on their live-in sets, while the second pass would update the
live-in sets based on the live-out sets.  This is incredibly
inefficient in the typical case where the CFG of the program is
approximately acyclic, because it can take up to 2*n passes for an ACP
entry introduced at the top of the program to reach the bottom (where
n is the number of basic blocks in the program), until which point the
algorithm won't be able to reach a fixed point.

The same effect can be achieved in a single pass by computing the
live-in and -out sets in lock-step, because that makes sure that
processing of any basic block will pick up the updated live-out sets
of the lexically preceding blocks.  This gives the dataflow
propagation algorithm effectively O(n) run-time instead of O(n^2) in
the acyclic case.

The time spent in dataflow propagation is reduced by 30x in the
GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
test-case on my CHV system (the improvement is likely to be of the
same order of magnitude on other platforms).  This more than reverses
an apparent run-time regression in this test-case from my previous
copy-propagation undefined-value handling patch, which was ultimately
caused by the additional work introduced in that commit to account for
undefined values being multiplied by a huge quadratic factor.

According to Chad this test was failing on CHV due to a 30s time-out
imposed by the Android CTS (this was the case regardless of my
undefined-value handling patch, even though my patch substantially
exacerbated the issue).  On my CHV system this patch reduces the
overall run-time of the test by approximately 12x, getting us to
around 13s, well below the time-out.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
Reported-by: Chad Versace <chadversary@chromium.org>
---
 src/intel/compiler/brw_fs_copy_propagation.cpp | 30 ++++++++------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp b/src/intel/compiler/brw_fs_copy_propagation.cpp
index af5635eacef..8d6ab18cd13 100644
--- a/src/intel/compiler/brw_fs_copy_propagation.cpp
+++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
@@ -228,34 +228,17 @@  fs_copy_prop_dataflow::run()
    do {
       progress = false;
 
-      /* Update liveout for all blocks. */
       foreach_block (block, cfg) {
          if (block->parents.is_empty())
             continue;
 
          for (int i = 0; i < bitset_words; i++) {
             const BITSET_WORD old_liveout = bd[block->num].liveout[i];
-
-            bd[block->num].liveout[i] =
-               bd[block->num].copy[i] | (bd[block->num].livein[i] &
-                                         ~bd[block->num].kill[i]);
-
-            if (old_liveout != bd[block->num].liveout[i])
-               progress = true;
-         }
-      }
-
-      /* Update livein for all blocks.  If a copy is live out of all parent
-       * blocks, it's live coming in to this block.
-       */
-      foreach_block (block, cfg) {
-         if (block->parents.is_empty())
-            continue;
-
-         for (int i = 0; i < bitset_words; i++) {
-            const BITSET_WORD old_livein = bd[block->num].livein[i];
             BITSET_WORD livein_from_any_block = 0;
 
+            /* Update livein for this block.  If a copy is live out of all
+             * parent blocks, it's live coming in to this block.
+             */
             bd[block->num].livein[i] = ~0u;
             foreach_list_typed(bblock_link, parent_link, link, &block->parents) {
                bblock_t *parent = parent_link->block;
@@ -278,7 +261,12 @@  fs_copy_prop_dataflow::run()
              */
             bd[block->num].livein[i] &= livein_from_any_block;
 
-            if (old_livein != bd[block->num].livein[i])
+            /* Update liveout for this block. */
+            bd[block->num].liveout[i] =
+               bd[block->num].copy[i] | (bd[block->num].livein[i] &
+                                         ~bd[block->num].kill[i]);
+
+            if (old_liveout != bd[block->num].liveout[i])
                progress = true;
          }
       }

Comments

On Tue, Dec 19, 2017 at 9:19 PM, Francisco Jerez <currojerez@riseup.net> wrote:

There's a comment near the top of the file describing the algorithm
and referencing Muchnick. I think with your changes both of those are
no longer accurate?
Matt Turner <mattst88@gmail.com> writes:

> There's a comment near the top of the file describing the algorithm
> and referencing Muchnick. I think with your changes both of those are
> no longer accurate?

I don't have a copy of Muchnick's book at hand right now, but the
paragraph at the top of the file is definitely still accurate, the
local/global copy propagation pass split is unchanged, only the dataflow
propagation algorithm used in order to compute the live sets required as
input for the global copy propagation pass is affected by this patch.
Hi,

I got unexpected results with this, when testing it on BXT & SKL GT2.

While performance in GpuTest Volplosion and GfxBench v4 Tessellation 
test improved slightly, performance in SynMark v7 CSDof and GpuTest v0.7 
Piano dropped clearly.

Piano dropped only on SKL, but there the drop was >10%.  There was also 
a peculiar change in its CPU vs. GPU utilization at the start of the 
test.  With this patch, there's 5-10s of 100% CPU usage before the test 
starts, which extra CPU usage happens only with this patch.

Attached is Callgrind output of where the extra time at startup goes.

(I didn't see that issue with the other tests which perf changed.)


	- Eero

On 20.12.2017 04:19, Francisco Jerez wrote:
> Previously the dataflow propagation algorithm would calculate the ACP
> live-in and -out sets in a two-pass fixed-point algorithm.  The first
> pass would update the live-out sets of all basic blocks of the program
> based on their live-in sets, while the second pass would update the
> live-in sets based on the live-out sets.  This is incredibly
> inefficient in the typical case where the CFG of the program is
> approximately acyclic, because it can take up to 2*n passes for an ACP
> entry introduced at the top of the program to reach the bottom (where
> n is the number of basic blocks in the program), until which point the
> algorithm won't be able to reach a fixed point.
> 
> The same effect can be achieved in a single pass by computing the
> live-in and -out sets in lock-step, because that makes sure that
> processing of any basic block will pick up the updated live-out sets
> of the lexically preceding blocks.  This gives the dataflow
> propagation algorithm effectively O(n) run-time instead of O(n^2) in
> the acyclic case.
> 
> The time spent in dataflow propagation is reduced by 30x in the
> GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
> test-case on my CHV system (the improvement is likely to be of the
> same order of magnitude on other platforms).  This more than reverses
> an apparent run-time regression in this test-case from my previous
> copy-propagation undefined-value handling patch, which was ultimately
> caused by the additional work introduced in that commit to account for
> undefined values being multiplied by a huge quadratic factor.
> 
> According to Chad this test was failing on CHV due to a 30s time-out
> imposed by the Android CTS (this was the case regardless of my
> undefined-value handling patch, even though my patch substantially
> exacerbated the issue).  On my CHV system this patch reduces the
> overall run-time of the test by approximately 12x, getting us to
> around 13s, well below the time-out.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
> Reported-by: Chad Versace <chadversary@chromium.org>
> ---
>   src/intel/compiler/brw_fs_copy_propagation.cpp | 30 ++++++++------------------
>   1 file changed, 9 insertions(+), 21 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp b/src/intel/compiler/brw_fs_copy_propagation.cpp
> index af5635eacef..8d6ab18cd13 100644
> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp
> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
> @@ -228,34 +228,17 @@ fs_copy_prop_dataflow::run()
>      do {
>         progress = false;
>   
> -      /* Update liveout for all blocks. */
>         foreach_block (block, cfg) {
>            if (block->parents.is_empty())
>               continue;
>   
>            for (int i = 0; i < bitset_words; i++) {
>               const BITSET_WORD old_liveout = bd[block->num].liveout[i];
> -
> -            bd[block->num].liveout[i] =
> -               bd[block->num].copy[i] | (bd[block->num].livein[i] &
> -                                         ~bd[block->num].kill[i]);
> -
> -            if (old_liveout != bd[block->num].liveout[i])
> -               progress = true;
> -         }
> -      }
> -
> -      /* Update livein for all blocks.  If a copy is live out of all parent
> -       * blocks, it's live coming in to this block.
> -       */
> -      foreach_block (block, cfg) {
> -         if (block->parents.is_empty())
> -            continue;
> -
> -         for (int i = 0; i < bitset_words; i++) {
> -            const BITSET_WORD old_livein = bd[block->num].livein[i];
>               BITSET_WORD livein_from_any_block = 0;
>   
> +            /* Update livein for this block.  If a copy is live out of all
> +             * parent blocks, it's live coming in to this block.
> +             */
>               bd[block->num].livein[i] = ~0u;
>               foreach_list_typed(bblock_link, parent_link, link, &block->parents) {
>                  bblock_t *parent = parent_link->block;
> @@ -278,7 +261,12 @@ fs_copy_prop_dataflow::run()
>                */
>               bd[block->num].livein[i] &= livein_from_any_block;
>   
> -            if (old_livein != bd[block->num].livein[i])
> +            /* Update liveout for this block. */
> +            bd[block->num].liveout[i] =
> +               bd[block->num].copy[i] | (bd[block->num].livein[i] &
> +                                         ~bd[block->num].kill[i]);
> +
> +            if (old_liveout != bd[block->num].liveout[i])
>                  progress = true;
>            }
>         }
>
Hi,

On 20.12.2017 16:29, Eero Tamminen wrote:
> I got unexpected results with this, when testing it on BXT & SKL GT2.
> 
> While performance in GpuTest Volplosion and GfxBench v4 Tessellation 
> test improved slightly, performance in SynMark v7 CSDof and GpuTest v0.7 
> Piano dropped clearly.
> 
> Piano dropped only on SKL, but there the drop was >10%.

When looking at the generated shader assembly before and after, there's 
one huge shader for which Mesa is able to generated only SIMD8 assembly.


Before the patch:

SIMD8 shader: 2727 instructions. 5 loops. 470840 cycles. 0:0 
spills:fills. Promoted 58 constants. Compacted 43632 to 32560 bytes (25%)


After the patch:

SIMD8 shader: 4784 instructions. 5 loops. 808006 cycles. 212:367 
spills:fills. Promoted 9 constants. Compacted 76544 to 46192 bytes (40%)


Because shader spills i.e. there are no free regs, there are also more 
register bank conflicts.


	- Eero

 > There was also
 > a peculiar change in its CPU vs. GPU utilization at the start of the
 > test.  With this patch, there's 5-10s of 100% CPU usage before the test
 > starts, which extra CPU usage happens only with this patch.
 >
 > Attached is Callgrind output of where the extra time at startup goes.
 >
> (I didn't see that issue with the other tests which perf changed.)
> 
> 
>      - Eero
> 
> On 20.12.2017 04:19, Francisco Jerez wrote:
>> Previously the dataflow propagation algorithm would calculate the ACP
>> live-in and -out sets in a two-pass fixed-point algorithm.  The first
>> pass would update the live-out sets of all basic blocks of the program
>> based on their live-in sets, while the second pass would update the
>> live-in sets based on the live-out sets.  This is incredibly
>> inefficient in the typical case where the CFG of the program is
>> approximately acyclic, because it can take up to 2*n passes for an ACP
>> entry introduced at the top of the program to reach the bottom (where
>> n is the number of basic blocks in the program), until which point the
>> algorithm won't be able to reach a fixed point.
>>
>> The same effect can be achieved in a single pass by computing the
>> live-in and -out sets in lock-step, because that makes sure that
>> processing of any basic block will pick up the updated live-out sets
>> of the lexically preceding blocks.  This gives the dataflow
>> propagation algorithm effectively O(n) run-time instead of O(n^2) in
>> the acyclic case.
>>
>> The time spent in dataflow propagation is reduced by 30x in the
>> GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
>> test-case on my CHV system (the improvement is likely to be of the
>> same order of magnitude on other platforms).  This more than reverses
>> an apparent run-time regression in this test-case from my previous
>> copy-propagation undefined-value handling patch, which was ultimately
>> caused by the additional work introduced in that commit to account for
>> undefined values being multiplied by a huge quadratic factor.
>>
>> According to Chad this test was failing on CHV due to a 30s time-out
>> imposed by the Android CTS (this was the case regardless of my
>> undefined-value handling patch, even though my patch substantially
>> exacerbated the issue).  On my CHV system this patch reduces the
>> overall run-time of the test by approximately 12x, getting us to
>> around 13s, well below the time-out.
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
>> Reported-by: Chad Versace <chadversary@chromium.org>
>> ---
>>   src/intel/compiler/brw_fs_copy_propagation.cpp | 30 
>> ++++++++------------------
>>   1 file changed, 9 insertions(+), 21 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp 
>> b/src/intel/compiler/brw_fs_copy_propagation.cpp
>> index af5635eacef..8d6ab18cd13 100644
>> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp
>> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
>> @@ -228,34 +228,17 @@ fs_copy_prop_dataflow::run()
>>      do {
>>         progress = false;
>> -      /* Update liveout for all blocks. */
>>         foreach_block (block, cfg) {
>>            if (block->parents.is_empty())
>>               continue;
>>            for (int i = 0; i < bitset_words; i++) {
>>               const BITSET_WORD old_liveout = bd[block->num].liveout[i];
>> -
>> -            bd[block->num].liveout[i] =
>> -               bd[block->num].copy[i] | (bd[block->num].livein[i] &
>> -                                         ~bd[block->num].kill[i]);
>> -
>> -            if (old_liveout != bd[block->num].liveout[i])
>> -               progress = true;
>> -         }
>> -      }
>> -
>> -      /* Update livein for all blocks.  If a copy is live out of all 
>> parent
>> -       * blocks, it's live coming in to this block.
>> -       */
>> -      foreach_block (block, cfg) {
>> -         if (block->parents.is_empty())
>> -            continue;
>> -
>> -         for (int i = 0; i < bitset_words; i++) {
>> -            const BITSET_WORD old_livein = bd[block->num].livein[i];
>>               BITSET_WORD livein_from_any_block = 0;
>> +            /* Update livein for this block.  If a copy is live out 
>> of all
>> +             * parent blocks, it's live coming in to this block.
>> +             */
>>               bd[block->num].livein[i] = ~0u;
>>               foreach_list_typed(bblock_link, parent_link, link, 
>> &block->parents) {
>>                  bblock_t *parent = parent_link->block;
>> @@ -278,7 +261,12 @@ fs_copy_prop_dataflow::run()
>>                */
>>               bd[block->num].livein[i] &= livein_from_any_block;
>> -            if (old_livein != bd[block->num].livein[i])
>> +            /* Update liveout for this block. */
>> +            bd[block->num].liveout[i] =
>> +               bd[block->num].copy[i] | (bd[block->num].livein[i] &
>> +                                         ~bd[block->num].kill[i]);
>> +
>> +            if (old_liveout != bd[block->num].liveout[i])
>>                  progress = true;
>>            }
>>         }
>>
>