[RFC,6/7] HACK: Disable the instruction scheduler

Submitted by Jason Ekstrand on Sept. 10, 2018, 6:04 p.m.

Details

Message ID 20180910180437.25000-7-jason.ekstrand@intel.com
State New
Series "anv: Support VK_KHR_vulkan_memory_model" ( rev: 1 ) in Mesa

Commit Message

Jason Ekstrand Sept. 10, 2018, 6:04 p.m.
The instruction scheduler is re-ordering loads, causing fence values
to be loaded after the values they're fencing.  In particular,
consider the following pseudocode:

    void try_use_a_thing(int idx)
    {
        bool ready = ssbo.arr[idx].ready;
        vec4 data = ssbo.arr[idx].data;
        if (ready)
            use(data);
    }

    void write_a_thing(int idx, vec4 data)
    {
        ssbo.arr[idx].data = data;
        ssbo.arr[idx].ready = true;
    }

Our current instruction scheduling scheme doesn't see any problem with
re-ordering the load of "ready" with respect to the load of "data".
However, if try_use_a_thing is called in one thread and write_a_thing is
called in another thread, such re-ordering could cause invalid data to
be used.  Normally, some re-ordering of loads is fine; however, the
Vulkan memory model provides some additional guarantees, particularly
in the case of atomic loads, which we currently don't differentiate in
any way in the back-end.
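
For comparison, here is a sketch of how the same pattern could ask for
that ordering explicitly using the atomicLoad/atomicStore builtins from
GLSL's GL_KHR_memory_scope_semantics (signatures assumed from that
extension; nothing in this series wires these up):

    void try_use_a_thing(int idx)
    {
        /* Acquire load: later loads in this invocation may not be
         * re-ordered above it.  ready would be a uint here, since
         * atomicLoad operates on integer types. */
        uint ready = atomicLoad(ssbo.arr[idx].ready,
                                gl_ScopeDevice,
                                gl_StorageSemanticsBuffer,
                                gl_SemanticsAcquire);
        vec4 data = ssbo.arr[idx].data;
        if (ready != 0u)
            use(data);
    }

    void write_a_thing(int idx, vec4 data)
    {
        ssbo.arr[idx].data = data;
        /* Release store: earlier stores in this invocation may not be
         * re-ordered below it. */
        atomicStore(ssbo.arr[idx].ready, 1u,
                    gl_ScopeDevice,
                    gl_StorageSemanticsBuffer,
                    gl_SemanticsRelease);
    }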

Obviously, we need to come up with something better for this than just
shutting off the scheduler, but I wanted to send this out sooner rather
than later and provide the opportunity for a discussion.
---
 src/intel/compiler/brw_fs.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 3f7f2b4c984..9df238a6f6a 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -6427,7 +6427,7 @@  fs_visitor::allocate_registers(unsigned min_dispatch_width, bool allow_spilling)
     * performance but increasing likelihood of allocating.
     */
    for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
-      schedule_instructions(pre_modes[i]);
+      //schedule_instructions(pre_modes[i]);
 
       if (0) {
          assign_regs_trivial();
@@ -6478,7 +6478,7 @@  fs_visitor::allocate_registers(unsigned min_dispatch_width, bool allow_spilling)
 
    opt_bank_conflicts();
 
-   schedule_instructions(SCHEDULE_POST);
+   //schedule_instructions(SCHEDULE_POST);
 
    if (last_scratch > 0) {
       MAYBE_UNUSED unsigned max_scratch_size = 2 * 1024 * 1024;

Comments

On Mon, Sep 10, 2018 at 8:05 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> The instruction scheduler is re-ordering loads, causing fence values
> to be loaded after the values they're fencing.  In particular,
> consider the following pseudocode:
>
>     void try_use_a_thing(int idx)
>     {
>         bool ready = ssbo.arr[idx].ready;
>         vec4 data = ssbo.arr[idx].data;
>         if (ready)
>             use(data);
>     }
>
>     void write_a_thing(int idx, vec4 data)
>     {
>         ssbo.arr[idx].data = data;
>         ssbo.arr[idx].ready = true;
>     }
>
> Our current instruction scheduling scheme doesn't see any problem with
> re-ordering the load of "ready" with respect to the load of "data".
> However, if try_use_a_thing is called in one thread and write_a_thing is
> called in another thread, such re-ordering could cause invalid data to
> be used.  Normally, some re-ordering of loads is fine; however, the
> Vulkan memory model provides some additional guarantees, particularly
> in the case of atomic loads, which we currently don't differentiate in
> any way in the back-end.
>
> Obviously, we need to come up with something better for this than just
> shutting off the scheduler, but I wanted to send this out sooner rather
> than later and provide the opportunity for a discussion.

So how about adding a bitmask with flags for these to load/store
intrinsics? I remember we still need a coherent bit to implement
coherent loads and stores for AMD. We could easily have a mask with
coherent+atomic+whatever and then pass that to the backend?
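
As a purely hypothetical sketch, such a mask might look something like
this (none of these names exist in NIR today; they are made up for
illustration):

    /* Hypothetical per-intrinsic access mask carried by load/store
     * intrinsics and passed through to the backend. */
    enum access_flags {
       ACCESS_COHERENT = (1 << 0), /* keep caches coherent across invocations */
       ACCESS_VOLATILE = (1 << 1), /* value may change between accesses */
       ACCESS_ATOMIC   = (1 << 2), /* single indivisible access */
       ACCESS_ACQUIRE  = (1 << 3), /* later accesses may not move above this */
       ACCESS_RELEASE  = (1 << 4), /* earlier accesses may not move below this */
    };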

On 09/10/2018 11:04 AM, Jason Ekstrand wrote:
> The instruction scheduler is re-ordering loads, causing fence values
> to be loaded after the values they're fencing.  In particular,
> consider the following pseudocode:
> 
>     void try_use_a_thing(int idx)
>     {
>         bool ready = ssbo.arr[idx].ready;
>         vec4 data = ssbo.arr[idx].data;
>         if (ready)
>             use(data);
>     }
> 
>     void write_a_thing(int idx, vec4 data)
>     {
>         ssbo.arr[idx].data = data;
>         ssbo.arr[idx].ready = true;
>     }
> 
> Our current instruction scheduling scheme doesn't see any problem with
> re-ordering the load of "ready" with respect to the load of "data".
> However, if try_use_a_thing is called in one thread and write_a_thing is
> called in another thread, such re-ordering could cause invalid data to
> be used.  Normally, some re-ordering of loads is fine; however, the
> Vulkan memory model provides some additional guarantees, particularly
> in the case of atomic loads, which we currently don't differentiate in
> any way in the back-end.

I have encountered similar problems with reordering in OpenGL land when
reading a value that is also accessed using atomics.  SPIR-V has an
atomic load instruction.  I haven't looked at how we handle that in
spv_to_nir, but could we use something like that in these cases?
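
For reference, an atomic load in SPIR-V assembly looks roughly like the
following; the scope and semantics operands are IDs of integer
constants, and the %names here are illustrative:

    %ready = OpAtomicLoad %uint %ptr %scope_device %sem_acquire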

On Mon, Sep 10, 2018 at 5:33 PM Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> wrote:

> On Mon, Sep 10, 2018 at 8:05 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> So how about adding a bitmask with flags for these to load/store
> intrinsics? I remember we still need a coherent bit to implement
> coherent loads and stores for AMD. We could easily have a mask with
> coherent+atomic+whatever and then pass that to the backend?
>

I think that's where we're going to end up.  In the current model, you
just have load/store ops and then barriers that block re-ordering around
them.  However, I suspect that the long-term direction is going to be
something more like SPIR-V's access flags on each load/store that provide
some sort of partial barrier and re-ordering restriction information.
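
As a sketch of what that could give the scheduler, reusing the
hypothetical flags from the earlier sketch:

    /* Hypothetical: two memory operations may only be re-ordered with
     * respect to each other if neither carries an ordering constraint. */
    bool can_reorder(unsigned access_a, unsigned access_b)
    {
       const unsigned ordered = ACCESS_ACQUIRE | ACCESS_RELEASE;
       return !(access_a & ordered) && !(access_b & ordered);
    }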

On Mon, Sep 10, 2018 at 5:38 PM Ian Romanick <idr@freedesktop.org> wrote:

> On 09/10/2018 11:04 AM, Jason Ekstrand wrote:
>
> I have encountered similar problems with reordering in OpenGL land when
> reading a value that is also accessed using atomics.  SPIR-V has an
> atomic load instruction.  I haven't looked at how we handle that in
> spv_to_nir, but could we use something like that in these cases?
>

OpenGL only has load/store and no concept of an atomic load or store.  I
think the theory is that loads/stores of a 32-bit integer should already be
atomic, right?  Unfortunately, that leaves the ordering rules ambiguous.
SPIR-V has AtomicLoad and AtomicStore opcodes which could, more or less,
imply the appropriate ordering constraints.  With the memory model,
however, things get a lot more granular and you can start specifying very
fine-grained ordering constraints on loads, stores, and atomics.  The
primary difference, in this brave new world, between atomic and not is that
atomics get some of those constraints by default whereas you have to ask
for them on a regular load/store.

Currently SPIR-V's OpAtomicLoad is just translated to a regular load in NIR.

--Jason