[Mesa-dev,48/95] i965/vec4: add a force_vstride0 flag to src_reg

Submitted by Iago Toral Quiroga on July 19, 2016, 10:40 a.m.

Details

Message ID 1468924892-6910-49-git-send-email-itoral@igalia.com
State New
Headers show
Series "i965 Haswell ARB_gpu_shader_fp64 / OpenGL 4.0" ( rev: 2 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga July 19, 2016, 10:40 a.m.
We will use this in cases where we want to force the vstride of a src_reg
to 0 to exploit a particular behavior of the hardware. It will come in
handy to implement access to components Z/W.
---
 src/mesa/drivers/dri/i965/brw_ir_vec4.h | 1 +
 src/mesa/drivers/dri/i965/brw_vec4.cpp  | 2 ++
 2 files changed, 3 insertions(+)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
index f66c093..f3cce4b 100644
--- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
+++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
@@ -51,6 +51,7 @@  public:
    explicit src_reg(const dst_reg &reg);
 
    src_reg *reladdr;
+   bool force_vstride0;
 };
 
 static inline src_reg
diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index a20b2fd..bfbbd96 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -70,6 +70,7 @@  src_reg::src_reg(struct ::brw_reg reg) :
 {
    this->reg_offset = 0;
    this->reladdr = NULL;
+   this->force_vstride0 = false;
 }
 
 src_reg::src_reg(const dst_reg &reg) :
@@ -77,6 +78,7 @@  src_reg::src_reg(const dst_reg &reg) :
 {
    this->reladdr = reg.reladdr;
    this->swizzle = brw_swizzle_for_mask(reg.writemask);
+   this->force_vstride0 = false;
 }
 
 void

Comments

Iago Toral Quiroga <itoral@igalia.com> writes:

> We will use this in cases where we want to force the vstride of a src_reg
> to 0 to exploit a particular behavior of the hardware. It will come in
> handy to implement access to components Z/W.
> ---
>  src/mesa/drivers/dri/i965/brw_ir_vec4.h | 1 +
>  src/mesa/drivers/dri/i965/brw_vec4.cpp  | 2 ++
>  2 files changed, 3 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> index f66c093..f3cce4b 100644
> --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> @@ -51,6 +51,7 @@ public:
>     explicit src_reg(const dst_reg &reg);
>  
>     src_reg *reladdr;
> +   bool force_vstride0;

I was wondering whether it would make more sense to unify this with the
FS back-end's fs_reg::stride (a numeric stride field is also likely more
convenient to do arithmetic on than a boolean) and promote it to
backend_reg?  It could be defined as the number of components to jump
over for each logical channel of the register, which is just the vstride
in single-precision SIMD4x2 and the hstride in scalar mode.

But thinking about it some more, I wonder if it's really necessary to
expose vertical strides at the IR level?  Aren't you planing to use this
during the conversion to HW registers exclusively?  Why don't you set
the vstride field directly in that case?

>  };
>  
>  static inline src_reg
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index a20b2fd..bfbbd96 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -70,6 +70,7 @@ src_reg::src_reg(struct ::brw_reg reg) :
>  {
>     this->reg_offset = 0;
>     this->reladdr = NULL;
> +   this->force_vstride0 = false;
>  }
>  
>  src_reg::src_reg(const dst_reg &reg) :
> @@ -77,6 +78,7 @@ src_reg::src_reg(const dst_reg &reg) :
>  {
>     this->reladdr = reg.reladdr;
>     this->swizzle = brw_swizzle_for_mask(reg.writemask);
> +   this->force_vstride0 = false;
>  }
>  
>  void
> -- 
> 2.7.4
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
On Mon, 2016-09-12 at 14:05 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral@igalia.com> writes:
> 
> > 
> > We will use this in cases where we want to force the vstride of a
> > src_reg
> > to 0 to exploit a particular behavior of the hardware. It will come
> > in
> > handy to implement access to components Z/W.
> > ---
> >  src/mesa/drivers/dri/i965/brw_ir_vec4.h | 1 +
> >  src/mesa/drivers/dri/i965/brw_vec4.cpp  | 2 ++
> >  2 files changed, 3 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > index f66c093..f3cce4b 100644
> > --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > @@ -51,6 +51,7 @@ public:
> >     explicit src_reg(const dst_reg &reg);
> >  
> >     src_reg *reladdr;
> > +   bool force_vstride0;
> I was wondering whether it would make more sense to unify this with
> the
> FS back-end's fs_reg::stride (a numeric stride field is also likely
> more
> convenient to do arithmetic on than a boolean) and promote it to
> backend_reg?  It could be defined as the number of components to jump
> over for each logical channel of the register, which is just the
> vstride
> in single-precision SIMD4x2 and the hstride in scalar mode.

We could do that, but I thought it would be a good idea to make it
clear that here we are using the vstride=0 with a very specific
intention and we don't expect the hardware to do what it would be
expected (we are trying to exploit a hardware bug after all). If we
were to use a normal stride field for this I think we would make this
intention much less obvious and other people reading the code would
have a much harder time understanding what is really going on. Since we
are being tricky here I think the extra field to signal that we are
trying to do something "special" might be worth it: people can track
where we read and write that field and see exactly where it is being
used for the purpose of exploiting this particular hardware behavior.

> But thinking about it some more, I wonder if it's really necessary to
> expose vertical strides at the IR level?  Aren't you planing to use
> this
> during the conversion to HW registers exclusively?  Why don't you set
> the vstride field directly in that case?

Yes, this is used exclusively at that time. The conversion to hardware
registers in convert_to_hw_regs() happens in two stages now:

We call our 'expand_64bit_swizzle_to_32bit()' helper first. This one
takes care of checking the regioning on DF instructions, translate
swizzles and set force_vstrid0 to true when needed (which is also the
only place that would set this to true). Then the rest of the code in
convert_to_hw_regs() just operates as usual, only that it will check
the force_vstride0 setting to decide the vstride to use for DF regions.

I did it like this because it allows us to keep the DF swizzle
translation and regioning checking logic separated from the conversion
to hardware registers, but this separation means that we need to tell
the latter when it has to set the vstride to 0, thus the addition of
the forcE_vstride0 field. I think having these two things separated
makes sense and makes the code easier to read. We can keep both things
separate and still avoid the force_vstride0 field by using a stride
field as you suggest above, but as I said, I think we might be doing a
rather tricky thing a bit less obvious than it should to other people.

> > 
> >  };
> >  
> >  static inline src_reg
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > index a20b2fd..bfbbd96 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > @@ -70,6 +70,7 @@ src_reg::src_reg(struct ::brw_reg reg) :
> >  {
> >     this->reg_offset = 0;
> >     this->reladdr = NULL;
> > +   this->force_vstride0 = false;
> >  }
> >  
> >  src_reg::src_reg(const dst_reg &reg) :
> > @@ -77,6 +78,7 @@ src_reg::src_reg(const dst_reg &reg) :
> >  {
> >     this->reladdr = reg.reladdr;
> >     this->swizzle = brw_swizzle_for_mask(reg.writemask);
> > +   this->force_vstride0 = false;
> >  }
> >  
> >  void
Iago Toral <itoral@igalia.com> writes:

> On Mon, 2016-09-12 at 14:05 -0700, Francisco Jerez wrote:
>> Iago Toral Quiroga <itoral@igalia.com> writes:
>> 
>> > 
>> > We will use this in cases where we want to force the vstride of a
>> > src_reg
>> > to 0 to exploit a particular behavior of the hardware. It will come
>> > in
>> > handy to implement access to components Z/W.
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_ir_vec4.h | 1 +
>> >  src/mesa/drivers/dri/i965/brw_vec4.cpp  | 2 ++
>> >  2 files changed, 3 insertions(+)
>> > 
>> > diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> > b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> > index f66c093..f3cce4b 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> > +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> > @@ -51,6 +51,7 @@ public:
>> >     explicit src_reg(const dst_reg &reg);
>> >  
>> >     src_reg *reladdr;
>> > +   bool force_vstride0;
>> I was wondering whether it would make more sense to unify this with
>> the
>> FS back-end's fs_reg::stride (a numeric stride field is also likely
>> more
>> convenient to do arithmetic on than a boolean) and promote it to
>> backend_reg?  It could be defined as the number of components to jump
>> over for each logical channel of the register, which is just the
>> vstride
>> in single-precision SIMD4x2 and the hstride in scalar mode.
>
> We could do that, but I thought it would be a good idea to make it
> clear that here we are using the vstride=0 with a very specific
> intention and we don't expect the hardware to do what it would be
> expected (we are trying to exploit a hardware bug after all). If we
> were to use a normal stride field for this I think we would make this
> intention much less obvious and other people reading the code would
> have a much harder time understanding what is really going on. Since we
> are being tricky here I think the extra field to signal that we are
> trying to do something "special" might be worth it: people can track
> where we read and write that field and see exactly where it is being
> used for the purpose of exploiting this particular hardware behavior.
>

Yes, I agree that the hardware's behavior on Gen7 with non-identity
vstride is tricky and special -- Special enough that *none* of the VEC4
optimization passes and IR-handling code need to be aware of it, because
the field is only going to be used as internal book-keeping data
structure in convert_to_hw_regs() and immediately discarded.  IOW you're
storing an internal data structure of convert_to_hw_regs() as part of
the shared IR data structure, with no well-defined semantics and which
no back-end code (not even convert_to_hw_regs()) is going to be able to
honor.

So if your argument for making the representation of vstride
unnecessarily non-orthogonal is that you want to discourage people from
using it at the IR level (which is fair because it won't work at all!),
I would argue that it doesn't belong in the IR data structures in the
first place, because you could just keep convert_to_hw_regs' internal
data structures internal to convert_to_hw_regs.  (I don't actually think
you need the data structure, neither internal nor external, but more on
that later)

>> But thinking about it some more, I wonder if it's really necessary to
>> expose vertical strides at the IR level?  Aren't you planing to use
>> this
>> during the conversion to HW registers exclusively?  Why don't you set
>> the vstride field directly in that case?
>
> Yes, this is used exclusively at that time. The conversion to hardware
> registers in convert_to_hw_regs() happens in two stages now:
>
> We call our 'expand_64bit_swizzle_to_32bit()' helper first. This one
> takes care of checking the regioning on DF instructions, translate
> swizzles and set force_vstrid0 to true when needed (which is also the
> only place that would set this to true). Then the rest of the code in
> convert_to_hw_regs() just operates as usual, only that it will check
> the force_vstride0 setting to decide the vstride to use for DF regions.
>
> I did it like this because it allows us to keep the DF swizzle
> translation and regioning checking logic separated from the conversion
> to hardware registers, but this separation means that we need to tell
> the latter when it has to set the vstride to 0, thus the addition of
> the forcE_vstride0 field. I think having these two things separated
> makes sense and makes the code easier to read. We can keep both things
> separate and still avoid the force_vstride0 field by using a stride
> field as you suggest above, but as I said, I think we might be doing a
> rather tricky thing a bit less obvious than it should to other people.
>

Keeping these two tasks logically separate from each other sounds fine
to me, but you don't need to extend the IR for them to exchange data.
AFAICT expand_64bit_swizzle_to_32bit() is doing two things:

 - Calculate the hardware swizzle, which potentially involves an
   adjustment of the subregister offset -- These two are uniquely
   determined by the original src_reg swizzle, type and offset.

 - Calculate the hardware vstride -- This is also uniquely determined by
   the original src_reg swizzle.

IOW, you could implement this easily (and locally) as a plain function
you could call from convert_to_hw_regs(), e.g.:

| static brw_reg
| apply_logical_swizzle(brw_reg reg, unsigned swz);

that would take a hardware register and a lowered logical swizzle
(i.e. a component-based one), and give you the same register as result
after performing any adjustments on 'reg' needed to rearrange the
logical components of 'reg' as specified by 'swz'.

This would make the IR changes and the whole translation of
force_vstride0 back to vstride currently done in convert_to_hw_regs
unnecessary.

>> > 
>> >  };
>> >  
>> >  static inline src_reg
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> > b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> > index a20b2fd..bfbbd96 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> > @@ -70,6 +70,7 @@ src_reg::src_reg(struct ::brw_reg reg) :
>> >  {
>> >     this->reg_offset = 0;
>> >     this->reladdr = NULL;
>> > +   this->force_vstride0 = false;
>> >  }
>> >  
>> >  src_reg::src_reg(const dst_reg &reg) :
>> > @@ -77,6 +78,7 @@ src_reg::src_reg(const dst_reg &reg) :
>> >  {
>> >     this->reladdr = reg.reladdr;
>> >     this->swizzle = brw_swizzle_for_mask(reg.writemask);
>> > +   this->force_vstride0 = false;
>> >  }
>> >  
>> >  void
On Tue, 2016-09-13 at 22:12 -0700, Francisco Jerez wrote:
> Iago Toral <itoral@igalia.com> writes:
> 
> > 
> > On Mon, 2016-09-12 at 14:05 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > We will use this in cases where we want to force the vstride of
> > > > a
> > > > src_reg
> > > > to 0 to exploit a particular behavior of the hardware. It will
> > > > come
> > > > in
> > > > handy to implement access to components Z/W.
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_ir_vec4.h | 1 +
> > > >  src/mesa/drivers/dri/i965/brw_vec4.cpp  | 2 ++
> > > >  2 files changed, 3 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > > > b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > > > index f66c093..f3cce4b 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > > > +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> > > > @@ -51,6 +51,7 @@ public:
> > > >     explicit src_reg(const dst_reg &reg);
> > > >  
> > > >     src_reg *reladdr;
> > > > +   bool force_vstride0;
> > > I was wondering whether it would make more sense to unify this
> > > with
> > > the
> > > FS back-end's fs_reg::stride (a numeric stride field is also
> > > likely
> > > more
> > > convenient to do arithmetic on than a boolean) and promote it to
> > > backend_reg?  It could be defined as the number of components to
> > > jump
> > > over for each logical channel of the register, which is just the
> > > vstride
> > > in single-precision SIMD4x2 and the hstride in scalar mode.
> > We could do that, but I thought it would be a good idea to make it
> > clear that here we are using the vstride=0 with a very specific
> > intention and we don't expect the hardware to do what it would be
> > expected (we are trying to exploit a hardware bug after all). If we
> > were to use a normal stride field for this I think we would make
> > this
> > intention much less obvious and other people reading the code would
> > have a much harder time understanding what is really going on.
> > Since we
> > are being tricky here I think the extra field to signal that we are
> > trying to do something "special" might be worth it: people can
> > track
> > where we read and write that field and see exactly where it is
> > being
> > used for the purpose of exploiting this particular hardware
> > behavior.
> > 
> Yes, I agree that the hardware's behavior on Gen7 with non-identity
> vstride is tricky and special -- Special enough that *none* of the
> VEC4
> optimization passes and IR-handling code need to be aware of it,
> because
> the field is only going to be used as internal book-keeping data
> structure in convert_to_hw_regs() and immediately discarded.  IOW
> you're
> storing an internal data structure of convert_to_hw_regs() as part of
> the shared IR data structure, with no well-defined semantics and
> which
> no back-end code (not even convert_to_hw_regs()) is going to be able
> to
> honor.
> 
> So if your argument for making the representation of vstride
> unnecessarily non-orthogonal is that you want to discourage people
> from
> using it at the IR level (which is fair because it won't work at
> all!),
> I would argue that it doesn't belong in the IR data structures in the
> first place, because you could just keep convert_to_hw_regs' internal
> data structures internal to convert_to_hw_regs.  (I don't actually
> think
> you need the data structure, neither internal nor external, but more
> on
> that later)

Yes, that makes sense.

> > 
> > > 
> > > But thinking about it some more, I wonder if it's really
> > > necessary to
> > > expose vertical strides at the IR level?  Aren't you planing to
> > > use
> > > this
> > > during the conversion to HW registers exclusively?  Why don't you
> > > set
> > > the vstride field directly in that case?
> > Yes, this is used exclusively at that time. The conversion to
> > hardware
> > registers in convert_to_hw_regs() happens in two stages now:
> > 
> > We call our 'expand_64bit_swizzle_to_32bit()' helper first. This
> > one
> > takes care of checking the regioning on DF instructions, translate
> > swizzles and set force_vstrid0 to true when needed (which is also
> > the
> > only place that would set this to true). Then the rest of the code
> > in
> > convert_to_hw_regs() just operates as usual, only that it will
> > check
> > the force_vstride0 setting to decide the vstride to use for DF
> > regions.
> > 
> > I did it like this because it allows us to keep the DF swizzle
> > translation and regioning checking logic separated from the
> > conversion
> > to hardware registers, but this separation means that we need to
> > tell
> > the latter when it has to set the vstride to 0, thus the addition
> > of
> > the forcE_vstride0 field. I think having these two things separated
> > makes sense and makes the code easier to read. We can keep both
> > things
> > separate and still avoid the force_vstride0 field by using a stride
> > field as you suggest above, but as I said, I think we might be
> > doing a
> > rather tricky thing a bit less obvious than it should to other
> > people.
> > 
> Keeping these two tasks logically separate from each other sounds
> fine
> to me, but you don't need to extend the IR for them to exchange data.
> AFAICT expand_64bit_swizzle_to_32bit() is doing two things:
> 
>  - Calculate the hardware swizzle, which potentially involves an
>    adjustment of the subregister offset -- These two are uniquely
>    determined by the original src_reg swizzle, type and offset.
> 
>  - Calculate the hardware vstride -- This is also uniquely determined
> by
>    the original src_reg swizzle.
> 
> IOW, you could implement this easily (and locally) as a plain
> function
> you could call from convert_to_hw_regs(), e.g.:
> 
> > 
> > static brw_reg
> > apply_logical_swizzle(brw_reg reg, unsigned swz);
> that would take a hardware register and a lowered logical swizzle
> (i.e. a component-based one), and give you the same register as
> result
> after performing any adjustments on 'reg' needed to rearrange the
> logical components of 'reg' as specified by 'swz'.
> 
> This would make the IR changes and the whole translation of
> force_vstride0 back to vstride currently done in convert_to_hw_regs
> unnecessary.

Ok, so your suggestion is to do the translation to a hardware register
first and then fix the swizzle by calling a helper function. Yes, this
sounds like a better idea, thanks!

> > 
> > > 
> > > > 
> > > > 
> > > >  };
> > > >  
> > > >  static inline src_reg
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > > > index a20b2fd..bfbbd96 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> > > > @@ -70,6 +70,7 @@ src_reg::src_reg(struct ::brw_reg reg) :
> > > >  {
> > > >     this->reg_offset = 0;
> > > >     this->reladdr = NULL;
> > > > +   this->force_vstride0 = false;
> > > >  }
> > > >  
> > > >  src_reg::src_reg(const dst_reg &reg) :
> > > > @@ -77,6 +78,7 @@ src_reg::src_reg(const dst_reg &reg) :
> > > >  {
> > > >     this->reladdr = reg.reladdr;
> > > >     this->swizzle = brw_swizzle_for_mask(reg.writemask);
> > > > +   this->force_vstride0 = false;
> > > >  }
> > > >  
> > > >  void