[Mesa-dev,62/95] i965/vec4: Add a shuffle_64bit_data helper

Submitted by Iago Toral Quiroga on July 19, 2016, 10:40 a.m.

Details

Message ID 1468924892-6910-63-git-send-email-itoral@igalia.com
State New
Headers show
Series "i965 Haswell ARB_gpu_shader_fp64 / OpenGL 4.0" ( rev: 2 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga July 19, 2016, 10:40 a.m.
SIMD4x2 64bit data is stored in register space like this:

r0.0:DF  x0 y0 z0 w0
r0.1:DF  x1 y1 z1 w1

When we need to write data such as this to memory using 32-bit write
messages we need to shuffle it in this fashion:

r0.0:DF  x0 y0 x1 y1
r0.1:DF  z0 w0 z1 w1

and emit two 32-bit write messages, one for r0.0 at base_offset
and another one for r0.1 at base_offset+16.

We also need to do the inverse operation when we read using 32-bit messages
to produce valid SIMD4x2 64bit data from the data read. We can achieve this
by aplying the exact same shuffling to the data read, although we need to
apply different channel enables since the layout of the data is reversed.

This helper implements the data shuffling logic and we will use it in
various places where we read and write 64bit data from/to memory.
---
 src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
 src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96 ++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h b/src/mesa/drivers/dri/i965/brw_vec4.h
index 26228d0..3337fc0 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.h
+++ b/src/mesa/drivers/dri/i965/brw_vec4.h
@@ -327,6 +327,11 @@  public:
 
    src_reg setup_imm_df(double v);
 
+   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg src,
+                                        bool for_write,
+                                        bblock_t *block = NULL,
+                                        vec4_instruction *ref = NULL);
+
    virtual void emit_nir_code();
    virtual void nir_setup_uniforms();
    virtual void nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
index 450db92..346e822 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
@@ -2145,4 +2145,100 @@  vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
       dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
 }
 
+/* SIMD4x2 64bit data is stored in register space like this:
+ *
+ * r0.0:DF  x0 y0 z0 w0
+ * r0.1:DF  x1 y1 z1 w1
+ *
+ * When we need to write data such as this to memory using 32-bit write
+ * messages we need to shuffle it in this fashion:
+ *
+ * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
+ * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
+ *
+ * We need to do the inverse operation when we read using 32-bit messages,
+ * which we can do by applying the same exact shuffling on the 64-bit data
+ * read, only that because the data for each vertex is positioned differently
+ * we need to apply different channel enables.
+ *
+ * This function takes 64bit data and shuffles it as explained above.
+ *
+ * The @for_write parameter is used to specify if the shuffling is being done
+ * for proper SIMD4x2 64-bit data that needs to be shuffled prior to a 32-bit
+ * write message (for_write = true), or instead we are doing the inverse
+ * opperation and we have just read 64-bit data using a 32-bit messages that we
+ * need to shuffle to create valid SIMD4x2 64-bit data (for_write = false).
+ *
+ * If @block and @ref are non-NULL, then the shuffling is done after @ref,
+ * otherwise the instructions are emitted normally at the end. The function
+ * returns the last instruction inserted.
+ *
+ * Notice that @src and @dst cannot be the same register.
+ */
+vec4_instruction *
+vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src, bool for_write,
+                                 bblock_t *block, vec4_instruction *ref)
+{
+   assert(type_sz(src.type) == 8);
+   assert(type_sz(dst.type) == 8);
+   assert(!src.in_range(dst, 2));
+   assert(dst.writemask == WRITEMASK_XYZW);
+   assert(!ref == !block);
+
+   vec4_instruction *inst, *last;
+   bool emit_before = ref != NULL;
+
+   #define EMIT(i) \
+      if (!emit_before) { \
+         emit(i); \
+      } else { \
+         ref->insert_after(block, i); \
+         ref = i; \
+      }  \
+      last = i;
+
+   /* Resolve swizzle in src */
+   if (src.swizzle != BRW_SWIZZLE_XYZW) {
+      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
+      inst = MOV(data, src);
+      EMIT(inst);
+      src = src_reg(data);
+   }
+
+   /* dst+0.XY = src+0.XY */
+   dst.writemask = WRITEMASK_XY;
+   inst = MOV(dst, src);
+   inst->regs_written = 1;
+   inst->exec_size = 4;
+   EMIT(inst);
+
+   /* dst+0.ZW = src+1.XY */
+   dst.writemask = WRITEMASK_ZW;
+   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
+   inst->regs_written = 1;
+   inst->exec_size = 4;
+   inst->group = for_write ? 4 : 0;
+   EMIT(inst);
+
+   /* dst+1.XY = src+0.ZW */
+   dst.writemask = WRITEMASK_XY;
+   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
+   inst->regs_written = 1;
+   inst->exec_size = 4;
+   inst->group = for_write ? 0 : 4;
+   EMIT(inst);
+
+   /* dst+1.ZW = src+1.ZW */
+   dst.writemask = WRITEMASK_ZW;
+   inst = MOV(offset(dst, 1), offset(src, 1));
+   inst->regs_written = 1;
+   inst->exec_size = 4;
+   inst->group = 4;
+   EMIT(inst);
+
+   #undef EMIT
+
+   return last;
+}
+
 }

Comments

Iago Toral Quiroga <itoral@igalia.com> writes:

> SIMD4x2 64bit data is stored in register space like this:
>
> r0.0:DF  x0 y0 z0 w0
> r0.1:DF  x1 y1 z1 w1
>
> When we need to write data such as this to memory using 32-bit write
> messages we need to shuffle it in this fashion:
>
> r0.0:DF  x0 y0 x1 y1
> r0.1:DF  z0 w0 z1 w1
>
> and emit two 32-bit write messages, one for r0.0 at base_offset
> and another one for r0.1 at base_offset+16.
>
> We also need to do the inverse operation when we read using 32-bit messages
> to produce valid SIMD4x2 64bit data from the data read. We can achieve this
> by aplying the exact same shuffling to the data read, although we need to
> apply different channel enables since the layout of the data is reversed.
>
> This helper implements the data shuffling logic and we will use it in
> various places where we read and write 64bit data from/to memory.
> ---
>  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
>  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96 ++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h b/src/mesa/drivers/dri/i965/brw_vec4.h
> index 26228d0..3337fc0 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> @@ -327,6 +327,11 @@ public:
>  
>     src_reg setup_imm_df(double v);
>  
> +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg src,
> +                                        bool for_write,
> +                                        bblock_t *block = NULL,
> +                                        vec4_instruction *ref = NULL);
> +
>     virtual void emit_nir_code();
>     virtual void nir_setup_uniforms();
>     virtual void nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> index 450db92..346e822 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> @@ -2145,4 +2145,100 @@ vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
>        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
>  }
>  
> +/* SIMD4x2 64bit data is stored in register space like this:
> + *
> + * r0.0:DF  x0 y0 z0 w0
> + * r0.1:DF  x1 y1 z1 w1
> + *
> + * When we need to write data such as this to memory using 32-bit write
> + * messages we need to shuffle it in this fashion:
> + *
> + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
> + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
> + *
> + * We need to do the inverse operation when we read using 32-bit messages,
> + * which we can do by applying the same exact shuffling on the 64-bit data
> + * read, only that because the data for each vertex is positioned differently
> + * we need to apply different channel enables.
> + *
> + * This function takes 64bit data and shuffles it as explained above.
> + *
> + * The @for_write parameter is used to specify if the shuffling is being done
> + * for proper SIMD4x2 64-bit data that needs to be shuffled prior to a 32-bit
> + * write message (for_write = true), or instead we are doing the inverse
> + * opperation and we have just read 64-bit data using a 32-bit messages that we
> + * need to shuffle to create valid SIMD4x2 64-bit data (for_write = false).
> + *
> + * If @block and @ref are non-NULL, then the shuffling is done after @ref,
> + * otherwise the instructions are emitted normally at the end. The function
> + * returns the last instruction inserted.
> + *
> + * Notice that @src and @dst cannot be the same register.
> + */
> +vec4_instruction *
> +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src, bool for_write,
> +                                 bblock_t *block, vec4_instruction *ref)
> +{
> +   assert(type_sz(src.type) == 8);
> +   assert(type_sz(dst.type) == 8);
> +   assert(!src.in_range(dst, 2));
> +   assert(dst.writemask == WRITEMASK_XYZW);
> +   assert(!ref == !block);
> +
> +   vec4_instruction *inst, *last;
> +   bool emit_before = ref != NULL;
> +
> +   #define EMIT(i) \
> +      if (!emit_before) { \
> +         emit(i); \
> +      } else { \
> +         ref->insert_after(block, i); \
> +         ref = i; \
> +      }  \
> +      last = i;
> +
> +   /* Resolve swizzle in src */
> +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
> +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
> +      inst = MOV(data, src);
> +      EMIT(inst);
> +      src = src_reg(data);
> +   }
> +
> +   /* dst+0.XY = src+0.XY */
> +   dst.writemask = WRITEMASK_XY;
> +   inst = MOV(dst, src);
> +   inst->regs_written = 1;
> +   inst->exec_size = 4;
> +   EMIT(inst);
> +
> +   /* dst+0.ZW = src+1.XY */
> +   dst.writemask = WRITEMASK_ZW;
> +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
> +   inst->regs_written = 1;
> +   inst->exec_size = 4;
> +   inst->group = for_write ? 4 : 0;
> +   EMIT(inst);
> +
> +   /* dst+1.XY = src+0.ZW */
> +   dst.writemask = WRITEMASK_XY;
> +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
> +   inst->regs_written = 1;
> +   inst->exec_size = 4;
> +   inst->group = for_write ? 0 : 4;
> +   EMIT(inst);
> +
> +   /* dst+1.ZW = src+1.ZW */
> +   dst.writemask = WRITEMASK_ZW;
> +   inst = MOV(offset(dst, 1), offset(src, 1));
> +   inst->regs_written = 1;
> +   inst->exec_size = 4;
> +   inst->group = 4;
> +   EMIT(inst);
> +
> +   #undef EMIT
> +
> +   return last;
> +}
> +

This would be a *lot* simpler if you used and took as argument a VEC4
builder instead of constructing the IR manually (and if you used the
writemask() helper).  It might need some small fixes for the VEC4
builder to initialize the new group and exec_size fields correctly but
it seems worth doing just because of this function alone.

>  }
> -- 
> 2.7.4
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
On Mon, 2016-09-12 at 14:19 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral@igalia.com> writes:
> 
> > 
> > SIMD4x2 64bit data is stored in register space like this:
> > 
> > r0.0:DF  x0 y0 z0 w0
> > r0.1:DF  x1 y1 z1 w1
> > 
> > When we need to write data such as this to memory using 32-bit
> > write
> > messages we need to shuffle it in this fashion:
> > 
> > r0.0:DF  x0 y0 x1 y1
> > r0.1:DF  z0 w0 z1 w1
> > 
> > and emit two 32-bit write messages, one for r0.0 at base_offset
> > and another one for r0.1 at base_offset+16.
> > 
> > We also need to do the inverse operation when we read using 32-bit
> > messages
> > to produce valid SIMD4x2 64bit data from the data read. We can
> > achieve this
> > by aplying the exact same shuffling to the data read, although we
> > need to
> > apply different channel enables since the layout of the data is
> > reversed.
> > 
> > This helper implements the data shuffling logic and we will use it
> > in
> > various places where we read and write 64bit data from/to memory.
> > ---
> >  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96
> > ++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
> > b/src/mesa/drivers/dri/i965/brw_vec4.h
> > index 26228d0..3337fc0 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> > @@ -327,6 +327,11 @@ public:
> >  
> >     src_reg setup_imm_df(double v);
> >  
> > +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg src,
> > +                                        bool for_write,
> > +                                        bblock_t *block = NULL,
> > +                                        vec4_instruction *ref =
> > NULL);
> > +
> >     virtual void emit_nir_code();
> >     virtual void nir_setup_uniforms();
> >     virtual void
> > nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > index 450db92..346e822 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > @@ -2145,4 +2145,100 @@
> > vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
> >        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
> >  }
> >  
> > +/* SIMD4x2 64bit data is stored in register space like this:
> > + *
> > + * r0.0:DF  x0 y0 z0 w0
> > + * r0.1:DF  x1 y1 z1 w1
> > + *
> > + * When we need to write data such as this to memory using 32-bit
> > write
> > + * messages we need to shuffle it in this fashion:
> > + *
> > + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
> > + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
> > + *
> > + * We need to do the inverse operation when we read using 32-bit
> > messages,
> > + * which we can do by applying the same exact shuffling on the 64-
> > bit data
> > + * read, only that because the data for each vertex is positioned
> > differently
> > + * we need to apply different channel enables.
> > + *
> > + * This function takes 64bit data and shuffles it as explained
> > above.
> > + *
> > + * The @for_write parameter is used to specify if the shuffling is
> > being done
> > + * for proper SIMD4x2 64-bit data that needs to be shuffled prior
> > to a 32-bit
> > + * write message (for_write = true), or instead we are doing the
> > inverse
> > + * opperation and we have just read 64-bit data using a 32-bit
> > messages that we
> > + * need to shuffle to create valid SIMD4x2 64-bit data (for_write
> > = false).
> > + *
> > + * If @block and @ref are non-NULL, then the shuffling is done
> > after @ref,
> > + * otherwise the instructions are emitted normally at the end. The
> > function
> > + * returns the last instruction inserted.
> > + *
> > + * Notice that @src and @dst cannot be the same register.
> > + */
> > +vec4_instruction *
> > +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src, bool
> > for_write,
> > +                                 bblock_t *block, vec4_instruction
> > *ref)
> > +{
> > +   assert(type_sz(src.type) == 8);
> > +   assert(type_sz(dst.type) == 8);
> > +   assert(!src.in_range(dst, 2));
> > +   assert(dst.writemask == WRITEMASK_XYZW);
> > +   assert(!ref == !block);
> > +
> > +   vec4_instruction *inst, *last;
> > +   bool emit_before = ref != NULL;
> > +
> > +   #define EMIT(i) \
> > +      if (!emit_before) { \
> > +         emit(i); \
> > +      } else { \
> > +         ref->insert_after(block, i); \
> > +         ref = i; \
> > +      }  \
> > +      last = i;
> > +
> > +   /* Resolve swizzle in src */
> > +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
> > +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
> > +      inst = MOV(data, src);
> > +      EMIT(inst);
> > +      src = src_reg(data);
> > +   }
> > +
> > +   /* dst+0.XY = src+0.XY */
> > +   dst.writemask = WRITEMASK_XY;
> > +   inst = MOV(dst, src);
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   EMIT(inst);
> > +
> > +   /* dst+0.ZW = src+1.XY */
> > +   dst.writemask = WRITEMASK_ZW;
> > +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = for_write ? 4 : 0;
> > +   EMIT(inst);
> > +
> > +   /* dst+1.XY = src+0.ZW */
> > +   dst.writemask = WRITEMASK_XY;
> > +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = for_write ? 0 : 4;
> > +   EMIT(inst);
> > +
> > +   /* dst+1.ZW = src+1.ZW */
> > +   dst.writemask = WRITEMASK_ZW;
> > +   inst = MOV(offset(dst, 1), offset(src, 1));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = 4;
> > +   EMIT(inst);
> > +
> > +   #undef EMIT
> > +
> > +   return last;
> > +}
> > +
> This would be a *lot* simpler if you used and took as argument a VEC4
> builder instead of constructing the IR manually (and if you used the
> writemask() helper).  It might need some small fixes for the VEC4
> builder to initialize the new group and exec_size fields correctly
> but
> it seems worth doing just because of this function alone.

Sure, I can do that. About using the writemask() helper, the problem
with it is that it doesn't set the input writemask on the dst, instead
it sets the result of "dst.writemask & mask", and when we are shuffling
we really want to force the writemask independently of any previous
writemask the dst register had. Probably the dst we get here has an
XYZW writemask anyway since it would typically be a temporary vgrf we
have just created to store the result of the shuffling operation. Maybe
we should just add an assert to ensure that, and then we should be free
to use the writemask() helper.

> > 
> >  }
Iago Toral <itoral@igalia.com> writes:

> On Mon, 2016-09-12 at 14:19 -0700, Francisco Jerez wrote:
>> Iago Toral Quiroga <itoral@igalia.com> writes:
>> 
>> > 
>> > SIMD4x2 64bit data is stored in register space like this:
>> > 
>> > r0.0:DF  x0 y0 z0 w0
>> > r0.1:DF  x1 y1 z1 w1
>> > 
>> > When we need to write data such as this to memory using 32-bit
>> > write
>> > messages we need to shuffle it in this fashion:
>> > 
>> > r0.0:DF  x0 y0 x1 y1
>> > r0.1:DF  z0 w0 z1 w1
>> > 
>> > and emit two 32-bit write messages, one for r0.0 at base_offset
>> > and another one for r0.1 at base_offset+16.
>> > 
>> > We also need to do the inverse operation when we read using 32-bit
>> > messages
>> > to produce valid SIMD4x2 64bit data from the data read. We can
>> > achieve this
>> > by aplying the exact same shuffling to the data read, although we
>> > need to
>> > apply different channel enables since the layout of the data is
>> > reversed.
>> > 
>> > This helper implements the data shuffling logic and we will use it
>> > in
>> > various places where we read and write 64bit data from/to memory.
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
>> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96
>> > ++++++++++++++++++++++++++++++
>> >  2 files changed, 101 insertions(+)
>> > 
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
>> > b/src/mesa/drivers/dri/i965/brw_vec4.h
>> > index 26228d0..3337fc0 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
>> > @@ -327,6 +327,11 @@ public:
>> >  
>> >     src_reg setup_imm_df(double v);
>> >  
>> > +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg src,
>> > +                                        bool for_write,
>> > +                                        bblock_t *block = NULL,
>> > +                                        vec4_instruction *ref =
>> > NULL);
>> > +
>> >     virtual void emit_nir_code();
>> >     virtual void nir_setup_uniforms();
>> >     virtual void
>> > nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > index 450db92..346e822 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > @@ -2145,4 +2145,100 @@
>> > vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
>> >        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
>> >  }
>> >  
>> > +/* SIMD4x2 64bit data is stored in register space like this:
>> > + *
>> > + * r0.0:DF  x0 y0 z0 w0
>> > + * r0.1:DF  x1 y1 z1 w1
>> > + *
>> > + * When we need to write data such as this to memory using 32-bit
>> > write
>> > + * messages we need to shuffle it in this fashion:
>> > + *
>> > + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
>> > + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
>> > + *
>> > + * We need to do the inverse operation when we read using 32-bit
>> > messages,
>> > + * which we can do by applying the same exact shuffling on the 64-
>> > bit data
>> > + * read, only that because the data for each vertex is positioned
>> > differently
>> > + * we need to apply different channel enables.
>> > + *
>> > + * This function takes 64bit data and shuffles it as explained
>> > above.
>> > + *
>> > + * The @for_write parameter is used to specify if the shuffling is
>> > being done
>> > + * for proper SIMD4x2 64-bit data that needs to be shuffled prior
>> > to a 32-bit
>> > + * write message (for_write = true), or instead we are doing the
>> > inverse
>> > + * opperation and we have just read 64-bit data using a 32-bit
>> > messages that we
>> > + * need to shuffle to create valid SIMD4x2 64-bit data (for_write
>> > = false).
>> > + *
>> > + * If @block and @ref are non-NULL, then the shuffling is done
>> > after @ref,
>> > + * otherwise the instructions are emitted normally at the end. The
>> > function
>> > + * returns the last instruction inserted.
>> > + *
>> > + * Notice that @src and @dst cannot be the same register.
>> > + */
>> > +vec4_instruction *
>> > +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src, bool
>> > for_write,
>> > +                                 bblock_t *block, vec4_instruction
>> > *ref)
>> > +{
>> > +   assert(type_sz(src.type) == 8);
>> > +   assert(type_sz(dst.type) == 8);
>> > +   assert(!src.in_range(dst, 2));
>> > +   assert(dst.writemask == WRITEMASK_XYZW);
>> > +   assert(!ref == !block);
>> > +
>> > +   vec4_instruction *inst, *last;
>> > +   bool emit_before = ref != NULL;
>> > +
>> > +   #define EMIT(i) \
>> > +      if (!emit_before) { \
>> > +         emit(i); \
>> > +      } else { \
>> > +         ref->insert_after(block, i); \
>> > +         ref = i; \
>> > +      }  \
>> > +      last = i;
>> > +
>> > +   /* Resolve swizzle in src */
>> > +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
>> > +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
>> > +      inst = MOV(data, src);
>> > +      EMIT(inst);
>> > +      src = src_reg(data);
>> > +   }
>> > +
>> > +   /* dst+0.XY = src+0.XY */
>> > +   dst.writemask = WRITEMASK_XY;
>> > +   inst = MOV(dst, src);
>> > +   inst->regs_written = 1;
>> > +   inst->exec_size = 4;
>> > +   EMIT(inst);
>> > +
>> > +   /* dst+0.ZW = src+1.XY */
>> > +   dst.writemask = WRITEMASK_ZW;
>> > +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
>> > +   inst->regs_written = 1;
>> > +   inst->exec_size = 4;
>> > +   inst->group = for_write ? 4 : 0;
>> > +   EMIT(inst);
>> > +
>> > +   /* dst+1.XY = src+0.ZW */
>> > +   dst.writemask = WRITEMASK_XY;
>> > +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
>> > +   inst->regs_written = 1;
>> > +   inst->exec_size = 4;
>> > +   inst->group = for_write ? 0 : 4;
>> > +   EMIT(inst);
>> > +
>> > +   /* dst+1.ZW = src+1.ZW */
>> > +   dst.writemask = WRITEMASK_ZW;
>> > +   inst = MOV(offset(dst, 1), offset(src, 1));
>> > +   inst->regs_written = 1;
>> > +   inst->exec_size = 4;
>> > +   inst->group = 4;
>> > +   EMIT(inst);
>> > +
>> > +   #undef EMIT
>> > +
>> > +   return last;
>> > +}
>> > +
>> This would be a *lot* simpler if you used and took as argument a VEC4
>> builder instead of constructing the IR manually (and if you used the
>> writemask() helper).  It might need some small fixes for the VEC4
>> builder to initialize the new group and exec_size fields correctly
>> but
>> it seems worth doing just because of this function alone.
>
> Sure, I can do that. About using the writemask() helper, the problem
> with it is that it doesn't set the input writemask on the dst, instead
> it sets the result of "dst.writemask & mask", and when we are shuffling
> we really want to force the writemask independently of any previous
> writemask the dst register had.

Why?  If I give a register r with writemask XY to a function 'foo', I
wouldn't expect it to ignore the writemask I told it to use and instead
use a different writemask not contained in the original writemask.  If
you're doing that right now you likely have a bug.

> Probably the dst we get here has an XYZW writemask anyway since it
> would typically be a temporary vgrf we have just created to store the
> result of the shuffling operation. Maybe we should just add an assert
> to ensure that, and then we should be free to use the writemask()
> helper.
>

Looks like you have one already:

+   assert(dst.writemask == WRITEMASK_XYZW);

Question is do you need it only because you're bashing the writemasks
directly instead of using writemask(), or is there any other reason for
the assert?

>> > 
>> >  }
On Tue, 2016-09-13 at 22:24 -0700, Francisco Jerez wrote:
> Iago Toral <itoral@igalia.com> writes:
> 
> > 
> > On Mon, 2016-09-12 at 14:19 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > SIMD4x2 64bit data is stored in register space like this:
> > > > 
> > > > r0.0:DF  x0 y0 z0 w0
> > > > r0.1:DF  x1 y1 z1 w1
> > > > 
> > > > When we need to write data such as this to memory using 32-bit
> > > > write
> > > > messages we need to shuffle it in this fashion:
> > > > 
> > > > r0.0:DF  x0 y0 x1 y1
> > > > r0.1:DF  z0 w0 z1 w1
> > > > 
> > > > and emit two 32-bit write messages, one for r0.0 at base_offset
> > > > and another one for r0.1 at base_offset+16.
> > > > 
> > > > We also need to do the inverse operation when we read using 32-
> > > > bit
> > > > messages
> > > > to produce valid SIMD4x2 64bit data from the data read. We can
> > > > achieve this
> > > > by aplying the exact same shuffling to the data read, although
> > > > we
> > > > need to
> > > > apply different channel enables since the layout of the data is
> > > > reversed.
> > > > 
> > > > This helper implements the data shuffling logic and we will use
> > > > it
> > > > in
> > > > various places where we read and write 64bit data from/to
> > > > memory.
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96
> > > > ++++++++++++++++++++++++++++++
> > > >  2 files changed, 101 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > b/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > index 26228d0..3337fc0 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > @@ -327,6 +327,11 @@ public:
> > > >  
> > > >     src_reg setup_imm_df(double v);
> > > >  
> > > > +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg
> > > > src,
> > > > +                                        bool for_write,
> > > > +                                        bblock_t *block =
> > > > NULL,
> > > > +                                        vec4_instruction *ref
> > > > =
> > > > NULL);
> > > > +
> > > >     virtual void emit_nir_code();
> > > >     virtual void nir_setup_uniforms();
> > > >     virtual void
> > > > nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 450db92..346e822 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -2145,4 +2145,100 @@
> > > > vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
> > > >        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
> > > >  }
> > > >  
> > > > +/* SIMD4x2 64bit data is stored in register space like this:
> > > > + *
> > > > + * r0.0:DF  x0 y0 z0 w0
> > > > + * r0.1:DF  x1 y1 z1 w1
> > > > + *
> > > > + * When we need to write data such as this to memory using 32-
> > > > bit
> > > > write
> > > > + * messages we need to shuffle it in this fashion:
> > > > + *
> > > > + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
> > > > + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
> > > > + *
> > > > + * We need to do the inverse operation when we read using 32-
> > > > bit
> > > > messages,
> > > > + * which we can do by applying the same exact shuffling on the
> > > > 64-
> > > > bit data
> > > > + * read, only that because the data for each vertex is
> > > > positioned
> > > > differently
> > > > + * we need to apply different channel enables.
> > > > + *
> > > > + * This function takes 64bit data and shuffles it as explained
> > > > above.
> > > > + *
> > > > + * The @for_write parameter is used to specify if the
> > > > shuffling is
> > > > being done
> > > > + * for proper SIMD4x2 64-bit data that needs to be shuffled
> > > > prior
> > > > to a 32-bit
> > > > + * write message (for_write = true), or instead we are doing
> > > > the
> > > > inverse
> > > > + * opperation and we have just read 64-bit data using a 32-bit
> > > > messages that we
> > > > + * need to shuffle to create valid SIMD4x2 64-bit data
> > > > (for_write
> > > > = false).
> > > > + *
> > > > + * If @block and @ref are non-NULL, then the shuffling is done
> > > > after @ref,
> > > > + * otherwise the instructions are emitted normally at the end.
> > > > The
> > > > function
> > > > + * returns the last instruction inserted.
> > > > + *
> > > > + * Notice that @src and @dst cannot be the same register.
> > > > + */
> > > > +vec4_instruction *
> > > > +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src,
> > > > bool
> > > > for_write,
> > > > +                                 bblock_t *block,
> > > > vec4_instruction
> > > > *ref)
> > > > +{
> > > > +   assert(type_sz(src.type) == 8);
> > > > +   assert(type_sz(dst.type) == 8);
> > > > +   assert(!src.in_range(dst, 2));
> > > > +   assert(dst.writemask == WRITEMASK_XYZW);
> > > > +   assert(!ref == !block);
> > > > +
> > > > +   vec4_instruction *inst, *last;
> > > > +   bool emit_before = ref != NULL;
> > > > +
> > > > +   #define EMIT(i) \
> > > > +      if (!emit_before) { \
> > > > +         emit(i); \
> > > > +      } else { \
> > > > +         ref->insert_after(block, i); \
> > > > +         ref = i; \
> > > > +      }  \
> > > > +      last = i;
> > > > +
> > > > +   /* Resolve swizzle in src */
> > > > +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
> > > > +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
> > > > +      inst = MOV(data, src);
> > > > +      EMIT(inst);
> > > > +      src = src_reg(data);
> > > > +   }
> > > > +
> > > > +   /* dst+0.XY = src+0.XY */
> > > > +   dst.writemask = WRITEMASK_XY;
> > > > +   inst = MOV(dst, src);
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+0.ZW = src+1.XY */
> > > > +   dst.writemask = WRITEMASK_ZW;
> > > > +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = for_write ? 4 : 0;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+1.XY = src+0.ZW */
> > > > +   dst.writemask = WRITEMASK_XY;
> > > > +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = for_write ? 0 : 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+1.ZW = src+1.ZW */
> > > > +   dst.writemask = WRITEMASK_ZW;
> > > > +   inst = MOV(offset(dst, 1), offset(src, 1));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   #undef EMIT
> > > > +
> > > > +   return last;
> > > > +}
> > > > +
> > > This would be a *lot* simpler if you used and took as argument a
> > > VEC4
> > > builder instead of constructing the IR manually (and if you used
> > > the
> > > writemask() helper).  It might need some small fixes for the VEC4
> > > builder to initialize the new group and exec_size fields
> > > correctly
> > > but
> > > it seems worth doing just because of this function alone.
> > Sure, I can do that. About using the writemask() helper, the
> > problem
> > with it is that it doesn't set the input writemask on the dst,
> > instead
> > it sets the result of "dst.writemask & mask", and when we are
> > shuffling
> > we really want to force the writemask independently of any previous
> > writemask the dst register had.
> Why?  If I give a register r with writemask XY to a function 'foo', I
> wouldn't expect it to ignore the writemask I told it to use and
> instead
> use a different writemask not contained in the original
> writemask.  If
> you're doing that right now you likely have a bug.

Shuffled data is always written to a new vgrf because (so there is
never data to preserve in the dst) and it is only used specifically
right before a write where we honor the writemask on the dst of the
write operation or right after a read (where we the swizzle on the
source takes care of reading the data channels it needs from the
shuffled result), so I think this is probably fine.

> > 
> > Probably the dst we get here has an XYZW writemask anyway since it
> > would typically be a temporary vgrf we have just created to store
> > the
> > result of the shuffling operation. Maybe we should just add an
> > assert
> > to ensure that, and then we should be free to use the writemask()
> > helper.
> > 
> Looks like you have one already:
> 
> +   assert(dst.writemask == WRITEMASK_XYZW);
> 
> Question is do you need it only because you're bashing the writemasks
> directly instead of using writemask(), or is there any other reason
> for
> the assert?

I don't think we really need this as a requirement and we can honor any
writemask we get. That said, I don't think that would be immediately
useful in any scenario, at least right now.

As a side note, notice that using writemasks with shuffling operations
requires to think a bit about what we are really doing in each case: if
we use an XY writemask when we shuffle before a write, we are shuffling
XYZW components of just one (the first) of the SIMD4x2 threads. If we
do the same thing when we shuffle the result of a read then we are
writing components XY of both threads.

Either way, if some code path really wants to shuffle with a writemask
set and it knows what it is doing I guess there is no reason to stand
in the way.

> > 
> > > 
> > > > 
> > > > 
> > > >  }