[Mesa-dev,33/95] i965/vec4: implement d2b

Submitted by Iago Toral Quiroga on July 19, 2016, 10:40 a.m.

Details

Message ID 1468924892-6910-34-git-send-email-itoral@igalia.com
State New
Headers show
Series "i965 Haswell ARB_gpu_shader_fp64 / OpenGL 4.0" ( rev: 2 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Iago Toral Quiroga July 19, 2016, 10:40 a.m.
---
 src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
index 1525a3d..4014020 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
@@ -1497,6 +1497,24 @@  vec4_visitor::nir_emit_alu(nir_alu_instr *instr)
       emit(CMP(dst, op[0], brw_imm_f(0.0f), BRW_CONDITIONAL_NZ));
       break;
 
+   case nir_op_d2b: {
+      /* two-argument instructions can't take 64-bit immediates */
+      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
+      emit(MOV(zero, brw_imm_df(0.0)));
+
+      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
+      emit(CMP(tmp, op[0], src_reg(zero), BRW_CONDITIONAL_NZ));
+
+      /* Convert the double CMP result to a single boolean result. For that
+       * we take the low 32-bit chunk of each DF component in the result.
+       * and do a final MOV to honor the original writemask
+       */
+      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
+      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
+      emit(MOV(dst, src_reg(result)));
+      break;
+   }
+
    case nir_op_i2b:
       emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
       break;

Comments

Iago Toral Quiroga <itoral@igalia.com> writes:

> ---
>  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> index 1525a3d..4014020 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr *instr)
>        emit(CMP(dst, op[0], brw_imm_f(0.0f), BRW_CONDITIONAL_NZ));
>        break;
>  
> +   case nir_op_d2b: {
> +      /* two-argument instructions can't take 64-bit immediates */
> +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> +      emit(MOV(zero, brw_imm_df(0.0)));
> +
> +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> +      emit(CMP(tmp, op[0], src_reg(zero), BRW_CONDITIONAL_NZ));
> +
> +      /* Convert the double CMP result to a single boolean result. For that
> +       * we take the low 32-bit chunk of each DF component in the result.
> +       * and do a final MOV to honor the original writemask
> +       */
> +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> +      emit(MOV(dst, src_reg(result)));

Couldn't you just do a single CMP instruction of the double-precision
argument and a single-precision 0.0 immediate?  I think you could
potentially also use a 32bit destination type on the CMP instruction so
you don't need to emit the PICK_LOW+MOV instructions afterwards.  It may
hit an instruction decompression bug but it could be cleaned up by the
SIMD lowering pass afterwards, AFAICT the result would be two
uncompressed instructions instead of the two uncompressed plus two
compressed instructions above.

> +      break;
> +   }
> +
>     case nir_op_i2b:
>        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
>        break;
> -- 
> 2.7.4
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral@igalia.com> writes:
> 
> > 
> > ---
> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > index 1525a3d..4014020 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > *instr)
> >        emit(CMP(dst, op[0], brw_imm_f(0.0f), BRW_CONDITIONAL_NZ));
> >        break;
> >  
> > +   case nir_op_d2b: {
> > +      /* two-argument instructions can't take 64-bit immediates */
> > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > +      emit(MOV(zero, brw_imm_df(0.0)));
> > +
> > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > +      emit(CMP(tmp, op[0], src_reg(zero), BRW_CONDITIONAL_NZ));
> > +
> > +      /* Convert the double CMP result to a single boolean result.
> > For that
> > +       * we take the low 32-bit chunk of each DF component in the
> > result.
> > +       * and do a final MOV to honor the original writemask
> > +       */
> > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > +      emit(MOV(dst, src_reg(result)));
> Couldn't you just do a single CMP instruction of the double-precision
> argument and a single-precision 0.0 immediate? 

It never occurred to me that we could do that but it seems to work just
fine.

>  I think you could
> potentially also use a 32bit destination type on the CMP instruction
> so
> you don't need to emit the PICK_LOW+MOV instructions afterwards.

This does not seem to work though.

>   It may
> hit an instruction decompression bug but it could be cleaned up by
> the
> SIMD lowering pass afterwards, AFAICT the result would be two
> uncompressed instructions instead of the two uncompressed plus two
> compressed instructions above.

I tried splitting the instruction just in case. Notice that doing this
requires that we improve the splitting pass to support splitting
instructions that write only one register (the original series did not
implement support for that because so far the only cared to split DF
instructions that wrote more than one register). I implemented a couple
of quick hacks to get this done so I could test this but the result is
still incorrect.

For reference, if we don't split, the resulting CMP looks like this
(after full scalarization):

cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };

And if we split the instruction in two we produce this:

cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };

Both sequences of instructions look correct to me as far as the
splitting and DF regioning go, so I guess the 32-bit CMP result is not
doing what we want.

I am not too surprised about this. Even if we use a 32b dst type I
suppose the instruction execution type is still 64b so I guess there
should be a 64b to 32b conversion involved in writing to a 32b result.
Seeing how 64b to 32b conversions require that we emit specific code
using ALIGN1 mode to deal with alignment requirements I guess it is not
too surprising that this does not work, right?

I can send you a trace if you want to have a look at it.

Iago

> > 
> > +      break;
> > +   }
> > +
> >     case nir_op_i2b:
> >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> >        break;
Iago Toral <itoral@igalia.com> writes:

> On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
>> Iago Toral Quiroga <itoral@igalia.com> writes:
>> 
>> > 
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 ++++++++++++++++++
>> >  1 file changed, 18 insertions(+)
>> > 
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > index 1525a3d..4014020 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
>> > *instr)
>> >        emit(CMP(dst, op[0], brw_imm_f(0.0f), BRW_CONDITIONAL_NZ));
>> >        break;
>> >  
>> > +   case nir_op_d2b: {
>> > +      /* two-argument instructions can't take 64-bit immediates */
>> > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
>> > +      emit(MOV(zero, brw_imm_df(0.0)));
>> > +
>> > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
>> > +      emit(CMP(tmp, op[0], src_reg(zero), BRW_CONDITIONAL_NZ));
>> > +
>> > +      /* Convert the double CMP result to a single boolean result.
>> > For that
>> > +       * we take the low 32-bit chunk of each DF component in the
>> > result.
>> > +       * and do a final MOV to honor the original writemask
>> > +       */
>> > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
>> > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
>> > +      emit(MOV(dst, src_reg(result)));
>> Couldn't you just do a single CMP instruction of the double-precision
>> argument and a single-precision 0.0 immediate? 
>
> It never occurred to me that we could do that but it seems to work just
> fine.
>
>>  I think you could
>> potentially also use a 32bit destination type on the CMP instruction
>> so
>> you don't need to emit the PICK_LOW+MOV instructions afterwards.
>
> This does not seem to work though.
>
>>   It may
>> hit an instruction decompression bug but it could be cleaned up by
>> the
>> SIMD lowering pass afterwards, AFAICT the result would be two
>> uncompressed instructions instead of the two uncompressed plus two
>> compressed instructions above.
>
> I tried splitting the instruction just in case. Notice that doing this
> requires that we improve the splitting pass to support splitting
> instructions that write only one register (the original series did not
> implement support for that because so far the only cared to split DF
> instructions that wrote more than one register). I implemented a couple
> of quick hacks to get this done so I could test this but the result is
> still incorrect.
>
> For reference, if we don't split, the resulting CMP looks like this
> (after full scalarization):
>
> cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
>
> And if we split the instruction in two we produce this:
>
> cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
> cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
> cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
> cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
> cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
> cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
> cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
> cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
>
> Both sequences of instructions look correct to me as far as the
> splitting and DF regioning go, so I guess the 32-bit CMP result is not
> doing what we want.
>
> I am not too surprised about this. Even if we use a 32b dst type I
> suppose the instruction execution type is still 64b so I guess there
> should be a 64b to 32b conversion involved in writing to a 32b result.
> Seeing how 64b to 32b conversions require that we emit specific code
> using ALIGN1 mode to deal with alignment requirements I guess it is not
> too surprising that this does not work, right?
>

Yeah, that's not too surprising, don't worry about it.  Another idea
would be to do something along the lines of 'MOV.nz null, <df source
goes here>' to check whether the source is non-zero (so you avoid
loading the immediate which is a non-trivial operation on Gen7), and
then use a single-precision SEL instruction (or two predicated MOV
instructions) to write true or false to the destination (so you don't
need the PICK_LOW+MOV sequence).

> I can send you a trace if you want to have a look at it.
>
> Iago
>
>> > 
>> > +      break;
>> > +   }
>> > +
>> >     case nir_op_i2b:
>> >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
>> >        break;
On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> Iago Toral <itoral@igalia.com> writes:
> 
> > 
> > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 1525a3d..4014020 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > *instr)
> > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > BRW_CONDITIONAL_NZ));
> > > >        break;
> > > >  
> > > > +   case nir_op_d2b: {
> > > > +      /* two-argument instructions can't take 64-bit
> > > > immediates */
> > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > +
> > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > BRW_CONDITIONAL_NZ));
> > > > +
> > > > +      /* Convert the double CMP result to a single boolean
> > > > result.
> > > > For that
> > > > +       * we take the low 32-bit chunk of each DF component in
> > > > the
> > > > result.
> > > > +       * and do a final MOV to honor the original writemask
> > > > +       */
> > > > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > > > +      emit(MOV(dst, src_reg(result)));
> > > Couldn't you just do a single CMP instruction of the double-
> > > precision
> > > argument and a single-precision 0.0 immediate? 
> > It never occurred to me that we could do that but it seems to work
> > just
> > fine.
> > 
> > > 
> > >  I think you could
> > > potentially also use a 32bit destination type on the CMP
> > > instruction
> > > so
> > > you don't need to emit the PICK_LOW+MOV instructions afterwards.
> > This does not seem to work though.
> > 
> > > 
> > >   It may
> > > hit an instruction decompression bug but it could be cleaned up
> > > by
> > > the
> > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > uncompressed instructions instead of the two uncompressed plus
> > > two
> > > compressed instructions above.
> > I tried splitting the instruction just in case. Notice that doing
> > this
> > requires that we improve the splitting pass to support splitting
> > instructions that write only one register (the original series did
> > not
> > implement support for that because so far the only cared to split
> > DF
> > instructions that wrote more than one register). I implemented a
> > couple
> > of quick hacks to get this done so I could test this but the result
> > is
> > still incorrect.
> > 
> > For reference, if we don't split, the resulting CMP looks like this
> > (after full scalarization):
> > 
> > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > 
> > And if we split the instruction in two we produce this:
> > 
> > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
> > 
> > Both sequences of instructions look correct to me as far as the
> > splitting and DF regioning go, so I guess the 32-bit CMP result is
> > not
> > doing what we want.
> > 
> > I am not too surprised about this. Even if we use a 32b dst type I
> > suppose the instruction execution type is still 64b so I guess
> > there
> > should be a 64b to 32b conversion involved in writing to a 32b
> > result.
> > Seeing how 64b to 32b conversions require that we emit specific
> > code
> > using ALIGN1 mode to deal with alignment requirements I guess it is
> > not
> > too surprising that this does not work, right?
> > 
> Yeah, that's not too surprising, don't worry about it.  Another idea
> would be to do something along the lines of 'MOV.nz null, <df source
> goes here>' to check whether the source is non-zero (so you avoid
> loading the immediate which is a non-trivial operation on Gen7), and
> then use a single-precision SEL instruction (or two predicated MOV
> instructions) to write true or false to the destination (so you don't
> need the PICK_LOW+MOV sequence).

Yeah, that sounds like a good idea, I'll give it a try. Thanks!

Iago

> > 
> > I can send you a trace if you want to have a look at it.
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > +      break;
> > > > +   }
> > > > +
> > > >     case nir_op_i2b:
> > > >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> > > >        break;
On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> Iago Toral <itoral@igalia.com> writes:
> 
> > 
> > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 1525a3d..4014020 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > *instr)
> > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > BRW_CONDITIONAL_NZ));
> > > >        break;
> > > >  
> > > > +   case nir_op_d2b: {
> > > > +      /* two-argument instructions can't take 64-bit
> > > > immediates */
> > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > +
> > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > BRW_CONDITIONAL_NZ));
> > > > +
> > > > +      /* Convert the double CMP result to a single boolean
> > > > result.
> > > > For that
> > > > +       * we take the low 32-bit chunk of each DF component in
> > > > the
> > > > result.
> > > > +       * and do a final MOV to honor the original writemask
> > > > +       */
> > > > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > > > +      emit(MOV(dst, src_reg(result)));
> > > Couldn't you just do a single CMP instruction of the double-
> > > precision
> > > argument and a single-precision 0.0 immediate? 
> > It never occurred to me that we could do that but it seems to work
> > just
> > fine.
> > 
> > > 
> > >  I think you could
> > > potentially also use a 32bit destination type on the CMP
> > > instruction
> > > so
> > > you don't need to emit the PICK_LOW+MOV instructions afterwards.
> > This does not seem to work though.
> > 
> > > 
> > >   It may
> > > hit an instruction decompression bug but it could be cleaned up
> > > by
> > > the
> > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > uncompressed instructions instead of the two uncompressed plus
> > > two
> > > compressed instructions above.
> > I tried splitting the instruction just in case. Notice that doing
> > this
> > requires that we improve the splitting pass to support splitting
> > instructions that write only one register (the original series did
> > not
> > implement support for that because so far the only cared to split
> > DF
> > instructions that wrote more than one register). I implemented a
> > couple
> > of quick hacks to get this done so I could test this but the result
> > is
> > still incorrect.
> > 
> > For reference, if we don't split, the resulting CMP looks like this
> > (after full scalarization):
> > 
> > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > 
> > And if we split the instruction in two we produce this:
> > 
> > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
> > 
> > Both sequences of instructions look correct to me as far as the
> > splitting and DF regioning go, so I guess the 32-bit CMP result is
> > not
> > doing what we want.
> > 
> > I am not too surprised about this. Even if we use a 32b dst type I
> > suppose the instruction execution type is still 64b so I guess
> > there
> > should be a 64b to 32b conversion involved in writing to a 32b
> > result.
> > Seeing how 64b to 32b conversions require that we emit specific
> > code
> > using ALIGN1 mode to deal with alignment requirements I guess it is
> > not
> > too surprising that this does not work, right?
> > 
> Yeah, that's not too surprising, don't worry about it.  Another idea
> would be to do something along the lines of 'MOV.nz null, <df source
> goes here>' to check whether the source is non-zero (so you avoid
> loading the immediate which is a non-trivial operation on Gen7), and
> then use a single-precision SEL instruction (or two predicated MOV
> instructions) to write true or false to the destination (so you don't
> need the PICK_LOW+MOV sequence).

Using MOV.nz seems to have a different behavior than CMP.nz. The former
will consider anything other than 0.0 to be non-zero, whereas the CMP
version will consider small enough DF values to be zero as well, which
I think is the behavior we want (I think I remember discussing this
here when we were implementing this for broadwell).

Fortunately, we can do something like this:

   src_reg one = src_reg(this, glsl_type::ivec4_type);
   emit(MOV(dst_reg(one), brw_imm_d(~0)));

   vec4_instruction *inst =
      emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
BRW_CONDITIONAL_NZ));
   inst->regs_written = 1;

   inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
   inst->predicate = BRW_PREDICATE_NORMAL;

which is pretty much the same idea (notice that we use a float constant
for the CMP) and keeps the original behavior.

If you are curious about the inst->regs_written=1 in the CMP it is just
because the vec4_constructor will figure out that because it has a DF
destination it writes two registers, but it really doesn't because it
is a null dst register. If we leave it at 2, then the simd lowering
pass will split the CMP instruction because the 0.0f constant argument
only reads one register. Maybe we should make vec4_instruction set only
1 register written for null dsts or just make the simd lowering pass
ignore the number of registers written for instructions with a null
dst?

Iago

> > 
> > I can send you a trace if you want to have a look at it.
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > +      break;
> > > > +   }
> > > > +
> > > >     case nir_op_i2b:
> > > >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> > > >        break;
Iago Toral <itoral@igalia.com> writes:

> On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
>> Iago Toral <itoral@igalia.com> writes:
>> 
>> > 
>> > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
>> > > 
>> > > Iago Toral Quiroga <itoral@igalia.com> writes:
>> > > 
>> > > > 
>> > > > 
>> > > > ---
>> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
>> > > > ++++++++++++++++++
>> > > >  1 file changed, 18 insertions(+)
>> > > > 
>> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > > > index 1525a3d..4014020 100644
>> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
>> > > > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
>> > > > *instr)
>> > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
>> > > > BRW_CONDITIONAL_NZ));
>> > > >        break;
>> > > >  
>> > > > +   case nir_op_d2b: {
>> > > > +      /* two-argument instructions can't take 64-bit
>> > > > immediates */
>> > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
>> > > > +      emit(MOV(zero, brw_imm_df(0.0)));
>> > > > +
>> > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
>> > > > +      emit(CMP(tmp, op[0], src_reg(zero),
>> > > > BRW_CONDITIONAL_NZ));
>> > > > +
>> > > > +      /* Convert the double CMP result to a single boolean
>> > > > result.
>> > > > For that
>> > > > +       * we take the low 32-bit chunk of each DF component in
>> > > > the
>> > > > result.
>> > > > +       * and do a final MOV to honor the original writemask
>> > > > +       */
>> > > > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
>> > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
>> > > > +      emit(MOV(dst, src_reg(result)));
>> > > Couldn't you just do a single CMP instruction of the double-
>> > > precision
>> > > argument and a single-precision 0.0 immediate? 
>> > It never occurred to me that we could do that but it seems to work
>> > just
>> > fine.
>> > 
>> > > 
>> > >  I think you could
>> > > potentially also use a 32bit destination type on the CMP
>> > > instruction
>> > > so
>> > > you don't need to emit the PICK_LOW+MOV instructions afterwards.
>> > This does not seem to work though.
>> > 
>> > > 
>> > >   It may
>> > > hit an instruction decompression bug but it could be cleaned up
>> > > by
>> > > the
>> > > SIMD lowering pass afterwards, AFAICT the result would be two
>> > > uncompressed instructions instead of the two uncompressed plus
>> > > two
>> > > compressed instructions above.
>> > I tried splitting the instruction just in case. Notice that doing
>> > this
>> > requires that we improve the splitting pass to support splitting
>> > instructions that write only one register (the original series did
>> > not
>> > implement support for that because so far the only cared to split
>> > DF
>> > instructions that wrote more than one register). I implemented a
>> > couple
>> > of quick hacks to get this done so I could test this but the result
>> > is
>> > still incorrect.
>> > 
>> > For reference, if we don't split, the resulting CMP looks like this
>> > (after full scalarization):
>> > 
>> > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
>> > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
>> > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
>> > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
>> > 
>> > And if we split the instruction in two we produce this:
>> > 
>> > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
>> > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
>> > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
>> > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
>> > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
>> > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
>> > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
>> > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
>> > 
>> > Both sequences of instructions look correct to me as far as the
>> > splitting and DF regioning go, so I guess the 32-bit CMP result is
>> > not
>> > doing what we want.
>> > 
>> > I am not too surprised about this. Even if we use a 32b dst type I
>> > suppose the instruction execution type is still 64b so I guess
>> > there
>> > should be a 64b to 32b conversion involved in writing to a 32b
>> > result.
>> > Seeing how 64b to 32b conversions require that we emit specific
>> > code
>> > using ALIGN1 mode to deal with alignment requirements I guess it is
>> > not
>> > too surprising that this does not work, right?
>> > 
>> Yeah, that's not too surprising, don't worry about it.  Another idea
>> would be to do something along the lines of 'MOV.nz null, <df source
>> goes here>' to check whether the source is non-zero (so you avoid
>> loading the immediate which is a non-trivial operation on Gen7), and
>> then use a single-precision SEL instruction (or two predicated MOV
>> instructions) to write true or false to the destination (so you don't
>> need the PICK_LOW+MOV sequence).
>
> Using MOV.nz seems to have a different behavior than CMP.nz. The former
> will consider anything other than 0.0 to be non-zero, whereas the CMP
> version will consider small enough DF values to be zero as well, which
> I think is the behavior we want (I think I remember discussing this
> here when we were implementing this for broadwell).
>
What do you mean by small enough?  Denormalized?  MOV instructions can
flush them to zero as well, only raw moves won't, you can make sure that
the MOV is not raw by applying a source modifier to the source which
shouldn't affect the flag result for normalized sources.

> Fortunately, we can do something like this:
>
>    src_reg one = src_reg(this, glsl_type::ivec4_type);
>    emit(MOV(dst_reg(one), brw_imm_d(~0)));
>
>    vec4_instruction *inst =
>       emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
> BRW_CONDITIONAL_NZ));
>    inst->regs_written = 1;
>
>    inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
>    inst->predicate = BRW_PREDICATE_NORMAL;
>
> which is pretty much the same idea (notice that we use a float constant
> for the CMP) and keeps the original behavior.
>
Hm, I thought you said that using an F immediate on a double-precision
instruction doesn't work?

> If you are curious about the inst->regs_written=1 in the CMP it is
> just because the vec4_constructor will figure out that because it has
> a DF destination it writes two registers, but it really doesn't
> because it is a null dst register.  If we leave it at 2, then the simd
> lowering pass will split the CMP instruction because the 0.0f constant
> argument only reads one register.

That seems like a bug.  There is no need to lower an instruction with a
mismatch in the number of registers read and written if the source is
uniform:

|  When destination spans two registers, the source MUST span two
|  registers. The exception to the above rule:
|
|   - When source is scalar, the source registers are not incremented.
|   - When source is packed integer Word and destination is
|     packed integer DWord, the source register is not incremented
|     but the source sub register is incremented.

The FS back-end implements both exceptions, but there's probably no need
to implement the second one in the VEC4 back-end.

> Maybe we should make vec4_instruction set only 1 register
> written for null dsts or just make the simd lowering pass ignore the
> number of registers written for instructions with a null dst?
>
> Iago
>
>> > 
>> > I can send you a trace if you want to have a look at it.
>> > 
>> > Iago
>> > 
>> > > 
>> > > > 
>> > > > 
>> > > > +      break;
>> > > > +   }
>> > > > +
>> > > >     case nir_op_i2b:
>> > > >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
>> > > >        break;
On Thu, 2016-08-18 at 12:08 -0700, Francisco Jerez wrote:
> Iago Toral <itoral@igalia.com> writes:
> 
> > 
> > On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral <itoral@igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > > > 
> > > > > 
> > > > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ---
> > > > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > > > ++++++++++++++++++
> > > > > >  1 file changed, 18 insertions(+)
> > > > > > 
> > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > index 1525a3d..4014020 100644
> > > > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > @@ -1497,6 +1497,24 @@
> > > > > > vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > > > *instr)
> > > > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > >        break;
> > > > > >  
> > > > > > +   case nir_op_d2b: {
> > > > > > +      /* two-argument instructions can't take 64-bit
> > > > > > immediates */
> > > > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > > > +
> > > > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > > +
> > > > > > +      /* Convert the double CMP result to a single boolean
> > > > > > result.
> > > > > > For that
> > > > > > +       * we take the low 32-bit chunk of each DF component
> > > > > > in
> > > > > > the
> > > > > > result.
> > > > > > +       * and do a final MOV to honor the original
> > > > > > writemask
> > > > > > +       */
> > > > > > +      dst_reg result = dst_reg(this,
> > > > > > glsl_type::bvec4_type);
> > > > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result,
> > > > > > src_reg(tmp));
> > > > > > +      emit(MOV(dst, src_reg(result)));
> > > > > Couldn't you just do a single CMP instruction of the double-
> > > > > precision
> > > > > argument and a single-precision 0.0 immediate? 
> > > > It never occurred to me that we could do that but it seems to
> > > > work
> > > > just
> > > > fine.
> > > > 
> > > > > 
> > > > > 
> > > > >  I think you could
> > > > > potentially also use a 32bit destination type on the CMP
> > > > > instruction
> > > > > so
> > > > > you don't need to emit the PICK_LOW+MOV instructions
> > > > > afterwards.
> > > > This does not seem to work though.
> > > > 
> > > > > 
> > > > > 
> > > > >   It may
> > > > > hit an instruction decompression bug but it could be cleaned
> > > > > up
> > > > > by
> > > > > the
> > > > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > > > uncompressed instructions instead of the two uncompressed
> > > > > plus
> > > > > two
> > > > > compressed instructions above.
> > > > I tried splitting the instruction just in case. Notice that
> > > > doing
> > > > this
> > > > requires that we improve the splitting pass to support
> > > > splitting
> > > > instructions that write only one register (the original series
> > > > did
> > > > not
> > > > implement support for that because so far the only cared to
> > > > split
> > > > DF
> > > > instructions that wrote more than one register). I implemented
> > > > a
> > > > couple
> > > > of quick hacks to get this done so I could test this but the
> > > > result
> > > > is
> > > > still incorrect.
> > > > 
> > > > For reference, if we don't split, the resulting CMP looks like
> > > > this
> > > > (after full scalarization):
> > > > 
> > > > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > > > 
> > > > And if we split the instruction in two we produce this:
> > > > 
> > > > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N
> > > > };
> > > > 
> > > > Both sequences of instructions look correct to me as far as the
> > > > splitting and DF regioning go, so I guess the 32-bit CMP result
> > > > is
> > > > not
> > > > doing what we want.
> > > > 
> > > > I am not too surprised about this. Even if we use a 32b dst
> > > > type I
> > > > suppose the instruction execution type is still 64b so I guess
> > > > there
> > > > should be a 64b to 32b conversion involved in writing to a 32b
> > > > result.
> > > > Seeing how 64b to 32b conversions require that we emit specific
> > > > code
> > > > using ALIGN1 mode to deal with alignment requirements I guess
> > > > it is
> > > > not
> > > > too surprising that this does not work, right?
> > > > 
> > > Yeah, that's not too surprising, don't worry about it.  Another
> > > idea
> > > would be to do something along the lines of 'MOV.nz null, <df
> > > source
> > > goes here>' to check whether the source is non-zero (so you avoid
> > > loading the immediate which is a non-trivial operation on Gen7),
> > > and
> > > then use a single-precision SEL instruction (or two predicated
> > > MOV
> > > instructions) to write true or false to the destination (so you
> > > don't
> > > need the PICK_LOW+MOV sequence).
> > Using MOV.nz seems to have a different behavior than CMP.nz. The
> > former
> > will consider anything other than 0.0 to be non-zero, whereas the
> > CMP
> > version will consider small enough DF values to be zero as well,
> > which
> > I think is the behavior we want (I think I remember discussing this
> > here when we were implementing this for broadwell).
> > 
> What do you mean by small enough?  Denormalized? 

Yes, denormalized values, sorry I wasn't more clear.

>  MOV instructions can
> flush them to zero as well, only raw moves won't, you can make sure
> that
> the MOV is not raw by applying a source modifier to the source which
> shouldn't affect the flag result for normalized sources.

Oh, didn't know that, I can try this.

> > 
> > Fortunately, we can do something like this:
> > 
> >    src_reg one = src_reg(this, glsl_type::ivec4_type);
> >    emit(MOV(dst_reg(one), brw_imm_d(~0)));
> > 
> >    vec4_instruction *inst =
> >       emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
> > BRW_CONDITIONAL_NZ));
> >    inst->regs_written = 1;
> > 
> >    inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
> >    inst->predicate = BRW_PREDICATE_NORMAL;
> > 
> > which is pretty much the same idea (notice that we use a float
> > constant
> > for the CMP) and keeps the original behavior.
> > 
> Hm, I thought you said that using an F immediate on a double-
> precision
> instruction doesn't work?

No, that seems to work fine, at least if the float constant is 0.0 it
seems to produce correct results in all the cases I tested. What
doesn't work is the part where we also make the CMP produce a 32-bit
result.

> > 
> > If you are curious about the inst->regs_written=1 in the CMP it is
> > just because the vec4_constructor will figure out that because it
> > has
> > a DF destination it writes two registers, but it really doesn't
> > because it is a null dst register.  If we leave it at 2, then the
> > simd
> > lowering pass will split the CMP instruction because the 0.0f
> > constant
> > argument only reads one register.
> That seems like a bug.  There is no need to lower an instruction with
> a
> mismatch in the number of registers read and written if the source is
> uniform:
> 
> > 
> >  When destination spans two registers, the source MUST span two
> >  registers. The exception to the above rule:
> > 
> >   - When source is scalar, the source registers are not
> > incremented.
> >   - When source is packed integer Word and destination is
> >     packed integer DWord, the source register is not incremented
> >     but the source sub register is incremented.
> The FS back-end implements both exceptions, but there's probably no
> need
> to implement the second one in the VEC4 back-end.
> 
> > 
> > Maybe we should make vec4_instruction set only 1 register
> > written for null dsts or just make the simd lowering pass ignore
> > the
> > number of registers written for instructions with a null dst?
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > I can send you a trace if you want to have a look at it.
> > > > 
> > > > Iago
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > +      break;
> > > > > > +   }
> > > > > > +
> > > > > >     case nir_op_i2b:
> > > > > >        emit(CMP(dst, op[0], brw_imm_d(0),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > >        break;
On Fri, 2016-08-19 at 08:50 +0200, Iago Toral wrote:
> On Thu, 2016-08-18 at 12:08 -0700, Francisco Jerez wrote:
> > 
> > Iago Toral <itoral@igalia.com> writes:
> > 
> > > 
> > > 
> > > On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> > > > 
> > > > 
> > > > Iago Toral <itoral@igalia.com> writes:
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Iago Toral Quiroga <itoral@igalia.com> writes:
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > ---
> > > > > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > > > > ++++++++++++++++++
> > > > > > >  1 file changed, 18 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > index 1525a3d..4014020 100644
> > > > > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > @@ -1497,6 +1497,24 @@
> > > > > > > vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > > > > *instr)
> > > > > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > >        break;
> > > > > > >  
> > > > > > > +   case nir_op_d2b: {
> > > > > > > +      /* two-argument instructions can't take 64-bit
> > > > > > > immediates */
> > > > > > > +      dst_reg zero = dst_reg(this,
> > > > > > > glsl_type::dvec4_type);
> > > > > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > > > > +
> > > > > > > +      dst_reg tmp = dst_reg(this,
> > > > > > > glsl_type::dvec4_type);
> > > > > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > > +
> > > > > > > +      /* Convert the double CMP result to a single
> > > > > > > boolean
> > > > > > > result.
> > > > > > > For that
> > > > > > > +       * we take the low 32-bit chunk of each DF
> > > > > > > component
> > > > > > > in
> > > > > > > the
> > > > > > > result.
> > > > > > > +       * and do a final MOV to honor the original
> > > > > > > writemask
> > > > > > > +       */
> > > > > > > +      dst_reg result = dst_reg(this,
> > > > > > > glsl_type::bvec4_type);
> > > > > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result,
> > > > > > > src_reg(tmp));
> > > > > > > +      emit(MOV(dst, src_reg(result)));
> > > > > > Couldn't you just do a single CMP instruction of the
> > > > > > double-
> > > > > > precision
> > > > > > argument and a single-precision 0.0 immediate? 
> > > > > It never occurred to me that we could do that but it seems to
> > > > > work
> > > > > just
> > > > > fine.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  I think you could
> > > > > > potentially also use a 32bit destination type on the CMP
> > > > > > instruction
> > > > > > so
> > > > > > you don't need to emit the PICK_LOW+MOV instructions
> > > > > > afterwards.
> > > > > This does not seem to work though.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >   It may
> > > > > > hit an instruction decompression bug but it could be
> > > > > > cleaned
> > > > > > up
> > > > > > by
> > > > > > the
> > > > > > SIMD lowering pass afterwards, AFAICT the result would be
> > > > > > two
> > > > > > uncompressed instructions instead of the two uncompressed
> > > > > > plus
> > > > > > two
> > > > > > compressed instructions above.
> > > > > I tried splitting the instruction just in case. Notice that
> > > > > doing
> > > > > this
> > > > > requires that we improve the splitting pass to support
> > > > > splitting
> > > > > instructions that write only one register (the original
> > > > > series
> > > > > did
> > > > > not
> > > > > implement support for that because so far the only cared to
> > > > > split
> > > > > DF
> > > > > instructions that wrote more than one register). I
> > > > > implemented
> > > > > a
> > > > > couple
> > > > > of quick hacks to get this done so I could test this but the
> > > > > result
> > > > > is
> > > > > still incorrect.
> > > > > 
> > > > > For reference, if we don't split, the resulting CMP looks
> > > > > like
> > > > > this
> > > > > (after full scalarization):
> > > > > 
> > > > > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q
> > > > > };
> > > > > 
> > > > > And if we split the instruction in two we produce this:
> > > > > 
> > > > > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16
> > > > > 2N
> > > > > };
> > > > > 
> > > > > Both sequences of instructions look correct to me as far as
> > > > > the
> > > > > splitting and DF regioning go, so I guess the 32-bit CMP
> > > > > result
> > > > > is
> > > > > not
> > > > > doing what we want.
> > > > > 
> > > > > I am not too surprised about this. Even if we use a 32b dst
> > > > > type I
> > > > > suppose the instruction execution type is still 64b so I
> > > > > guess
> > > > > there
> > > > > should be a 64b to 32b conversion involved in writing to a
> > > > > 32b
> > > > > result.
> > > > > Seeing how 64b to 32b conversions require that we emit
> > > > > specific
> > > > > code
> > > > > using ALIGN1 mode to deal with alignment requirements I guess
> > > > > it is
> > > > > not
> > > > > too surprising that this does not work, right?
> > > > > 
> > > > Yeah, that's not too surprising, don't worry about it.  Another
> > > > idea
> > > > would be to do something along the lines of 'MOV.nz null, <df
> > > > source
> > > > goes here>' to check whether the source is non-zero (so you
> > > > avoid
> > > > loading the immediate which is a non-trivial operation on
> > > > Gen7),
> > > > and
> > > > then use a single-precision SEL instruction (or two predicated
> > > > MOV
> > > > instructions) to write true or false to the destination (so you
> > > > don't
> > > > need the PICK_LOW+MOV sequence).
> > > Using MOV.nz seems to have a different behavior than CMP.nz. The
> > > former
> > > will consider anything other than 0.0 to be non-zero, whereas the
> > > CMP
> > > version will consider small enough DF values to be zero as well,
> > > which
> > > I think is the behavior we want (I think I remember discussing
> > > this
> > > here when we were implementing this for broadwell).
> > > 
> > What do you mean by small enough?  Denormalized? 
> Yes, denormalized values, sorry I wasn't more clear.
> 
> > 
> >  MOV instructions can
> > flush them to zero as well, only raw moves won't, you can make sure
> > that
> > the MOV is not raw by applying a source modifier to the source
> > which
> > shouldn't affect the flag result for normalized sources.
> Oh, didn't know that, I can try this.

This works. I'll go with this approach instead of the CMP with float
constant solution since we can explain exactly why this solution works.

Iago

> > 
> > > 
> > > 
> > > Fortunately, we can do something like this:
> > > 
> > >    src_reg one = src_reg(this, glsl_type::ivec4_type);
> > >    emit(MOV(dst_reg(one), brw_imm_d(~0)));
> > > 
> > >    vec4_instruction *inst =
> > >       emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
> > > BRW_CONDITIONAL_NZ));
> > >    inst->regs_written = 1;
> > > 
> > >    inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
> > >    inst->predicate = BRW_PREDICATE_NORMAL;
> > > 
> > > which is pretty much the same idea (notice that we use a float
> > > constant
> > > for the CMP) and keeps the original behavior.
> > > 
> > Hm, I thought you said that using an F immediate on a double-
> > precision
> > instruction doesn't work?
> No, that seems to work fine, at least if the float constant is 0.0 it
> seems to produce correct results in all the cases I tested. What
> doesn't work is the part where we also make the CMP produce a 32-bit
> result.
> 
> > 
> > > 
> > > 
> > > If you are curious about the inst->regs_written=1 in the CMP it
> > > is
> > > just because the vec4_constructor will figure out that because it
> > > has
> > > a DF destination it writes two registers, but it really doesn't
> > > because it is a null dst register.  If we leave it at 2, then the
> > > simd
> > > lowering pass will split the CMP instruction because the 0.0f
> > > constant
> > > argument only reads one register.
> > That seems like a bug.  There is no need to lower an instruction
> > with
> > a
> > mismatch in the number of registers read and written if the source
> > is
> > uniform:
> > 
> > > 
> > > 
> > >  When destination spans two registers, the source MUST span two
> > >  registers. The exception to the above rule:
> > > 
> > >   - When source is scalar, the source registers are not
> > > incremented.
> > >   - When source is packed integer Word and destination is
> > >     packed integer DWord, the source register is not incremented
> > >     but the source sub register is incremented.
> > The FS back-end implements both exceptions, but there's probably no
> > need
> > to implement the second one in the VEC4 back-end.
> > 
> > > 
> > > 
> > > Maybe we should make vec4_instruction set only 1 register
> > > written for null dsts or just make the simd lowering pass ignore
> > > the
> > > number of registers written for instructions with a null dst?
> > > 
> > > Iago
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > I can send you a trace if you want to have a look at it.
> > > > > 
> > > > > Iago
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > +      break;
> > > > > > > +   }
> > > > > > > +
> > > > > > >     case nir_op_i2b:
> > > > > > >        emit(CMP(dst, op[0], brw_imm_d(0),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > >        break;