[Mesa-dev,5/8] glsl: Rewrite atan2 implementation to fix accuracy and handling of zero/infinity.

Submitted by Francisco Jerez on Jan. 24, 2017, 11:26 p.m.

Details

Message ID 20170124232646.29544-5-currojerez@riseup.net
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Francisco Jerez Jan. 24, 2017, 11:26 p.m.
This addresses several issues of the current atan2 implementation:

 - Negative zero (and negative denorms which end up getting flushed to
   zero) isn't handled correctly by the current implementation.  The
   reason is that it does 'y >= 0' and 'x < 0' comparisons to decide
   on which side of the branch cut the argument is, which causes us to
   return incorrect results (off by up to 2π) for very small negative
   values.

 - There is a serious precision problem for x values of large enough
   magnitude introduced by the floating point division operation being
   implemented as a mul+rcp sequence.  This can lead to the quotient
   getting flushed to zero in some cases introducing an error of over
   8e6 ULP in the result -- Or in the most catastrophic case will
   cause us to return NaN instead of the correct value ±π/2 for y=±∞
   and x very large.  We can fix this easily by scaling down both
   arguments when the absolute value of the denominator goes above
   certain threshold.  The error of this atan2 implementation remains
   below 25 ULP in most of its domain except for a neighborhood of y=0
   where it reaches a maximum error of about 180 ULP.

 - It emits a bunch of instructions including no less than three
   if-else branches per scalar component that don't seem to get
   optimized out later on.  This implementation uses about 13% less
   instructions on Intel SKL hardware and doesn't emit any control
   flow instructions.
---
 src/compiler/glsl/builtin_functions.cpp | 82 ++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 36 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/compiler/glsl/builtin_functions.cpp b/src/compiler/glsl/builtin_functions.cpp
index 4a6c5af..fd59381 100644
--- a/src/compiler/glsl/builtin_functions.cpp
+++ b/src/compiler/glsl/builtin_functions.cpp
@@ -3560,44 +3560,54 @@  builtin_builder::_acos(const glsl_type *type)
 ir_function_signature *
 builtin_builder::_atan2(const glsl_type *type)
 {
-   ir_variable *vec_y = in_var(type, "vec_y");
-   ir_variable *vec_x = in_var(type, "vec_x");
-   MAKE_SIG(type, always_available, 2, vec_y, vec_x);
-
-   ir_variable *vec_result = body.make_temp(type, "vec_result");
-   ir_variable *r = body.make_temp(glsl_type::float_type, "r");
-   for (int i = 0; i < type->vector_elements; i++) {
-      ir_variable *y = body.make_temp(glsl_type::float_type, "y");
-      ir_variable *x = body.make_temp(glsl_type::float_type, "x");
-      body.emit(assign(y, swizzle(vec_y, i, 1)));
-      body.emit(assign(x, swizzle(vec_x, i, 1)));
-
-      /* If |x| >= 1.0e-8 * |y|: */
-      ir_if *outer_if =
-         new(mem_ctx) ir_if(greater(abs(x), mul(imm(1.0e-8f), abs(y))));
-
-      ir_factory outer_then(&outer_if->then_instructions, mem_ctx);
-
-      /* Then...call atan(y/x) */
-      do_atan(outer_then, glsl_type::float_type, r, div(y, x));
-
-      /*     ...and fix it up: */
-      ir_if *inner_if = new(mem_ctx) ir_if(less(x, imm(0.0f)));
-      inner_if->then_instructions.push_tail(
-         if_tree(gequal(y, imm(0.0f)),
-                 assign(r, add(r, imm(M_PIf))),
-                 assign(r, sub(r, imm(M_PIf)))));
-      outer_then.emit(inner_if);
-
-      /* Else... */
-      outer_if->else_instructions.push_tail(
-         assign(r, mul(sign(y), imm(M_PI_2f))));
+   const unsigned n = type->vector_elements;
+   ir_variable *y = in_var(type, "y");
+   ir_variable *x = in_var(type, "x");
+   MAKE_SIG(type, always_available, 2, y, x);
 
-      body.emit(outer_if);
+   /* If we're on the left half-plane rotate the coordinates π/2 clock-wise
+    * for the y=0 discontinuity to end up aligned with the vertical
+    * discontinuity of atan(s/t) along t=0.
+    */
+   ir_variable *flip = body.make_temp(glsl_type::bvec(n), "flip");
+   body.emit(assign(flip, less(x, imm(0.0f, n))));
+   ir_variable *s = body.make_temp(type, "s");
+   body.emit(assign(s, csel(flip, abs(x), y)));
+   ir_variable *t = body.make_temp(type, "t");
+   body.emit(assign(t, csel(flip, y, abs(x))));
 
-      body.emit(assign(vec_result, r, 1 << i));
-   }
-   body.emit(ret(vec_result));
+   /* If the magnitude of the denominator exceeds some huge value, scale down
+    * the arguments in order to prevent the reciprocal operation from flushing
+    * its result to zero, which would cause precision problems, and for s
+    * infinite would cause us to return a NaN instead of the correct finite
+    * value.
+    */
+   ir_constant *huge = imm(1e37f, n);
+   ir_variable *scale = body.make_temp(type, "scale");
+   body.emit(assign(scale, csel(gequal(abs(t), huge),
+                                imm(0.0625f, n), imm(1.0f, n))));
+   ir_variable *rcp_scaled_t = body.make_temp(type, "rcp_scaled_t");
+   body.emit(assign(rcp_scaled_t, rcp(mul(t, scale))));
+   ir_expression *s_over_t = mul(mul(s, scale), rcp_scaled_t);
+
+   /* Calculate the arctangent and fix up the result if we had flipped the
+    * coordinate system.
+    */
+   ir_variable *arc = body.make_temp(type, "arc");
+   do_atan(body, type, arc, abs(s_over_t));
+   body.emit(assign(arc, add(arc, mul(b2f(flip), imm(M_PI_2f)))));
+
+   /* Rather convoluted calculation of the sign of the result.  When x < 0 we
+    * cannot use fsign because we need to be able to distinguish between
+    * negative and positive zero.  Unfortunately we cannot use bitwise
+    * arithmetic tricks either because of back-ends without integer support.
+    * When x >= 0 rcp_scaled_t will always be non-negative so this won't be
+    * able to distinguish between negative and positive zero, but we don't
+    * care because atan2 is continuous along the whole positive y = 0
+    * half-line, so it won't affect the result.
+    */
+   body.emit(ret(csel(less(min2(y, rcp_scaled_t), imm(0.0f, n)),
+                      neg(arc), arc)));
 
    return sig;
 }

Comments

It's a real bummer that we have two implementations of this function
that are basically written in assembly... I'm not sure what else you'd
call generating IR by hand.  The code review and maintenance costs are
of the same magnitude for sure.

We could move this to GLSL and let the standalone compiler generate the
builder code.  I don't think that is currently helpful.  However, for
future "soft" int64 and fp64 work the standalone compiler will need to
be extended to also generate NIR builder.  Once that is done, I think
the cost-benefit analysis changes.

On 01/24/2017 03:26 PM, Francisco Jerez wrote:
> This addresses several issues of the current atan2 implementation:
> 
>  - Negative zero (and negative denorms which end up getting flushed to
>    zero) isn't handled correctly by the current implementation.  The
>    reason is that it does 'y >= 0' and 'x < 0' comparisons to decide
>    on which side of the branch cut the argument is, which causes us to
>    return incorrect results (off by up to 2π) for very small negative
>    values.
> 
>  - There is a serious precision problem for x values of large enough
>    magnitude introduced by the floating point division operation being
>    implemented as a mul+rcp sequence.  This can lead to the quotient
>    getting flushed to zero in some cases introducing an error of over
>    8e6 ULP in the result -- Or in the most catastrophic case will
>    cause us to return NaN instead of the correct value ±π/2 for y=±∞
>    and x very large.  We can fix this easily by scaling down both
>    arguments when the absolute value of the denominator goes above
>    certain threshold.  The error of this atan2 implementation remains
>    below 25 ULP in most of its domain except for a neighborhood of y=0
>    where it reaches a maximum error of about 180 ULP.
> 
>  - It emits a bunch of instructions including no less than three
>    if-else branches per scalar component that don't seem to get
>    optimized out later on.  This implementation uses about 13% less
>    instructions on Intel SKL hardware and doesn't emit any control
>    flow instructions.
> ---
>  src/compiler/glsl/builtin_functions.cpp | 82 ++++++++++++++++++---------------
>  1 file changed, 46 insertions(+), 36 deletions(-)
> 
> diff --git a/src/compiler/glsl/builtin_functions.cpp b/src/compiler/glsl/builtin_functions.cpp
> index 4a6c5af..fd59381 100644
> --- a/src/compiler/glsl/builtin_functions.cpp
> +++ b/src/compiler/glsl/builtin_functions.cpp
> @@ -3560,44 +3560,54 @@ builtin_builder::_acos(const glsl_type *type)
>  ir_function_signature *
>  builtin_builder::_atan2(const glsl_type *type)
>  {
> -   ir_variable *vec_y = in_var(type, "vec_y");
> -   ir_variable *vec_x = in_var(type, "vec_x");
> -   MAKE_SIG(type, always_available, 2, vec_y, vec_x);
> -
> -   ir_variable *vec_result = body.make_temp(type, "vec_result");
> -   ir_variable *r = body.make_temp(glsl_type::float_type, "r");
> -   for (int i = 0; i < type->vector_elements; i++) {
> -      ir_variable *y = body.make_temp(glsl_type::float_type, "y");
> -      ir_variable *x = body.make_temp(glsl_type::float_type, "x");
> -      body.emit(assign(y, swizzle(vec_y, i, 1)));
> -      body.emit(assign(x, swizzle(vec_x, i, 1)));
> -
> -      /* If |x| >= 1.0e-8 * |y|: */
> -      ir_if *outer_if =
> -         new(mem_ctx) ir_if(greater(abs(x), mul(imm(1.0e-8f), abs(y))));
> -
> -      ir_factory outer_then(&outer_if->then_instructions, mem_ctx);
> -
> -      /* Then...call atan(y/x) */
> -      do_atan(outer_then, glsl_type::float_type, r, div(y, x));
> -
> -      /*     ...and fix it up: */
> -      ir_if *inner_if = new(mem_ctx) ir_if(less(x, imm(0.0f)));
> -      inner_if->then_instructions.push_tail(
> -         if_tree(gequal(y, imm(0.0f)),
> -                 assign(r, add(r, imm(M_PIf))),
> -                 assign(r, sub(r, imm(M_PIf)))));
> -      outer_then.emit(inner_if);
> -
> -      /* Else... */
> -      outer_if->else_instructions.push_tail(
> -         assign(r, mul(sign(y), imm(M_PI_2f))));
> +   const unsigned n = type->vector_elements;
> +   ir_variable *y = in_var(type, "y");
> +   ir_variable *x = in_var(type, "x");
> +   MAKE_SIG(type, always_available, 2, y, x);
>  
> -      body.emit(outer_if);
> +   /* If we're on the left half-plane rotate the coordinates π/2 clock-wise
> +    * for the y=0 discontinuity to end up aligned with the vertical
> +    * discontinuity of atan(s/t) along t=0.
> +    */
> +   ir_variable *flip = body.make_temp(glsl_type::bvec(n), "flip");
> +   body.emit(assign(flip, less(x, imm(0.0f, n))));
> +   ir_variable *s = body.make_temp(type, "s");
> +   body.emit(assign(s, csel(flip, abs(x), y)));
> +   ir_variable *t = body.make_temp(type, "t");
> +   body.emit(assign(t, csel(flip, y, abs(x))));
>  
> -      body.emit(assign(vec_result, r, 1 << i));
> -   }
> -   body.emit(ret(vec_result));
> +   /* If the magnitude of the denominator exceeds some huge value, scale down
> +    * the arguments in order to prevent the reciprocal operation from flushing
> +    * its result to zero, which would cause precision problems, and for s
> +    * infinite would cause us to return a NaN instead of the correct finite
> +    * value.
> +    */
> +   ir_constant *huge = imm(1e37f, n);
> +   ir_variable *scale = body.make_temp(type, "scale");
> +   body.emit(assign(scale, csel(gequal(abs(t), huge),
> +                                imm(0.0625f, n), imm(1.0f, n))));

Out of curiosity, how did you arrive at 1/16 for the scale?  I wonder if
r300-era GPUs would want a different huge-val and scale.  I'm pretty
sure r300 still only used 24-bit floats.  I think all the old NVIDIA and
Intel GPUs used 32-bit... I don't think there are any Mesa drivers for
other weird GPUs of that era.

> +   ir_variable *rcp_scaled_t = body.make_temp(type, "rcp_scaled_t");
> +   body.emit(assign(rcp_scaled_t, rcp(mul(t, scale))));
> +   ir_expression *s_over_t = mul(mul(s, scale), rcp_scaled_t);

If I'm reading this right,

    s_over_t = (s * scale) * rcp(t * scale);

I want to make sure we have a specific test case that will fail if
someone tries to add an algebraic optimization that tries to remove the
'* scale'.

> +
> +   /* Calculate the arctangent and fix up the result if we had flipped the
> +    * coordinate system.
> +    */
> +   ir_variable *arc = body.make_temp(type, "arc");
> +   do_atan(body, type, arc, abs(s_over_t));
> +   body.emit(assign(arc, add(arc, mul(b2f(flip), imm(M_PI_2f)))));

I'll spare the details of the deep rat hole I went down with algebraic
optimizations related to things like  b ? x : x + y vs x + float(b) * y.
 In this particular case, I think x + csel(flip, M_PI_2, 0) is
marginally better because the csel can be scheduled early and flip
killed.  Eh... but a csel with two immediate values usually results in
dumb code.  I'm already heading back down the rat hole... never mind.

> +
> +   /* Rather convoluted calculation of the sign of the result.  When x < 0 we
> +    * cannot use fsign because we need to be able to distinguish between
> +    * negative and positive zero.  Unfortunately we cannot use bitwise
> +    * arithmetic tricks either because of back-ends without integer support.
> +    * When x >= 0 rcp_scaled_t will always be non-negative so this won't be
> +    * able to distinguish between negative and positive zero, but we don't
> +    * care because atan2 is continuous along the whole positive y = 0
> +    * half-line, so it won't affect the result.
> +    */
> +   body.emit(ret(csel(less(min2(y, rcp_scaled_t), imm(0.0f, n)),
> +                      neg(arc), arc)));

FWIW, b ? -x : x vs. -float(b) * x was part of the previous rat hole.

>  
>     return sig;
>  }
>
Ian Romanick <idr@freedesktop.org> writes:

> It's a real bummer that we have two implementations of this function
> that are basically written in assembly... I'm not sure what else you'd
> call generating IR by hand.  The code review and maintenance costs are
> of the same magnitude for sure.
>
> We could move this to GLSL and let the standalone compiler generate the
> builder code.  I don't think that is currently helpful.  However, for
> future "soft" int64 and fp64 work the standalone compiler will need to
> be extended to also generate NIR builder.  Once that is done, I think
> the cost-benefit analysis changes.
>

Yes, I agree...  It was a real PITA to have to fix two different
implementations of atan2, we should come up with some way to share the
lowering of these built-ins even if they're still written in assembly --
It would still halve the amount of pain we inflict on ourselves.

> On 01/24/2017 03:26 PM, Francisco Jerez wrote:
>> This addresses several issues of the current atan2 implementation:
>> 
>>  - Negative zero (and negative denorms which end up getting flushed to
>>    zero) isn't handled correctly by the current implementation.  The
>>    reason is that it does 'y >= 0' and 'x < 0' comparisons to decide
>>    on which side of the branch cut the argument is, which causes us to
>>    return incorrect results (off by up to 2π) for very small negative
>>    values.
>> 
>>  - There is a serious precision problem for x values of large enough
>>    magnitude introduced by the floating point division operation being
>>    implemented as a mul+rcp sequence.  This can lead to the quotient
>>    getting flushed to zero in some cases introducing an error of over
>>    8e6 ULP in the result -- Or in the most catastrophic case will
>>    cause us to return NaN instead of the correct value ±π/2 for y=±∞
>>    and x very large.  We can fix this easily by scaling down both
>>    arguments when the absolute value of the denominator goes above
>>    certain threshold.  The error of this atan2 implementation remains
>>    below 25 ULP in most of its domain except for a neighborhood of y=0
>>    where it reaches a maximum error of about 180 ULP.
>> 
>>  - It emits a bunch of instructions including no less than three
>>    if-else branches per scalar component that don't seem to get
>>    optimized out later on.  This implementation uses about 13% less
>>    instructions on Intel SKL hardware and doesn't emit any control
>>    flow instructions.
>> ---
>>  src/compiler/glsl/builtin_functions.cpp | 82 ++++++++++++++++++---------------
>>  1 file changed, 46 insertions(+), 36 deletions(-)
>> 
>> diff --git a/src/compiler/glsl/builtin_functions.cpp b/src/compiler/glsl/builtin_functions.cpp
>> index 4a6c5af..fd59381 100644
>> --- a/src/compiler/glsl/builtin_functions.cpp
>> +++ b/src/compiler/glsl/builtin_functions.cpp
>> @@ -3560,44 +3560,54 @@ builtin_builder::_acos(const glsl_type *type)
>>  ir_function_signature *
>>  builtin_builder::_atan2(const glsl_type *type)
>>  {
>> -   ir_variable *vec_y = in_var(type, "vec_y");
>> -   ir_variable *vec_x = in_var(type, "vec_x");
>> -   MAKE_SIG(type, always_available, 2, vec_y, vec_x);
>> -
>> -   ir_variable *vec_result = body.make_temp(type, "vec_result");
>> -   ir_variable *r = body.make_temp(glsl_type::float_type, "r");
>> -   for (int i = 0; i < type->vector_elements; i++) {
>> -      ir_variable *y = body.make_temp(glsl_type::float_type, "y");
>> -      ir_variable *x = body.make_temp(glsl_type::float_type, "x");
>> -      body.emit(assign(y, swizzle(vec_y, i, 1)));
>> -      body.emit(assign(x, swizzle(vec_x, i, 1)));
>> -
>> -      /* If |x| >= 1.0e-8 * |y|: */
>> -      ir_if *outer_if =
>> -         new(mem_ctx) ir_if(greater(abs(x), mul(imm(1.0e-8f), abs(y))));
>> -
>> -      ir_factory outer_then(&outer_if->then_instructions, mem_ctx);
>> -
>> -      /* Then...call atan(y/x) */
>> -      do_atan(outer_then, glsl_type::float_type, r, div(y, x));
>> -
>> -      /*     ...and fix it up: */
>> -      ir_if *inner_if = new(mem_ctx) ir_if(less(x, imm(0.0f)));
>> -      inner_if->then_instructions.push_tail(
>> -         if_tree(gequal(y, imm(0.0f)),
>> -                 assign(r, add(r, imm(M_PIf))),
>> -                 assign(r, sub(r, imm(M_PIf)))));
>> -      outer_then.emit(inner_if);
>> -
>> -      /* Else... */
>> -      outer_if->else_instructions.push_tail(
>> -         assign(r, mul(sign(y), imm(M_PI_2f))));
>> +   const unsigned n = type->vector_elements;
>> +   ir_variable *y = in_var(type, "y");
>> +   ir_variable *x = in_var(type, "x");
>> +   MAKE_SIG(type, always_available, 2, y, x);
>>  
>> -      body.emit(outer_if);
>> +   /* If we're on the left half-plane rotate the coordinates π/2 clock-wise
>> +    * for the y=0 discontinuity to end up aligned with the vertical
>> +    * discontinuity of atan(s/t) along t=0.
>> +    */
>> +   ir_variable *flip = body.make_temp(glsl_type::bvec(n), "flip");
>> +   body.emit(assign(flip, less(x, imm(0.0f, n))));
>> +   ir_variable *s = body.make_temp(type, "s");
>> +   body.emit(assign(s, csel(flip, abs(x), y)));
>> +   ir_variable *t = body.make_temp(type, "t");
>> +   body.emit(assign(t, csel(flip, y, abs(x))));
>>  
>> -      body.emit(assign(vec_result, r, 1 << i));
>> -   }
>> -   body.emit(ret(vec_result));
>> +   /* If the magnitude of the denominator exceeds some huge value, scale down
>> +    * the arguments in order to prevent the reciprocal operation from flushing
>> +    * its result to zero, which would cause precision problems, and for s
>> +    * infinite would cause us to return a NaN instead of the correct finite
>> +    * value.
>> +    */
>> +   ir_constant *huge = imm(1e37f, n);
>> +   ir_variable *scale = body.make_temp(type, "scale");
>> +   body.emit(assign(scale, csel(gequal(abs(t), huge),
>> +                                imm(0.0625f, n), imm(1.0f, n))));
>
> Out of curiosity, how did you arrive at 1/16 for the scale?
>

It's pretty arbitrary, the correct scale depends on the exponent range
and bias (and of course the radix of your floating point
representation).  Unless you have some really exotic floating point
format I think any negative power of two lower or equal to 0.25 should
work -- If you like I can tighten up the constant to the largest
acceptable value for usual IEEE-like representations.

> I wonder if r300-era GPUs would want a different huge-val and scale.
> I'm pretty sure r300 still only used 24-bit floats.  I think all the
> old NVIDIA and Intel GPUs used 32-bit... I don't think there are any
> Mesa drivers for other weird GPUs of that era.

Yeah, I think ATI's FP24 representation is likely to be okay with the
same scale, but the loss of precision will probably start earlier so
they'll want a lower huge value of the order of 1e18.  If my estimates
are right using 1e18 instead shouldn't significantly affect precision on
FP32 platforms -- I've changed this locally.

>> +   ir_variable *rcp_scaled_t = body.make_temp(type, "rcp_scaled_t");
>> +   body.emit(assign(rcp_scaled_t, rcp(mul(t, scale))));
>> +   ir_expression *s_over_t = mul(mul(s, scale), rcp_scaled_t);
>
> If I'm reading this right,
>
>     s_over_t = (s * scale) * rcp(t * scale);
>
> I want to make sure we have a specific test case that will fail if
> someone tries to add an algebraic optimization that tries to remove the
> '* scale'.

Yeah, that's the kind of algebraic simplification I could see somebody
may want to do in the future, but it would qualify as imprecise math.

>
>> +
>> +   /* Calculate the arctangent and fix up the result if we had flipped the
>> +    * coordinate system.
>> +    */
>> +   ir_variable *arc = body.make_temp(type, "arc");
>> +   do_atan(body, type, arc, abs(s_over_t));
>> +   body.emit(assign(arc, add(arc, mul(b2f(flip), imm(M_PI_2f)))));
>
> I'll spare the details of the deep rat hole I went down with algebraic
> optimizations related to things like  b ? x : x + y vs x + float(b) * y.
>  In this particular case, I think x + csel(flip, M_PI_2, 0) is
> marginally better because the csel can be scheduled early and flip
> killed.  Eh... but a csel with two immediate values usually results in
> dumb code.  I'm already heading back down the rat hole... never mind.
>

float(b) * y seems to emit one less instruction for me...  Shouldn't
make much of a difference anyway.

>> +
>> +   /* Rather convoluted calculation of the sign of the result.  When x < 0 we
>> +    * cannot use fsign because we need to be able to distinguish between
>> +    * negative and positive zero.  Unfortunately we cannot use bitwise
>> +    * arithmetic tricks either because of back-ends without integer support.
>> +    * When x >= 0 rcp_scaled_t will always be non-negative so this won't be
>> +    * able to distinguish between negative and positive zero, but we don't
>> +    * care because atan2 is continuous along the whole positive y = 0
>> +    * half-line, so it won't affect the result.
>> +    */
>> +   body.emit(ret(csel(less(min2(y, rcp_scaled_t), imm(0.0f, n)),
>> +                      neg(arc), arc)));
>
> FWIW, b ? -x : x vs. -float(b) * x was part of the previous rat hole.
>

They're not equivalent ;), for b=false the former gives you x while the
latter gives you 0.

>>  
>>     return sig;
>>  }
>>
On 01/25/2017 12:55 PM, Francisco Jerez wrote:
> Ian Romanick <idr@freedesktop.org> writes:
> 
>> It's a real bummer that we have two implementations of this function
>> that are basically written in assembly... I'm not sure what else you'd
>> call generating IR by hand.  The code review and maintenance costs are
>> of the same magnitude for sure.
>>
>> We could move this to GLSL and let the standalone compiler generate the
>> builder code.  I don't think that is currently helpful.  However, for
>> future "soft" int64 and fp64 work the standalone compiler will need to
>> be extended to also generate NIR builder.  Once that is done, I think
>> the cost-benefit analysis changes.
> 
> Yes, I agree...  It was a real PITA to have to fix two different
> implementations of atan2, we should come up with some way to share the
> lowering of these built-ins even if they're still written in assembly --
> It would still halve the amount of pain we inflict on ourselves.
> 
>> On 01/24/2017 03:26 PM, Francisco Jerez wrote:
>>> This addresses several issues of the current atan2 implementation:
>>>
>>>  - Negative zero (and negative denorms which end up getting flushed to
>>>    zero) isn't handled correctly by the current implementation.  The
>>>    reason is that it does 'y >= 0' and 'x < 0' comparisons to decide
>>>    on which side of the branch cut the argument is, which causes us to
>>>    return incorrect results (off by up to 2π) for very small negative
>>>    values.
>>>
>>>  - There is a serious precision problem for x values of large enough
>>>    magnitude introduced by the floating point division operation being
>>>    implemented as a mul+rcp sequence.  This can lead to the quotient
>>>    getting flushed to zero in some cases introducing an error of over
>>>    8e6 ULP in the result -- Or in the most catastrophic case will
>>>    cause us to return NaN instead of the correct value ±π/2 for y=±∞
>>>    and x very large.  We can fix this easily by scaling down both
>>>    arguments when the absolute value of the denominator goes above
>>>    certain threshold.  The error of this atan2 implementation remains
>>>    below 25 ULP in most of its domain except for a neighborhood of y=0
>>>    where it reaches a maximum error of about 180 ULP.
>>>
>>>  - It emits a bunch of instructions including no less than three
>>>    if-else branches per scalar component that don't seem to get
>>>    optimized out later on.  This implementation uses about 13% less
>>>    instructions on Intel SKL hardware and doesn't emit any control
>>>    flow instructions.
>>> ---
>>>  src/compiler/glsl/builtin_functions.cpp | 82 ++++++++++++++++++---------------
>>>  1 file changed, 46 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/src/compiler/glsl/builtin_functions.cpp b/src/compiler/glsl/builtin_functions.cpp
>>> index 4a6c5af..fd59381 100644
>>> --- a/src/compiler/glsl/builtin_functions.cpp
>>> +++ b/src/compiler/glsl/builtin_functions.cpp
>>> @@ -3560,44 +3560,54 @@ builtin_builder::_acos(const glsl_type *type)
>>>  ir_function_signature *
>>>  builtin_builder::_atan2(const glsl_type *type)
>>>  {
>>> -   ir_variable *vec_y = in_var(type, "vec_y");
>>> -   ir_variable *vec_x = in_var(type, "vec_x");
>>> -   MAKE_SIG(type, always_available, 2, vec_y, vec_x);
>>> -
>>> -   ir_variable *vec_result = body.make_temp(type, "vec_result");
>>> -   ir_variable *r = body.make_temp(glsl_type::float_type, "r");
>>> -   for (int i = 0; i < type->vector_elements; i++) {
>>> -      ir_variable *y = body.make_temp(glsl_type::float_type, "y");
>>> -      ir_variable *x = body.make_temp(glsl_type::float_type, "x");
>>> -      body.emit(assign(y, swizzle(vec_y, i, 1)));
>>> -      body.emit(assign(x, swizzle(vec_x, i, 1)));
>>> -
>>> -      /* If |x| >= 1.0e-8 * |y|: */
>>> -      ir_if *outer_if =
>>> -         new(mem_ctx) ir_if(greater(abs(x), mul(imm(1.0e-8f), abs(y))));
>>> -
>>> -      ir_factory outer_then(&outer_if->then_instructions, mem_ctx);
>>> -
>>> -      /* Then...call atan(y/x) */
>>> -      do_atan(outer_then, glsl_type::float_type, r, div(y, x));
>>> -
>>> -      /*     ...and fix it up: */
>>> -      ir_if *inner_if = new(mem_ctx) ir_if(less(x, imm(0.0f)));
>>> -      inner_if->then_instructions.push_tail(
>>> -         if_tree(gequal(y, imm(0.0f)),
>>> -                 assign(r, add(r, imm(M_PIf))),
>>> -                 assign(r, sub(r, imm(M_PIf)))));
>>> -      outer_then.emit(inner_if);
>>> -
>>> -      /* Else... */
>>> -      outer_if->else_instructions.push_tail(
>>> -         assign(r, mul(sign(y), imm(M_PI_2f))));
>>> +   const unsigned n = type->vector_elements;
>>> +   ir_variable *y = in_var(type, "y");
>>> +   ir_variable *x = in_var(type, "x");
>>> +   MAKE_SIG(type, always_available, 2, y, x);
>>>  
>>> -      body.emit(outer_if);
>>> +   /* If we're on the left half-plane rotate the coordinates π/2 clock-wise
>>> +    * for the y=0 discontinuity to end up aligned with the vertical
>>> +    * discontinuity of atan(s/t) along t=0.
>>> +    */
>>> +   ir_variable *flip = body.make_temp(glsl_type::bvec(n), "flip");
>>> +   body.emit(assign(flip, less(x, imm(0.0f, n))));
>>> +   ir_variable *s = body.make_temp(type, "s");
>>> +   body.emit(assign(s, csel(flip, abs(x), y)));
>>> +   ir_variable *t = body.make_temp(type, "t");
>>> +   body.emit(assign(t, csel(flip, y, abs(x))));
>>>  
>>> -      body.emit(assign(vec_result, r, 1 << i));
>>> -   }
>>> -   body.emit(ret(vec_result));
>>> +   /* If the magnitude of the denominator exceeds some huge value, scale down
>>> +    * the arguments in order to prevent the reciprocal operation from flushing
>>> +    * its result to zero, which would cause precision problems, and for s
>>> +    * infinite would cause us to return a NaN instead of the correct finite
>>> +    * value.
>>> +    */
>>> +   ir_constant *huge = imm(1e37f, n);
>>> +   ir_variable *scale = body.make_temp(type, "scale");
>>> +   body.emit(assign(scale, csel(gequal(abs(t), huge),
>>> +                                imm(0.0625f, n), imm(1.0f, n))));
>>
>> Out of curiosity, how did you arrive at 1/16 for the scale?
> 
> It's pretty arbitrary, the correct scale depends on the exponent range
> and bias (and of course the radix of your floating point
> representation).  Unless you have some really exotic floating point
> format I think any negative power of two lower or equal to 0.25 should
> work -- If you like I can tighten up the constant to the largest
> acceptable value for usual IEEE-like representations.

I don't have a strong opinion about picking a more optimal constant.
Perhaps some variation of this comment and the comment below would be
good to include in the code.  Otherwise the 0.0625 value seems
completely magic, and the next person will have no idea whether it's
safe to touch.

With some commentary added about the magic value, this patch is

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

>> I wonder if r300-era GPUs would want a different huge-val and scale.
>> I'm pretty sure r300 still only used 24-bit floats.  I think all the
>> old NVIDIA and Intel GPUs used 32-bit... I don't think there are any
>> Mesa drivers for other weird GPUs of that era.
> 
> Yeah, I think ATI's FP24 representation is likely to be okay with the
> same scale, but the loss of precision will probably start earlier so
> they'll want a lower huge value of the order of 1e18.  If my estimates
> are right using 1e18 instead shouldn't significantly affect precision on
> FP32 platforms -- I've changed this locally.
> 
>>> +   ir_variable *rcp_scaled_t = body.make_temp(type, "rcp_scaled_t");
>>> +   body.emit(assign(rcp_scaled_t, rcp(mul(t, scale))));
>>> +   ir_expression *s_over_t = mul(mul(s, scale), rcp_scaled_t);
>>
>> If I'm reading this right,
>>
>>     s_over_t = (s * scale) * rcp(t * scale);
>>
>> I want to make sure we have a specific test case that will fail if
>> someone tries to add an algebraic optimization that tries to remove the
>> '* scale'.
> 
> Yeah, that's the kind of algebraic simplification I could see somebody
> may want to do in the future, but it would qualify as imprecise math.
> 
>>> +
>>> +   /* Calculate the arctangent and fix up the result if we had flipped the
>>> +    * coordinate system.
>>> +    */
>>> +   ir_variable *arc = body.make_temp(type, "arc");
>>> +   do_atan(body, type, arc, abs(s_over_t));
>>> +   body.emit(assign(arc, add(arc, mul(b2f(flip), imm(M_PI_2f)))));
>>
>> I'll spare the details of the deep rat hole I went down with algebraic
>> optimizations related to things like  b ? x : x + y vs x + float(b) * y.
>>  In this particular case, I think x + csel(flip, M_PI_2, 0) is
>> marginally better because the csel can be scheduled early and flip
>> killed.  Eh... but a csel with two immediate values usually results in
>> dumb code.  I'm already heading back down the rat hole... never mind.
> 
> float(b) * y seems to emit one less instruction for me...  Shouldn't
> make much of a difference anyway.
> 
>>> +
>>> +   /* Rather convoluted calculation of the sign of the result.  When x < 0 we
>>> +    * cannot use fsign because we need to be able to distinguish between
>>> +    * negative and positive zero.  Unfortunately we cannot use bitwise
>>> +    * arithmetic tricks either because of back-ends without integer support.
>>> +    * When x >= 0 rcp_scaled_t will always be non-negative so this won't be
>>> +    * able to distinguish between negative and positive zero, but we don't
>>> +    * care because atan2 is continuous along the whole positive y = 0
>>> +    * half-line, so it won't affect the result.
>>> +    */
>>> +   body.emit(ret(csel(less(min2(y, rcp_scaled_t), imm(0.0f, n)),
>>> +                      neg(arc), arc)));
>>
>> FWIW, b ? -x : x vs. -float(b) * x was part of the previous rat hole.
>>
> 
> They're not equivalent ;), for b=false the former gives you x while the
> latter gives you 0.

Yes... there was a similar conditional neg thing that I encountered, but
now I don't recall.  Oh well.

>>>  
>>>     return sig;
>>>  }
>>>