Bug 127 - Transcendentals needed (SIN/COS/ATAN2/EXP/LOG/POW etc.)
Summary: Transcendentals needed (SIN/COS/ATAN2/EXP/LOG/POW etc.)
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on: 208
Blocks: 53 48 213
  Show dependency treegraph
 
Reported: 2019-08-04 22:45 BST by Luke Kenneth Casson Leighton
Modified: 2022-09-08 13:01 BST (History)
2 users (show)

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 1250
budget (EUR) for this task, excluding subtasks' budget: 1250
parent task for budget allocation: 53
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
programmerjake={amount=350,paid=2020-09-22} lkcl={amount=900,paid=2020-03-12}


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2019-08-04 22:45:56 BST
for highest-performance GPUs transcendental functions are needed,
these include (at the base level) trigonometric, power and hyperbolic
functions: extended ones include multiplication by pi etc.
Comment 1 Luke Kenneth Casson Leighton 2019-08-04 22:46:38 BST
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002290.html

Jacob Lifshay via lists.libre-riscv.org 
7:17 PM (3 hours ago)
to Libre-RISCV, Atif, Grant

On Sat, Aug 3, 2019, 22:25 Luke Kenneth Casson Leighton <lkcl@lkcl.net>
wrote:

> got an idea, transcendentals (scalar) proposal, similar to Zfrsqrt,
> need to find the space for sin, cos, atan, exp, pow, log, and so on.
> on-list first, then isa-dev?
>
Sounds good to me.

I think we should have our primitive instructions be correctly rounded,
since, for all but sin/cos/tan/sec/cosec/cotan, that doesn't take much more
precision. I think we should implement sinpi/cospi and friends since they
avoid the need to have an extremely (several hundred bit) accurate version
of pi.

Note for Atif and Grant: I'm currently working on an algebraic numbers
library that can be used to verify the fp implementations for
add/sub/mul/div/sqrt/rsqrt/cbrt/hypot.
https://salsa.debian.org/Kazan-team/algebraics

Note that even though sinpi/cospi theoretically are algebraic numbers for
rational inputs, the degree of the polynomials is prohibitive for large
denominators.

We should avoid the pitfall of intel's x87 sin/cos implementations:
https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/

The functions I think are worth implementing in addition to F/D:

trig-pi functions (range reduction is trivial (x mod 2.0)):
* sinpi
* cospi
* sincospi (non-standard; like sincos)
* atan2pi

extended trig-pi functions (separate extension; sincospi/atan2pi is
sufficient for graphics)
* tanpi
* asinpi
* acospi

non-*pi trig functions (in a separate extension since accurate range
reduction is quite difficult, approximating using the *pi functions will
work for graphics):
* sin
* cos
* sincos
* tan
* atan2
* asin
* acos

powers:
* cbrt
* hypot (avoids overflow/underflow with extended exponent range for
intermediates)
* rsqrt (proposed in Zfrsqrt extension)

general powers (as separate extension due to complexity; exp2/log2 plus
checking for odd powers/roots is sufficient for graphics):
* pow
* root

exp/log:
* exp2
* log2
* expm1 (extra precision around 0)
* logp1 (extra precision around 0)

extended exp/log as separate extension (not needed for graphics since
exp2/log2 is sufficient):
* exp
* log
* exp10
* log10

hyperbolics as separate extension (not needed for graphics since exp2/log2
is sufficient):
* acosh
* asinh
* atanh
* cosh
* sinh
* tanh (may want to split out in separate extension since sometimes used
for machine learning, however fmax(x,x*(1.0/256.0)) is a generally
sufficient replacement transfer function)

the erf/erfc/gamma/bessel/zeta/etc. functions can be left to software
implementations.

see also:
https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps
https://www.khronos.org/registry/spir-v/specs/unified1/GLSL.std.450.html
https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation
Comment 2 Jacob Lifshay 2019-08-05 09:20:59 BST
I think that having sinpi, cospi, and tanpi in the base extension is better, since they are much simpler to implement. sin, cos, and tan require a very precise value of pi for converting the argument into the range [0,2*pi) and maintaining accuracy, hence why I had proposed putting them in the non-*pi trig extension. For consistency, all the rest of the *pi functions are in the base extension and the non-*pi are not in the base extension.

The above is how I had (apparently non-obviously) had the instructions grouped into extensions in http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002290.html

assuming I calculated properly, the precision required for pi for sin is over 1000 bits for f64, since the max value of f64 is slightly less than 2^1024 and we want the remainder to still have more than 53 bits of accuracy for proper rounding. f32 requires somewhere around 150 bits of pi.

This is the opposite of how they are assigned in the current Ztrans draft.

Jacob
Comment 3 Luke Kenneth Casson Leighton 2019-08-05 11:03:01 BST
(In reply to Jacob Lifshay from comment #2)
> I think that having sinpi, cospi, and tanpi in the base extension is better,
> since they are much simpler to implement. 

ok - can you make the changes, i am slightly confused

also - interesting: found this:
http://www.myhdl.org/docs/examples/sinecomp/
Comment 4 Luke Kenneth Casson Leighton 2019-08-05 12:18:42 BST
http://www.andraka.com/files/crdcsrvy.pdf

cool paper!
Comment 5 Luke Kenneth Casson Leighton 2019-08-05 12:48:02 BST
https://www.quinapalus.com/efunc.html

Example C code

Here is a sample C function to compute exp() using the above algorithm. The code assumes integers are at least 32 bits long. The (non-negative) argument and the result of the function are both expressed as fixed-point values with 16 fractional bits. Notice that after 11 steps of the algorithm the constants involved become such that the code is simply doing a multiplication: this is explained in the note below.

The extension to negative arguments is left as an exercise.

  int fxexp(int x) {
  int t,y;

  y=0x00010000;
  t=x-0x58b91; if(t>=0) x=t,y<<=8;
  t=x-0x2c5c8; if(t>=0) x=t,y<<=4;
  t=x-0x162e4; if(t>=0) x=t,y<<=2;
  t=x-0x0b172; if(t>=0) x=t,y<<=1;
  t=x-0x067cd; if(t>=0) x=t,y+=y>>1;
  t=x-0x03920; if(t>=0) x=t,y+=y>>2;
  t=x-0x01e27; if(t>=0) x=t,y+=y>>3;
  t=x-0x00f85; if(t>=0) x=t,y+=y>>4;
  t=x-0x007e1; if(t>=0) x=t,y+=y>>5;
  t=x-0x003f8; if(t>=0) x=t,y+=y>>6;
  t=x-0x001fe; if(t>=0) x=t,y+=y>>7;
  if(x&0x100)               y+=y>>8;
  if(x&0x080)               y+=y>>9;
  if(x&0x040)               y+=y>>10;
  if(x&0x020)               y+=y>>11;
  if(x&0x010)               y+=y>>12;
  if(x&0x008)               y+=y>>13;
  if(x&0x004)               y+=y>>14;
  if(x&0x002)               y+=y>>15;
  if(x&0x001)               y+=y>>16;
  return y;
  }

Note that the constants involved in the trial subtractions reduce by a factor of less than or equal to 2 at each step. This means that it is never necessary to test the same constant twice.
A note on the residual error

The error in the final answer depends on the residual value in x; in fact, the relative error is exp(x). Since for small x, exp(x) is approximately 1+x, it is possible to correct the final answer by multiplying it by 1+x. 

Example C code

Here is a sample C function to compute log() using the above algorithm. The code assumes integers are at least 32 bits long. The (positive) argument and the result of the function are both expressed as fixed-point values with 16 fractional bits, although intermediates are kept with 31 bits of precision to avoid loss of accuracy during shifts. After 12 steps of the algorithm the correction described above is applied.

  int fxlog(int x) {
  int t,y;

  y=0xa65af;
  if(x<0x00008000) x<<=16,              y-=0xb1721;
  if(x<0x00800000) x<<= 8,              y-=0x58b91;
  if(x<0x08000000) x<<= 4,              y-=0x2c5c8;
  if(x<0x20000000) x<<= 2,              y-=0x162e4;
  if(x<0x40000000) x<<= 1,              y-=0x0b172;
  t=x+(x>>1); if((t&0x80000000)==0) x=t,y-=0x067cd;
  t=x+(x>>2); if((t&0x80000000)==0) x=t,y-=0x03920;
  t=x+(x>>3); if((t&0x80000000)==0) x=t,y-=0x01e27;
  t=x+(x>>4); if((t&0x80000000)==0) x=t,y-=0x00f85;
  t=x+(x>>5); if((t&0x80000000)==0) x=t,y-=0x007e1;
  t=x+(x>>6); if((t&0x80000000)==0) x=t,y-=0x003f8;
  t=x+(x>>7); if((t&0x80000000)==0) x=t,y-=0x001fe;
  x=0x80000000-x;
  y-=x>>15;
  return y;
  }
Comment 7 Jacob Lifshay 2019-08-06 21:47:29 BST
proposal for splitting trig instructions out into 4 separate extensions: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html
Comment 8 Luke Kenneth Casson Leighton 2019-08-06 23:46:38 BST
(In reply to Jacob Lifshay from comment #7)
> proposal for splitting trig instructions out into 4 separate extensions:
> http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html

got it - please do review.  i also added Zftrigh, it seemed to make sense.
https://libre-riscv.org/ztrans_proposal/
Comment 9 Jacob Lifshay 2019-08-06 23:56:05 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #7)
> > proposal for splitting trig instructions out into 4 separate extensions:
> > http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html
> 
> got it - please do review.  i also added Zftrigh, it seemed to make sense.
> https://libre-riscv.org/ztrans_proposal/

I would name it Zfhyp or Zfhyper since they're usually called hyperbolic functions, not trigonometric hyperbolic functions.

Other than that, looks good to me.
Comment 10 Luke Kenneth Casson Leighton 2019-08-07 00:09:24 BST
(In reply to Jacob Lifshay from comment #9)

> I would name it Zfhyp or Zfhyper since they're usually called hyperbolic
> functions, not trigonometric hyperbolic functions.
> 
> Other than that, looks good to me.

done.  what about hypot?  https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

"Result Type, x and y must be floating-point or vector(2,3,4,8,16) of floating-point values."

do they mean x must be a vector and y must be a vector, and the result is:
map(hypot, zip(x, y))

or do they mean... no it's definitely not sqrt(x^2+y^2+z^2+w^2) is it.

hypot would be another separate funct5, have to be careful not to add too
many of those.
Comment 11 Jacob Lifshay 2019-08-07 00:11:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> done.  what about hypot? 
> https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.
> ExtendedInstructionSet.100.html
> 
> "Result Type, x and y must be floating-point or vector(2,3,4,8,16) of
> floating-point values."
> 
> do they mean x must be a vector and y must be a vector, and the result is:
> map(hypot, zip(x, y))
yup.

> hypot would be another separate funct5, have to be careful not to add too
> many of those.
Comment 12 Luke Kenneth Casson Leighton 2019-08-08 01:35:04 BST
https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

Since: ln(a) = 2Tanh-1( (a-1) / (a+1)
 
The function ln(α) is obtained by multiplying by 2 the final result
ZN. (Equation (4)), provided that Z0=0, X0= a+1, and Y0= a-1.
Comment 13 Luke Kenneth Casson Leighton 2019-08-10 09:07:52 BST
so CORDIC looks like it is the highest bang-per-buck algorithm that
will cover a large range of these.

initial thoughts: it's very similar to what's in div_core.py
however on second thoughts, CORDIC doesn't really lend itself
to higher RADIX so would need a separate pipeline anyway.
or... it does...
https://www.researchgate.net/publication/256970111_VLSI_architecture_for_parallel_radix-4_CORDIC/link/59ef33d7458515ec0c7b51a8/download

but *sigh* as usual it'll require "translation" of the paper...
or contating the authors.
Comment 14 Luke Kenneth Casson Leighton 2019-08-13 05:36:16 BST
decision on whether to have separate ATAN/ATAN2 needed:
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html
Comment 15 Jacob Lifshay 2019-08-13 05:40:03 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> decision on whether to have separate ATAN/ATAN2 needed:
> http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html

see also http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002478.html
Comment 16 Jacob Lifshay 2019-08-19 21:10:14 BST
found a pair of algorithms that look like they'd be quite useful for implementing trig/log/exp/hypot/hyperbolics/sqrt/rsqrt/divide:

http://perso.ens-lyon.fr/jean-michel.muller/BKM94.pdf

they are shift-add algorithms, so should be rather simple to implement.

they use a redundant number representation, so avoid needing full carry chains before the last stage, allowing a reduction in power and area used.
Comment 17 Luke Kenneth Casson Leighton 2019-08-20 04:58:59 BST
(In reply to Jacob Lifshay from comment #16)
> found a pair of algorithms that look like they'd be quite useful for
> implementing trig/log/exp/hypot/hyperbolics/sqrt/rsqrt/divide:
> 
> http://perso.ens-lyon.fr/jean-michel.muller/BKM94.pdf
> 
> they are shift-add algorithms, so should be rather simple to implement.
> 
> they use a redundant number representation, so avoid needing full carry
> chains before the last stage, allowing a reduction in power and area used.

excellent.  there's a wikipedia page for the algorithm:
https://en.wikipedia.org/wiki/BKM_algorithm

which gives a lot more references.
Comment 18 Luke Kenneth Casson Leighton 2020-03-04 18:44:14 GMT
this bugreport resulted in https://libre-riscv.org/ztrans_proposal/ being written up, so can be declared closed. however additional research and dependencies should still be added.