for highest-performance GPUs transcendental functions are needed, these include (at the base level) trigonometric, power and hyperbolic functions: extended ones include multiplication by pi etc.
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002290.html Jacob Lifshay via lists.libre-riscv.org 7:17 PM (3 hours ago) to Libre-RISCV, Atif, Grant On Sat, Aug 3, 2019, 22:25 Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote: > got an idea, transcendentals (scalar) proposal, similar to Zfrsqrt, > need to find the space for sin, cos, atan, exp, pow, log, and so on. > on-list first, then isa-dev? > Sounds good to me. I think we should have our primitive instructions be correctly rounded, since, for all but sin/cos/tan/sec/cosec/cotan, that doesn't take much more precision. I think we should implement sinpi/cospi and friends since they avoid the need to have an extremely (several hundred bit) accurate version of pi. Note for Atif and Grant: I'm currently working on an algebraic numbers library that can be used to verify the fp implementations for add/sub/mul/div/sqrt/rsqrt/cbrt/hypot. https://salsa.debian.org/Kazan-team/algebraics Note that even though sinpi/cospi theoretically are algebraic numbers for rational inputs, the degree of the polynomials is prohibitive for large denominators. We should avoid the pitfall of intel's x87 sin/cos implementations: https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/ The functions I think are worth implementing in addition to F/D: trig-pi functions (range reduction is trivial (x mod 2.0)): * sinpi * cospi * sincospi (non-standard; like sincos) * atan2pi extended trig-pi functions (separate extension; sincospi/atan2pi is sufficient for graphics) * tanpi * asinpi * acospi non-*pi trig functions (in a separate extension since accurate range reduction is quite difficult, approximating using the *pi functions will work for graphics): * sin * cos * sincos * tan * atan2 * asin * acos powers: * cbrt * hypot (avoids overflow/underflow with extended exponent range for intermediates) * rsqrt (proposed in Zfrsqrt extension) general powers (as separate extension due to complexity; exp2/log2 plus checking for odd powers/roots is sufficient for graphics): * pow * root exp/log: * exp2 * log2 * expm1 (extra precision around 0) * logp1 (extra precision around 0) extended exp/log as separate extension (not needed for graphics since exp2/log2 is sufficient): * exp * log * exp10 * log10 hyperbolics as separate extension (not needed for graphics since exp2/log2 is sufficient): * acosh * asinh * atanh * cosh * sinh * tanh (may want to split out in separate extension since sometimes used for machine learning, however fmax(x,x*(1.0/256.0)) is a generally sufficient replacement transfer function) the erf/erfc/gamma/bessel/zeta/etc. functions can be left to software implementations. see also: https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps https://www.khronos.org/registry/spir-v/specs/unified1/GLSL.std.450.html https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation
I think that having sinpi, cospi, and tanpi in the base extension is better, since they are much simpler to implement. sin, cos, and tan require a very precise value of pi for converting the argument into the range [0,2*pi) and maintaining accuracy, hence why I had proposed putting them in the non-*pi trig extension. For consistency, all the rest of the *pi functions are in the base extension and the non-*pi are not in the base extension. The above is how I had (apparently non-obviously) had the instructions grouped into extensions in http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002290.html assuming I calculated properly, the precision required for pi for sin is over 1000 bits for f64, since the max value of f64 is slightly less than 2^1024 and we want the remainder to still have more than 53 bits of accuracy for proper rounding. f32 requires somewhere around 150 bits of pi. This is the opposite of how they are assigned in the current Ztrans draft. Jacob
(In reply to Jacob Lifshay from comment #2) > I think that having sinpi, cospi, and tanpi in the base extension is better, > since they are much simpler to implement. ok - can you make the changes, i am slightly confused also - interesting: found this: http://www.myhdl.org/docs/examples/sinecomp/
http://www.andraka.com/files/crdcsrvy.pdf cool paper!
https://www.quinapalus.com/efunc.html Example C code Here is a sample C function to compute exp() using the above algorithm. The code assumes integers are at least 32 bits long. The (non-negative) argument and the result of the function are both expressed as fixed-point values with 16 fractional bits. Notice that after 11 steps of the algorithm the constants involved become such that the code is simply doing a multiplication: this is explained in the note below. The extension to negative arguments is left as an exercise. int fxexp(int x) { int t,y; y=0x00010000; t=x-0x58b91; if(t>=0) x=t,y<<=8; t=x-0x2c5c8; if(t>=0) x=t,y<<=4; t=x-0x162e4; if(t>=0) x=t,y<<=2; t=x-0x0b172; if(t>=0) x=t,y<<=1; t=x-0x067cd; if(t>=0) x=t,y+=y>>1; t=x-0x03920; if(t>=0) x=t,y+=y>>2; t=x-0x01e27; if(t>=0) x=t,y+=y>>3; t=x-0x00f85; if(t>=0) x=t,y+=y>>4; t=x-0x007e1; if(t>=0) x=t,y+=y>>5; t=x-0x003f8; if(t>=0) x=t,y+=y>>6; t=x-0x001fe; if(t>=0) x=t,y+=y>>7; if(x&0x100) y+=y>>8; if(x&0x080) y+=y>>9; if(x&0x040) y+=y>>10; if(x&0x020) y+=y>>11; if(x&0x010) y+=y>>12; if(x&0x008) y+=y>>13; if(x&0x004) y+=y>>14; if(x&0x002) y+=y>>15; if(x&0x001) y+=y>>16; return y; } Note that the constants involved in the trial subtractions reduce by a factor of less than or equal to 2 at each step. This means that it is never necessary to test the same constant twice. A note on the residual error The error in the final answer depends on the residual value in x; in fact, the relative error is exp(x). Since for small x, exp(x) is approximately 1+x, it is possible to correct the final answer by multiplying it by 1+x. Example C code Here is a sample C function to compute log() using the above algorithm. The code assumes integers are at least 32 bits long. The (positive) argument and the result of the function are both expressed as fixed-point values with 16 fractional bits, although intermediates are kept with 31 bits of precision to avoid loss of accuracy during shifts. After 12 steps of the algorithm the correction described above is applied. int fxlog(int x) { int t,y; y=0xa65af; if(x<0x00008000) x<<=16, y-=0xb1721; if(x<0x00800000) x<<= 8, y-=0x58b91; if(x<0x08000000) x<<= 4, y-=0x2c5c8; if(x<0x20000000) x<<= 2, y-=0x162e4; if(x<0x40000000) x<<= 1, y-=0x0b172; t=x+(x>>1); if((t&0x80000000)==0) x=t,y-=0x067cd; t=x+(x>>2); if((t&0x80000000)==0) x=t,y-=0x03920; t=x+(x>>3); if((t&0x80000000)==0) x=t,y-=0x01e27; t=x+(x>>4); if((t&0x80000000)==0) x=t,y-=0x00f85; t=x+(x>>5); if((t&0x80000000)==0) x=t,y-=0x007e1; t=x+(x>>6); if((t&0x80000000)==0) x=t,y-=0x003f8; t=x+(x>>7); if((t&0x80000000)==0) x=t,y-=0x001fe; x=0x80000000-x; y-=x>>15; return y; }
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/8knne5BtlvM/QtdcVRSFFQAJ
proposal for splitting trig instructions out into 4 separate extensions: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html
(In reply to Jacob Lifshay from comment #7) > proposal for splitting trig instructions out into 4 separate extensions: > http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html got it - please do review. i also added Zftrigh, it seemed to make sense. https://libre-riscv.org/ztrans_proposal/
(In reply to Luke Kenneth Casson Leighton from comment #8) > (In reply to Jacob Lifshay from comment #7) > > proposal for splitting trig instructions out into 4 separate extensions: > > http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002329.html > > got it - please do review. i also added Zftrigh, it seemed to make sense. > https://libre-riscv.org/ztrans_proposal/ I would name it Zfhyp or Zfhyper since they're usually called hyperbolic functions, not trigonometric hyperbolic functions. Other than that, looks good to me.
(In reply to Jacob Lifshay from comment #9) > I would name it Zfhyp or Zfhyper since they're usually called hyperbolic > functions, not trigonometric hyperbolic functions. > > Other than that, looks good to me. done. what about hypot? https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html "Result Type, x and y must be floating-point or vector(2,3,4,8,16) of floating-point values." do they mean x must be a vector and y must be a vector, and the result is: map(hypot, zip(x, y)) or do they mean... no it's definitely not sqrt(x^2+y^2+z^2+w^2) is it. hypot would be another separate funct5, have to be careful not to add too many of those.
(In reply to Luke Kenneth Casson Leighton from comment #10) > done. what about hypot? > https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL. > ExtendedInstructionSet.100.html > > "Result Type, x and y must be floating-point or vector(2,3,4,8,16) of > floating-point values." > > do they mean x must be a vector and y must be a vector, and the result is: > map(hypot, zip(x, y)) yup. > hypot would be another separate funct5, have to be careful not to add too > many of those.
https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm Since: ln(a) = 2Tanh-1( (a-1) / (a+1) The function ln(α) is obtained by multiplying by 2 the final result ZN. (Equation (4)), provided that Z0=0, X0= a+1, and Y0= a-1.
so CORDIC looks like it is the highest bang-per-buck algorithm that will cover a large range of these. initial thoughts: it's very similar to what's in div_core.py however on second thoughts, CORDIC doesn't really lend itself to higher RADIX so would need a separate pipeline anyway. or... it does... https://www.researchgate.net/publication/256970111_VLSI_architecture_for_parallel_radix-4_CORDIC/link/59ef33d7458515ec0c7b51a8/download but *sigh* as usual it'll require "translation" of the paper... or contating the authors.
decision on whether to have separate ATAN/ATAN2 needed: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html
(In reply to Luke Kenneth Casson Leighton from comment #14) > decision on whether to have separate ATAN/ATAN2 needed: > http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html see also http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002478.html
found a pair of algorithms that look like they'd be quite useful for implementing trig/log/exp/hypot/hyperbolics/sqrt/rsqrt/divide: http://perso.ens-lyon.fr/jean-michel.muller/BKM94.pdf they are shift-add algorithms, so should be rather simple to implement. they use a redundant number representation, so avoid needing full carry chains before the last stage, allowing a reduction in power and area used.
(In reply to Jacob Lifshay from comment #16) > found a pair of algorithms that look like they'd be quite useful for > implementing trig/log/exp/hypot/hyperbolics/sqrt/rsqrt/divide: > > http://perso.ens-lyon.fr/jean-michel.muller/BKM94.pdf > > they are shift-add algorithms, so should be rather simple to implement. > > they use a redundant number representation, so avoid needing full carry > chains before the last stage, allowing a reduction in power and area used. excellent. there's a wikipedia page for the algorithm: https://en.wikipedia.org/wiki/BKM_algorithm which gives a lot more references.
this bugreport resulted in https://libre-riscv.org/ztrans_proposal/ being written up, so can be declared closed. however additional research and dependencies should still be added.