Bug 44 - IEEE754 FPU inverse (reciprocal) sqrt
Summary: IEEE754 FPU inverse (reciprocal) sqrt
Status: PAYMENTPENDING FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU)
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on: 110
Blocks: 48
Reported: 2019-03-09 05:45 GMT by Luke Kenneth Casson Leighton
Modified: 2020-09-21 02:52 BST (History)
2 users

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 1500
budget (EUR) for this task, excluding subtasks' budget: 1500
parent task for budget allocation: 48
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
# guessed date for lkcl
lkcl={amount=600,paid=2019-08-10}
programmerjake={amount=900,paid=2019-08-16}


Attachments
4-2 (and 3-2?) carry-save adder (53.31 KB, image/png)
2019-05-18 10:55 BST, Luke Kenneth Casson Leighton
6 bit ripple add (7.77 KB, image/gif)
2019-05-18 20:39 BST, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2019-03-09 05:45:04 GMT
for 32-bit, to use "wtf" algorithm:
https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code

16-bit and 64-bit, requires thought.
Comment 1 Jacob Lifshay 2019-04-29 08:55:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> for 32-bit, to use "wtf" algorithm:
> https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code
> 
> 16-bit and 64-bit, requires thought.

The "wtf" algorithm is not well suited for our needs, since Vulkan requires a much more accurate inv-sqrt operation than that algorithm provides, it requires several floating-point mul-adds, and adding more iterations of newton's method to fix the accuracy problems will increase the latency well beyond what a radix-4 digit-recurrence algorithm needs due to needing several passes through a multiplier.

One paper I found that may be useful:
A Hardware Algorithm for Computing Reciprocal Square Root:
https://scholar.google.com/scholar?cluster=4957662328282312533&hl=en&as_sdt=0,48
http://www.acsel-lab.com/arithmetic/arith15/papers/ARITH15_Takagi.pdf
Comment 2 Jacob Lifshay 2019-04-29 09:15:19 BST
(In reply to Jacob Lifshay from comment #1)
> The "wtf" algorithm is not well suited for our needs, since Vulkan requires
> a much more accurate inv-sqrt operation than that algorithm provides
Vulkan requires an accuracy of 2 ULP (Units in the Last Place): https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation (search for "inversesqrt")
Comment 3 Luke Kenneth Casson Leighton 2019-04-29 10:46:01 BST
(In reply to Jacob Lifshay from comment #1)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > for 32-bit, to use "wtf" algorithm:
> > https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code
> > 
> > 16-bit and 64-bit, requires thought.
> 
> The "wtf" algorithm is not well suited for our needs, since Vulkan requires
> a much more accurate inv-sqrt operation than that algorithm provides, it
> requires several floating-point mul-adds, and adding more iterations of
> newton's method to fix the accuracy problems will increase the latency well
> beyond what a radix-4 digit-recurrence algorithm needs due to needing
> several passes through a multiplier.

 drat.

> One paper I found that may be useful:
> A Hardware Algorithm for Computing Reciprocal Square Root:
> https://scholar.google.com/
> scholar?cluster=4957662328282312533&hl=en&as_sdt=0,48
> http://www.acsel-lab.com/arithmetic/arith15/papers/ARITH15_Takagi.pdf

 ok good find.  it's a little hard to interpret, though i do like how it has
 radix-2 and radix-4 implementations: radix-2 multiply is basically
 "multiply by -1, 0 or 1", and radix-4 adds only "-2 and +2"
 to that, which is a bit-shift.
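A small Python sketch of that digit-multiply trick (illustrative only, not code from the repository): a redundant radix-4 digit in {-2, -1, 0, +1, +2} multiplies via a 1-bit shift, a negate and a mux, with no general multiplier.

def mul_by_digit(x, d):
    # multiply x by a signed radix-4 digit d in {-2, -1, 0, 1, 2}
    # abs(d) == 2 is a 1-bit left shift; the sign is just a negate
    assert d in (-2, -1, 0, 1, 2)
    m = (x << 1) if abs(d) == 2 else (x if d != 0 else 0)
    return -m if d < 0 else m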
Comment 4 Luke Kenneth Casson Leighton 2019-05-01 05:18:06 BST
different approach:

https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf

https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala

both these, as best i can tell, use a polynomial lookup table which requires
only 1 iteration of newton-raphson to give full accuracy.
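For reference, a minimal Python sketch of the single Newton-Raphson refinement step such designs rely on (the polynomial table lookup that produces the initial guess is assumed to exist; the function name is hypothetical):

def rsqrt_nr_step(a, x):
    # one Newton-Raphson step for f(x) = 1/x**2 - a, refining an
    # initial estimate x of 1/sqrt(a) (e.g. from a lookup table);
    # costs three multiplies and one subtract
    return x * (1.5 - 0.5 * a * x * x)

e.g. for a = 2.0 and a table guess x = 0.7, one step gives 0.707 against the true 0.70710..., which is why a good enough table needs only one iteration.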
Comment 5 Luke Kenneth Casson Leighton 2019-05-01 05:24:29 BST
for lolz... https://people.ece.cornell.edu/land/courses/ece5760/StudentWork/mje56/FPmath.v
Comment 6 Jacob Lifshay 2019-05-01 05:46:06 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> different approach:
> 
> https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf
there was a server error when I tried to access this

> https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala

as far as I can tell, this is only a lookup table.

> both these, as best i can tell, use a polynomial lookup table which requires
> only 1 iteration of newton-raphson to give full accuracy.

I think it would be wise to avoid newton-raphson since it needs several multiplications and will occupy the multiplier for quite a few cycles.
Comment 7 Luke Kenneth Casson Leighton 2019-05-01 06:40:57 BST
(In reply to Jacob Lifshay from comment #6)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > different approach:
> > 
> > https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf
> there was a server error when I tried to access this
> 
> > https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala
> 
> as far as I can tell, this is only a lookup table.
> 
> > both these, as best i can tell, use a polynomial lookup table which requires
> > only 1 iteration of newton-raphson to give full accuracy.
> 
> I think it would be wise to avoid newton-raphson since it needs several
> multiplications and will occupy the multiplier for quite a few cycles.

The fixed rsqrt algorithm i saw was able to do the mult in a single cycle, however that may have been for a smaller mantissa.

The algorithm you found - I really like it, only the description is really obtuse. Assuming we can do a radix-4 version, 2 of those can be combinatorially chained to give 8 bits per cycle, and for FP32 that gives only around a 5-stage pipeline with basic single-shift, add and MUX blocks.

What is the sort of performance actually needed? Is this critical for certain circumstances?
Comment 8 Jacob Lifshay 2019-05-02 09:34:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> What is the sort of performance actually needed? Is this critical for
> certain circumstances?
Some common lighting algorithms use per-pixel interpolated surface normal vectors; these need to be normalized before they can be used in lighting calculations, requiring a 3-element dot product, an inverse-sqrt, and a 1x3 scalar-vector multiplication.

So, we should be able to support at least 1 inverse-sqrt per pixel or so (with possibly somewhat reduced framerate). Note that this is in addition to the depth division that is required per-pixel for basic 3D, so budget accordingly.
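That per-pixel sequence, sketched in plain Python for illustration:

import math

def normalize(v):
    # per-pixel normal normalization: one 3-element dot product,
    # one inverse-sqrt (the operation this bug implements), and
    # one scalar-vector multiply
    x, y, z = v
    d = x * x + y * y + z * z
    inv_len = 1.0 / math.sqrt(d)
    return (x * inv_len, y * inv_len, z * inv_len)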
Comment 9 Luke Kenneth Casson Leighton 2019-05-09 15:36:42 BST
the Takagi paper is extremely obtuse, and assumes access to IEEE papers
is guaranteed.  i found an implementation of "On the Fly Conversion":
https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/ep08_21.pdf

this paper does not require access to paywalls.  page 36.
Comment 10 Jacob Lifshay 2019-05-09 18:52:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
> the Takagi paper is extremely obtuse, and assumes access to IEEE papers
> is guaranteed.  i found an implementation of "On the Fly Conversion":
> https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/
> ep08_21.pdf
> 
> this paper does not require access to paywalls.  page 36.

As far as I can tell, this paper doesn't actually have a rsqrt or sqrt algorithm.
Comment 11 Luke Kenneth Casson Leighton 2019-05-10 03:30:00 BST
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > the Takagi paper is extremely obtuse, and assumes access to IEEE papers
> > is guaranteed.  i found an implementation of "On the Fly Conversion":
> > https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/
> > ep08_21.pdf
> > 
> > this paper does not require access to paywalls.  page 36.
> 
> As far as I can tell, this paper doesn't actually have a rsqrt or sqrt
> algorithm.

it doesn't... what it *does* have is implementations of the primitives
which are *assumed* that the reader has access to or knows about (from
paywalled literature).

OTFC - on-the-fly-conversion - is available according to the Takagi paper -
in a paywalled journal.

searching "OTFC" is near-impossible to find, it is very obscure... the
only paper i could find that actually describes it has been *removed* from
the original location (etd.dtu.dk - the entire server is down).

likewise, it contains implementations of Carry-Save Adders (p122) which *again*
the Takagi paper *ASSUMES* that the reader is either familiar with or has
access to (paywalled) journals containing that information.

without these algorithms - OTFC and CSA - the Takagi paper is effectively
useless, unfortunately.

frustratingly, they're incredibly simple blocks.  the notes in the Takagi
paper state that OTFC may be implemented with a couple of MUXes and a shifter.
Comment 12 Luke Kenneth Casson Leighton 2019-05-10 03:36:12 BST
http://verilogcodes.blogspot.com/2017/11/verilog-code-for-carry-save-adder-with-testbench.html

module CSA
        (   input [3:0] x,y,z,
            output [4:0] s,
            output cout
            );
            
wire [3:0] c1,s1,c2;

fulladder fa_inst10(x[0],y[0],z[0],s1[0],c1[0]);
fulladder fa_inst11(x[1],y[1],z[1],s1[1],c1[1]);
fulladder fa_inst12(x[2],y[2],z[2],s1[2],c1[2]);
fulladder fa_inst13(x[3],y[3],z[3],s1[3],c1[3]); 

fulladder fa_inst20(s1[1],c1[0],1'b0,s[1],c2[1]);
fulladder fa_inst21(s1[2],c1[1],c2[1],s[2],c2[2]);
fulladder fa_inst22(s1[3],c1[2],c2[2],s[3],c2[3]);
fulladder fa_inst23(1'b0,c1[3],c2[3],s[4],cout); 

assign s[0] = s1[0];

endmodule

// the full adder used above; not defined in the original blog post -
// this is the standard implementation, with ports matching the
// positional instantiations (a, b, cin, sum, carry-out)
module fulladder(input a, b, cin, output s, cout);
assign s = a ^ b ^ cin;
assign cout = (a & b) | (b & cin) | (a & cin);
endmodule

module tb_adder;

    reg [3:0] x,y,z;
    wire [4:0] s;
    wire cout;  
    integer i,j,k,error;

    // Instantiate the Unit Under Test (UUT)
    CSA uut (x,y,z,s,cout);

//Stimulus block - all the input combinations are tested here.
//the number of errors are recorded in the signal named "error".
    initial begin
        // Initialize Inputs
        x = 0;
        y = 0;
        z = 0;
        error = 0;
        //three for loops to test all input combinations.
      for(i=0;i<16;i=i+1) begin
            for(j=0;j<16;j=j+1) begin
                for(k=0;k<16;k=k+1) begin
                     x = i;
                     y = j;
                     z = k;
                     #10;
                     if({cout,s} != (i+j+k)) 
                          error <= error + 1;
                end       
            end  
      end
    end 
    
endmodule
Comment 13 Luke Kenneth Casson Leighton 2019-05-10 03:37:33 BST
https://codestall.org/2017/06/26/carry-save-adder/

module csad(
output [7:0]s,
output co,
input [7:0]a,b,
input ci);
wire c1,c2,c3;
wire [3:0]s1,s2;
rcad u1(s[3:0], c1, a[3:0], b[3:0], ci);
rcad u2(s1, c2, a[7:4], b[7:4], 1'b0);
rcad u3(s2, c3, a[7:4], b[7:4], 1'b1);
mux4b_2_1 u4(s[7:4], s1, s2, c1);
mux2_1 u5(co, c2, c3, c1);
endmodule

// ripple-carry adder used above; not defined in the original post -
// a behavioural equivalent is assumed here
module rcad(
output [3:0]s,
output co,
input [3:0]a,b,
input ci);
assign {co, s} = a + b + ci;
endmodule

module mux4b_2_1(
output [3:0]y,
input [3:0]a,b,
input c);
assign y = c ? b : a;
endmodule

module mux2_1(
output y,
input a, b, c);
assign y = c ? b : a;
endmodule
Comment 14 Luke Kenneth Casson Leighton 2019-05-11 10:22:14 BST
looking at the diagram from ep08-21.pdf on page 36 (figure 3.15)

* q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1
  (top of p35), in binary -1 = 0b10 [edit: 0b11??], 0 = 0b00, 1 = 0b01

* |q[j+1]| probably means "absolute value" of q[j+1], in other words
  when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1) [edit: 0b11??],
  return 0b01 (+1).

* r is the "radix", confirmed (inferred) from the top paragraph of page 33

* the key equations to produce Q(next) and QM(next) are given in
  (3.21) and (3.22):

  if q[j+1] >= 0:
      Q(next) = Q << 1 | q[j+1]
  else:
      Q(next) = QM << 1 | (2-abs(q[j+1]))
  if q[j+1] > 0:
      QM(next) = QM << 1 | (q[j+1]-1)
  else:
      QM(next) = QM << 1 | (1-abs(q[j+1]))

those sums, "q[j+1]-1" "2-abs(q[j+1])" and "1-abs(q[j+1])" are in 2-bit
signed binary, so should be trivial to produce logic for.
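The recurrences can be modelled directly in Python as a sanity check (a sketch against the equations as transcribed above, not code from the repository):

def otfc_radix2(digits):
    # on-the-fly conversion, radix 2: convert a signed-digit quotient
    # (digits in {-1, 0, +1}, most significant first) to plain binary
    # per (3.21)/(3.22); QM tracks Q minus 1 ulp at every step
    q, qm = 0, 0
    for d in digits:
        nq = (q << 1) | d if d >= 0 else (qm << 1) | (2 - abs(d))
        nqm = (q << 1) | (d - 1) if d > 0 else (qm << 1) | (1 - abs(d))
        q, qm = nq, nqm
    return q

for example, the digit string [1, -1, 1] encodes 4 - 2 + 1 = 3, and otfc_radix2([1, -1, 1]) returns 3 without ever performing a carry-propagating subtraction.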
Comment 15 Jacob Lifshay 2019-05-11 10:28:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> looking at the diagram from ep08-21.pdf on page 36 (figure 3.15)
> 
> * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1
>   (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01
shouldn't that be -1 = 0b11 assuming it's in 2s complement?
> 
> * |q[j+1]| probably means "absolute value" of q[j+1], in other words
>   when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1),
>   return 0b10 (+1).
I think you meant 0b11 (-1), return 0b01 (+1). 
Even if they're not 2s complement, 1 and -1 shouldn't be identical.

I could be wrong, since I didn't actually check the paper.
Comment 16 Luke Kenneth Casson Leighton 2019-05-11 10:40:08 BST
(In reply to Jacob Lifshay from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #14)
> > looking at the diagram from ep08-21.pdf on page 36 (figure 3.15)
> > 
> > * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1
> >   (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01
> shouldn't that be -1 = 0b11 assuming it's in 2s complement?

err... the paper does say -1, 0 or +1 (as does Takagi), this will need
to be checked.

> > 
> > * |q[j+1]| probably means "absolute value" of q[j+1], in other words
> >   when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1),
> >   return 0b10 (+1).
> I think you meant 0b11 (-1), return 0b01 (+1). 
> Even if they're not 2s complement, 1 and -1 shouldn't be identical.

yep got it, well spotted.
Comment 17 Jacob Lifshay 2019-05-11 10:44:15 BST
(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Jacob Lifshay from comment #15)
> > (In reply to Luke Kenneth Casson Leighton from comment #14)
> > > looking at the diagram from ep08-21.pdf on page 36 (figure 3.15)
> > > 
> > > * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1
> > >   (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01
> > shouldn't that be -1 = 0b11 assuming it's in 2s complement?
> 
> err... the paper does say -1, 0 or +1 (as does Takagi), this will need
> to be checked.
yeah, it is -1, 0, or 1. I had meant that -1 in 2s complement is 0b11, not 0b10.
Comment 18 Luke Kenneth Casson Leighton 2019-05-11 10:53:12 BST
(In reply to Jacob Lifshay from comment #17)

> > err... the paper does say -1, 0 or +1 (as does Takagi), this will need
> > to be checked.
> yeah, it is -1, 0, or 1. I had meant that -1 in 2s complement is 0b11, not
> 0b10.

yes, very strange. will need to see what happens on implementation.
Comment 19 Luke Kenneth Casson Leighton 2019-05-11 16:16:30 BST
> Could it be one's complement?  Does anyone still use that for anything?
> -- hendrik

quite possibly, because this is carry-save representation.  honestly
though i am not so certain; i can't find anything that says "CSA is
done using 1's complement".
Comment 20 Luke Kenneth Casson Leighton 2019-05-18 10:55:14 BST
Created attachment 14 [details]
4-2 (and 3-2?) carry-save adder

CSAW2 and CSAW1 seem to be very simple straightforward 4-in 2-out and 3-in 2-out
carry-save adders.  attached diagram shows a 3-2 adder that can be chained
to create a 4-2.
Comment 21 Luke Kenneth Casson Leighton 2019-05-18 11:00:54 BST
i found an expired patent
https://patents.google.com/patent/US7620677B2/en

which at the very least contains links to prior art that
describe a 4-2 CSA and a 3-2 CSA.

if that's correct, it just leaves the "DS" function - "Digit Select"
which is described as "a 6-bit carry-propagate adder and simple
constant value comparators".

so the input would appear to be in a redundant (carry-save) form,
that has to be added up in order to make the comparison to see
if it's within 2 specified ranges.
Comment 22 Luke Kenneth Casson Leighton 2019-05-18 12:00:54 BST
CSA 3-2:
-------

SUM = A XOR B XOR C
CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C))

CSA 4-2:
-------

S0, Cout = CSA32(A,B,C)
SUM, CRY = CSA32(S0, D, Cin)

that err seems to be it, where normally there would be 5 inputs
and 3 outputs (Cin ripples through to Cout).

so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?)

CSA 4-2 would presumably be much simpler?
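The 3-2 equations and the chained 4-2 construction can be modelled bitwise in Python to check the arithmetic (an illustrative sketch, not repository code):

def csa_3_2(a, b, c):
    # 3:2 carry-save adder over integer bit-vectors:
    # invariant a + b + c == s + (cy << 1)
    s = a ^ b ^ c
    cy = (a & b) | (b & c) | (a & c)
    return s, cy

def csa_4_2(a, b, c, d, cin=0):
    # 4:2 compressor built from two chained 3:2 CSAs, as above:
    # invariant a + b + c + d + cin == s + ((cy + cout) << 1);
    # cout = majority(a, b, c), so there is no path from cin to cout
    s0, cout = csa_3_2(a, b, c)
    s, cy = csa_3_2(s0, d, cin)
    return s, cy, cout

quick check: with a = b = c = d = 0b1111 and cin = 0, csa_4_2 returns s, cy, cout satisfying s + ((cy + cout) << 1) == 60.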
Comment 23 Jacob Lifshay 2019-05-18 20:09:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #22)
> CSA 3-2:
> -------
> 
> SUM = A XOR B XOR C
> CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C))
> 
> CSA 4-2:
> -------
> 
> S0, Cout = CSA32(A,B,C)
> SUM, CRY = CSA32(S0, D, Cin)
> 
> that err seems to be it, where normally there would be 5 inputs
> and 3 outputs (Cin ripples through to Cout).
> 
> so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?)
> 
> CSA 4-2 would presumably be much simpler?

it's csa 4-2. notice that there isn't a path from cin to cout.

I personally prefer csa 3-2 since it takes half as much logic but adds more than half as many inputs for the same number of outputs, so it's more efficient.
Comment 24 Luke Kenneth Casson Leighton 2019-05-18 20:35:34 BST
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > CSA 3-2:
> > -------
> > 
> > SUM = A XOR B XOR C
> > CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C))
> > 
> > CSA 4-2:
> > -------
> > 
> > S0, Cout = CSA32(A,B,C)
> > SUM, CRY = CSA32(S0, D, Cin)
> > 
> > that err seems to be it, where normally there would be 5 inputs
> > and 3 outputs (Cin ripples through to Cout).
> > 
> > so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?)
> > 
> > CSA 4-2 would presumably be much simpler?
> 
> it's csa 4-2. notice that there isn't a path from cin to cout.

Am assuming it is for use in chaining.

> I personally prefer csa 3-2 since it takes half as much logic but adds more
> than half as many inputs for the same number of outputs, so it's more
> efficient.

The Takagi paper specifies an algorithm
that requires both.

A ripple carry adder is also mentioned.
That will be the next one to look up.
Comment 25 Luke Kenneth Casson Leighton 2019-05-18 20:39:43 BST
Created attachment 15 [details]
6 bit ripple add

6 bit ripple adder
Comment 26 Jacob Lifshay 2019-05-18 20:47:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> Created attachment 15 [details]
> 6 bit ripple add
> 
> 6 bit ripple adder

the attachment is just a 3-bit + 3-bit -> 4-bit adder.
a = Signal(3)
b = Signal(3)
y = Cat(a, C(0, 1)) + Cat(b, C(0, 1))
Comment 27 Luke Kenneth Casson Leighton 2019-05-18 20:48:41 BST
Ok, found a 6 bit ripple carry adder.
It produces a 4 bit answer, however only
the top 2 bits and sign matter.
According to the Takagi paper, DS needs
a number that is signed and is compared
against 3 thresholds:
-0.25
0
0.5

something like that.

This would be in 2 bits plus the sign.

This gives the digit qj+1, which must be either -1, 0 or +1.

The algorithm can be checked against sqrt
because apparently the OTFC stage produces
both Isqrt and Sqrt.
Comment 28 Luke Kenneth Casson Leighton 2019-05-18 20:57:53 BST
(In reply to Jacob Lifshay from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > Created attachment 15 [details]
> > 6 bit ripple add
> > 
> > 6 bit ripple adder
> 
> the attachment is just a 3-bit + 3-bit -> 4-bit adder.
> a = Signal(3)
> b = Signal(3)
> y = Cat(a, C(0, 1)) + Cat(b, C(0, 1))

Ah ok.

https://www.researchgate.net/publication/234151562_An_FPGA_Based_Generic_Framework_for_High_Speed_Sum_of_Absolute_Difference_Implementation/download

Apparently Takagi may mean just 2 Full Adders chained together.

6 bits in, 3 bits out. Might need 2 Full Adders and 1 Half.

Arg! :) Just the worst when academics assume entire areas of knowledge on the part of readers.
Comment 29 Jacob Lifshay 2019-06-24 14:49:25 BST
writing this down here so I don't forget:
rsqrt can be calculated by a binary search for 1, by adding powers of 2 (such as 0.5, 0.25, 0.125) to x in the equation 1 = a * x^2, where a is the input to rsqrt and x starts at 1.0.

to evaluate quickly, the next value of a * x^2 can be calculated by addition from the previous a * x^2: replacing x with x + 2^b and using the identity a * (x + 2^b)^2 = a * x^2 + a * x * 2^(b+1) + a * 2^(2*b), where b is usually a negative integer that is constant for each pipeline stage.

one benefit of this method is that all the arithmetic can be done exactly, and as a result you know for sure how to round the result.
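A Python sketch of this method using exact arithmetic (hypothetical, not the eventual pipeline code; it starts x at 0 and treats the 2^0 bit like any other, which for a pre-normalized a in (1/4, 1] is equivalent to starting at 1.0):

from fractions import Fraction

def rsqrt_bits(a, frac_bits=24):
    # build x ~= 1/sqrt(a) one bit at a time: greedily add powers of
    # two while keeping a * x**2 <= 1.  a*x and a*x**2 are updated by
    # addition only, via a*(x + 2**b)**2 = a*x**2 + a*x*2**(b+1) + a*2**(2*b).
    # Fraction keeps every step exact, so 1 - a*x**2 is an exact
    # remainder that determines the rounding.
    a = Fraction(a)
    x = ax = ax2 = Fraction(0)
    for b in range(0, -frac_bits - 1, -1):
        step = Fraction(2) ** b
        trial = ax2 + 2 * ax * step + a * step * step
        if trial <= 1:                      # this bit of x can be kept
            ax2, ax, x = trial, ax + a * step, x + step
    return x, 1 - ax2                       # result and exact remainder

e.g. rsqrt_bits(Fraction(1, 2))[0] evaluates to roughly 1.4142, i.e. 1/sqrt(0.5).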
Comment 30 Luke Kenneth Casson Leighton 2019-06-24 14:56:18 BST
(In reply to Jacob Lifshay from comment #29)
> writing this down here so I don't forget:
> rsqrt can be calculated by a binary search for 1, by adding powers of 2 (such
> as 0.5, 0.25, 0.125) to x in the equation 1 = a * x^2, where a is the input
> to rsqrt and x starts at 1.0.

 ooo it's a variant on that sqrt algorithm, the one that moves values
 from a to b in the factorisation of a^2 + 2ab + b^2 [a(a + 2b) + b^2].

 nice.
Comment 31 Jacob Lifshay 2019-06-28 03:23:46 BST
If no one else is working on it, I'll start working on implementing the div/mod/fdiv/fsqrt/frsqrt pipeline.
Comment 32 Luke Kenneth Casson Leighton 2019-06-28 05:06:07 BST
(In reply to Jacob Lifshay from comment #31)
> If no one else is working on it, I'll start working on implementing the
> div/mod/fdiv/fsqrt/frsqrt pipeline.

keep it short, keep in touch continuously (every few days), and use .pyi files,
do *not* put type information into the .py files, and for FP, follow the
format of the pipeline code rather than create an entirely new pipeline design
(use FPMUL and FPADD as cookie-cut templates).  otherwise we waste time
and money reworking two sets of completely disparate code.

also keep the integer and FP pipelines *separate* for now: the FPU code
is seriously complicated already, and in actual usage, int would hold up FP
and FP would hold up int.

i'm looking to put in another funding application to cover a "comprehensive"
version of the isqrt/fsqrt/fdiv code, including formal verification, for
which programmer-mathematicians could be attracted to help out.

that means keeping the timescales on FPDIV/ISQRT/SQRT *real* short.
Comment 33 Jacob Lifshay 2019-06-28 05:56:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #32)
> (In reply to Jacob Lifshay from comment #31)
> > If no one else is working on it, I'll start working on implementing the
> > div/mod/fdiv/fsqrt/frsqrt pipeline.
> 
> keep it short, keep in touch continuously (every few days), and use .pyi
> files,
> do *not* put type information into the .py files, and for FP, follow the
> format of the pipeline code rather than create an entirely new pipeline
> design
> (use FPMUL and FPADD as cookie-cut templates).  otherwise we waste time
> and money reworking two sets of completely disparate code.
I was planning on creating an arithmetic pipeline similar to the multiplier in that operations just travel in a simple manner from start to finish and we can add on all the pipeline control circuitry externally (since we still haven't fully resolved how pipeline control will work).

> also keep the integer and FP pipelines *separate* for now: the FPU code
> is seriously complicated already, and in actual usage, int would hold up FP
> and FP would hold up int.
Actually, the fpdiv uses an integer div+mod internally, so having separate fp and integer pipelines here actually makes it more complicated. The only int operation supported is div+mod.

> i'm looking to put in another funding application to cover a "comprehensive"
> version of the isqrt/fsqrt/fdiv code, including formal verification, for
> which programmer-mathematicians could be attracted to help out.
I'm planning on implementing the algorithms such that they create an exact result at all times: the binary result and a remainder (or the equivalent for sqrt/rsqrt). Those can then be converted to the mantissa and guard/round/sticky bits, and relatively simply proven to be correct for all inputs.
> 
> that means keeping the timescales on FPDIV/ISQRT/SQRT *real* short.
Since the algorithms are relatively simple, I'm expecting the fixed-point core to be done in 1-2 days and then the FP unpacking/normalization/special-case-handling can be added on the front and back.
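To illustrate the exact-result-plus-remainder rounding described above (names and bit positions here are assumptions, not the actual code): with two extra result bits below the mantissa and the exact remainder from the recurrence, guard/round/sticky fall out directly.

def to_mantissa_grs(q, remainder):
    # q carries the mantissa plus 2 extra low bits (guard, round);
    # anything smaller than those is summarised by the exact remainder,
    # which collapses into the sticky bit
    mantissa = q >> 2
    guard = (q >> 1) & 1
    round_bit = q & 1
    sticky = int(remainder != 0)
    return mantissa, guard, round_bit, sticky

round-to-nearest-even is then just a function of guard, (round_bit | sticky) and the mantissa's LSB.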
Comment 34 Luke Kenneth Casson Leighton 2019-06-28 06:09:28 BST
(In reply to Jacob Lifshay from comment #33)

> I was planning on creating an arithmetic pipeline similar to the multiplier
> in that operations just travel in a simple manner from start to finish and
> we can add on all the pipeline control circuitry externally (since we still
> haven't fully resolved how pipeline control will work).

 it turns out that scoreboard-style designs don't need stallable pipelines.
 i'd like to keep using the same FP code style (same class structure, same
 base classes) so that any code-morphing can be done to *all* FP* units.

> > also keep the integer and FP pipelines *separate* for now: the FPU code
> > is seriously complicated already, and in actual usage, int would hold up FP
> > and FP would hold up int.
> Actually, the fpdiv uses an integer div+mod internally, so having separate
> fp and integer pipelines here actually makes it more complicated. The only
> int operation supported is div+mod.

 i mean: creating a pipeline-stage bypass system where either
 INT or FP data can be put into the same unit, and the INT version
 skips the de/-normalisation, that's doable.

 creating an infrastructure that *shares* the integer pipeline stages:
 that's way more complex.

 however *for now*, please don't do either of those things: the FP code
 is complex enough as it is.  see below.

> Since the algorithms are relatively simple, I'm expecting the fixed-point
> core to be done in 1-2 days and then the FP
> unpacking/normalization/special-case-handling can be added on the front and
> back.

that's a good strategy.  you saw the ep08_21.pdf thesis i found, which
contains OTFC radix-2/4/8 INT-DIV?

can you also, for now, create _separate_ fixed-point pipeline classes
(each stage separated out into its own class), put them together to create
a fixed-point div/mod, with unit tests, then move on to use the *same*
classes within the FP code.

you'll find that there already exist unpacking, normalisation and de-normalisation
classes, and pipelines that can be used.  the only major missing class that
needs writing is the "FP DIV special cases code" - i can do that, based on
Jon Dawson's FPU state-machine.

also, please don't put the classes in the simple-barrel-processor repository:
put them in the FPU repository where they belong.  or, we create a separate
"INT ALU" repository, or they go directly in the soc repository.
Comment 35 Luke Kenneth Casson Leighton 2019-06-28 06:27:32 BST
https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=ca5cce9412df04c5887387c9efe7b5f87082fce0

done.  copied FPMUL's specialcases.py, sat it side-by-side with
ieee754/fpdiv/nmigen_div_experiment.py.

untested: it doesn't have syntax errors.

btw there's no version of ieee754/fpmul/test/test_fpmul_pipe.py
yet, because there's only the FSM-based (jon dawson) variant of
FPDIV, not a corresponding pipelined version.

test_fpmul_pipe.py is the template to use to start creating an
FPDIV... actually, it's such an easy cookie-cut i might as
well do it: that will be ieee754/fpdiv/pipeline.py

then instead of from .mulstages import FPMulStages
that would be:

from .divstages import FPDivStages

and that's literally the only class that's needed to get an
operational FPDIV multi-in, multi-out ReservationStations-style pipeline
up and running.
Comment 36 Luke Kenneth Casson Leighton 2019-06-28 07:14:55 BST
ok that's basically it: the infrastructure's in place.  it should
be clear that:

* FPDivStage0Mod is the first class that converts incoming
  (normalised) data into Q+R (whatever: the stuff that
  eventually becomes the div and rem)
* FPDivStage1Mod is part of the huuuuge chain that will
  take Q+R (whatever) and output Q+R (whatever)
* FPDivStage2Mod is the *last* one that takes Q+R (whatever)
  as input and drops it into the mantissa.

given that div *may* need the mantissa normalising to within a
range 0.5 to 1.0 (or somesuch), we *may* need an *extra*
pre-normalising stage.

it should be pretty clear from divstages.py how that can be
added.


index 94c8a78..e1f8b11 100644
--- a/src/ieee754/fpdiv/divstages.py
+++ b/src/ieee754/fpdiv/divstages.py
@@ -16,6 +16,7 @@ from ieee754.fpcommon.postcalc import FPAddStage1Data
 # TODO: write these
 from .div0 import FPDivStage0Mod
 from .div1 import FPDivStage1Mod
+from .div2 import FPDivStage2Mod
 
 
 class FPDivStages(FPState, SimpleHandshake):
@@ -40,10 +41,12 @@ class FPDivStages(FPState, SimpleHandshake):
         # TODO.  clearly, this would be a for-loop, here, creating
         # a huge number of stages (if radix-2 is used).  interestingly
         # the number of stages will be data-dependent.
-        m0mod = FPDivStage0Mod(self.width, self.id_wid)
-        m1mod = FPDivStage1Mod(self.width, self.id_wid)
+        divstages = [FPDivStage0Mod(self.width, self.id_wid)]
+        for i in range(self.width): # XXX TODO: work out actual number needed
+            divstages.append(FPDivStage1Mod(self.width, self.id_wid))
+        divstages.append(FPDivStage2Mod(self.width, self.id_wid))
 
-        chain = StageChain([m0mod, m1mod])
+        chain = StageChain(divstages)
         chain.setup(m, i)
 
         self.o = m1mod.o
Comment 37 Jacob Lifshay 2019-07-11 11:51:16 BST
see isa-dev proposal: https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SLprSJzpgCQ
Comment 38 Luke Kenneth Casson Leighton 2019-07-23 17:58:33 BST
(In reply to Jacob Lifshay from comment #37)
> see isa-dev proposal:
> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/
> SLprSJzpgCQ

recorded this under bug #110
Comment 39 Luke Kenneth Casson Leighton 2019-07-25 08:50:13 BST
context:
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/002161.html

On Thu, Jul 25, 2019 at 7:52 AM Jacob Lifshay <programmerjake@gmail.com> wrote:

> I'm reverting the change to DivPipeCore while I fix the problems.

 that only stops fpsqrt working.

> I'm
> going to also try to convert the division tests to use unittest to
> make it compatible with pytest-xdist.

ah cool - they're runnable as "python3 xxx/xxx/test/test_xxx.py" still.

> I also added a unittest.skip
> annotation to the divpipecore tests wider than 8 bits, since 8 bits is
> enough to catch almost all the bugs and it completes in 60s instead of
> a really long time.

great.  i vaguely recall that nosetests3 has a way to run "longer" tests.
Comment 41 Jacob Lifshay 2019-07-25 14:27:14 BST
don't mark fixed yet, need to add more tests
Comment 43 Luke Kenneth Casson Leighton 2019-07-28 22:28:42 BST
Would like to close this one, leave FP64 RSQRT testing for a different bugreport, bug #122 - what do you think?
Comment 44 Jacob Lifshay 2019-07-28 22:32:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #43)
> Would like to close this one,
not yet, we don't have any special case tests for fsqrt or frsqrt.

> leave FP64 RSQRT testing for a different
> bugreport, bug #122 - what do you think?

yeah, we should have fp64 frsqrt testing be a different bug.
Comment 45 Luke Kenneth Casson Leighton 2019-07-28 22:48:38 BST
(In reply to Jacob Lifshay from comment #44)
> (In reply to Luke Kenneth Casson Leighton from comment #43)
> > Would like to close this one,
> not yet, we don't have any special case tests for fsqrt or frsqrt.

Yepyep, must cut/paste e.g. the fcvt unit test for those.
Comment 46 Luke Kenneth Casson Leighton 2019-07-29 15:05:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #45)
> (In reply to Jacob Lifshay from comment #44)
> > (In reply to Luke Kenneth Casson Leighton from comment #43)
> > > Would like to close this one,
> > not yet, we don't have any special case tests for fsqrt or frsqrt.
> 
> Yepyep, must cut/paste e.g. the fcvt unit test for those.

added; ran all of the fsqrt and frsqrt ones @ 16 and 32; test_fpsqrt_pipe_64.py
still running.