for 32-bit, to use "wtf" algorithm: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code 16-bit and 64-bit, requires thought.

(In reply to Luke Kenneth Casson Leighton from comment #0) > for 32-bit, to use "wtf" algorithm: > https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code > > 16-bit and 64-bit, requires thought. The "wtf" algorithm is not well suited for our needs, since Vulkan requires a much more accurate inv-sqrt operation than that algorithm provides, it requires several floating-point mul-adds, and adding more iterations of newton's method to fix the accuracy problems will increase the latency well beyond what a radix-4 digit-recurrence algorithm needs due to needing several passes through a multiplier. One paper I found that may be useful: A Hardware Algorithm for Computing Reciprocal Square Root: https://scholar.google.com/scholar?cluster=4957662328282312533&hl=en&as_sdt=0,48 http://www.acsel-lab.com/arithmetic/arith15/papers/ARITH15_Takagi.pdf

(In reply to Jacob Lifshay from comment #1) > The "wtf" algorithm is not well suited for our needs, since Vulkan requires > a much more accurate inv-sqrt operation than that algorithm provides Vulkan requires an accuracy of 2 ULP (Units in the Last Place): https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation search for inversesqrt

(In reply to Jacob Lifshay from comment #1) > (In reply to Luke Kenneth Casson Leighton from comment #0) > > for 32-bit, to use "wtf" algorithm: > > https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code > > > > 16-bit and 64-bit, requires thought. > > The "wtf" algorithm is not well suited for our needs, since Vulkan requires > a much more accurate inv-sqrt operation than that algorithm provides, it > requires several floating-point mul-adds, and adding more iterations of > newton's method to fix the accuracy problems will increase the latency well > beyond what a radix-4 digit-recurrence algorithm needs due to needing > several passes through a multiplier. drat. > One paper I found that may be useful: > A Hardware Algorithm for Computing Reciprocal Square Root: > https://scholar.google.com/ > scholar?cluster=4957662328282312533&hl=en&as_sdt=0,48 > http://www.acsel-lab.com/arithmetic/arith15/papers/ARITH15_Takagi.pdf ok good find. it's a little hard to interpret, i do like how it has radix-2 and radix-4 implementations, radix-2 multiply basically being "multiply by -1, 0 or 1", radix-4 multiply including only "-2 and +2" in that, which is a bit-shift.

different approach: https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala both these, as best i can tell, use a polynomial lookup table which requires only 1 iteration of newton-raphson to give full accuracy.

for lolz... https://people.ece.cornell.edu/land/courses/ece5760/StudentWork/mje56/FPmath.v

(In reply to Luke Kenneth Casson Leighton from comment #4) > different approach: > > https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf there was a server error when I tried to access this > https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala as far as I can tell, this is only a lookup table. > both these, as best i can tell, use a polynomial lookup table which requires > only 1 iteration of newton-raphson to give full accuracy. I think it would be wise to avoid newton-raphson since it needs several multiplications and will occupy the multiplier for quite a few cycles.

(In reply to Jacob Lifshay from comment #6) > (In reply to Luke Kenneth Casson Leighton from comment #4) > > different approach: > > > > https://rei.iteso.mx/bitstream/handle/11117/5844/Design+and+Implementation+of+Reciprocal+Square+Root+Units+on+Digital+ASIC+Technology+For+Low+Power+Embedded+Applications.pdf > there was a server error when I tried to access this > > > https://github.com/tomverbeure/math/blob/master/src/main/scala/math/FpxxRSqrt.scala > > as far as I can tell, this is only a lookup table. > > > both these, as best i can tell, use a polynomial lookup table which requires > > only 1 iteration of newton-raphson to give full accuracy. > > I think it would be wise to avoid newton-raphson since it needs several > multiplications and will occupy the multiplier for quite a few cycles. The fixed rsqrt algorithm i saw was able to do the mult in a single cycle, however it may have been a smaller mantissa. The algorithm you found, I really like it, only the description is really obtuse. Assuming we can do a 4 radix version, 2 of those can be combinatorially chained to give 8bit per cycle, and for FP32 that gives only around a 5 stage pipeline with basic single-shift, add and MUX blocks. What is the sort of performance actually needed? Is this critical for certain circumstances?

(In reply to Luke Kenneth Casson Leighton from comment #7) > What is the sort of performance actually needed? Is this critical for > certain circumstances? Some common lighting algorithms use per-pixel interpolated surface normal vectors, those need to be normalized before they can be used in lighting calculations, requiring a 3x3 dot product, an inverse-sqrt, and a 1x3 scalar vector multiplication. So, we should be able to support at least 1 inverse-sqrt per pixel or so (with possibly somewhat reduced framerate). Note that this is in addition to the depth division that is required per-pixel for basic 3D, so budget accordingly.

the Tanako paper is extremely obtuse, and assumes access to IEEE papers is guaranteed. i found an implementation of "On the Fly Conversion": https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/ep08_21.pdf this paper does not require access to paywalls. page 36.

(In reply to Luke Kenneth Casson Leighton from comment #9) > the Tanako paper is extremely obtuse, and assumes access to IEEE papers > is guaranteed. i found an implementation of "On the Fly Conversion": > https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/ > ep08_21.pdf > > this paper does not require access to paywalls. page 36. As far as I can tell, this paper doesn't actually have a rsqrt or sqrt algorithm.

(In reply to Jacob Lifshay from comment #10) > (In reply to Luke Kenneth Casson Leighton from comment #9) > > the Tanako paper is extremely obtuse, and assumes access to IEEE papers > > is guaranteed. i found an implementation of "On the Fly Conversion": > > https://web.archive.org/web/20170808225020/http://etd.dtu.dk/thesis/211563/ > > ep08_21.pdf > > > > this paper does not require access to paywalls. page 36. > > As far as I can tell, this paper doesn't actually have a rsqrt or sqrt > algorithm. it doesn't... what it *does* have is implementations of the primitives which are *assumed* that the reader has access to or knows about (from paywalled literature). OTFC - on-the-fly-conversion - is available according to the Tanako paper - in a paywalled journal. searching "OTFC" is near-impossible to find, it is very obscure... the only paper i could find that actually describes it has been *removed* from the original location (etd.dtu.dk - the entire server is down). likewise, it contains implementations of Carry-Save Adders (p122) which *again* the Tanako paper *ASSUMES* that the reader is either familiar with or has access to (paywalled) journals containing that information. without these algorithms - OTFC and CSA - the Tanako paper is effectively useless, unfortunately. frustratingly, they're incredibly simple blocks. the notes in the Tanako paper state that OTFC may be implemented with a couple of MUXes and a shifter.

http://verilogcodes.blogspot.com/2017/11/verilog-code-for-carry-save-adder-with-testbench.html module CSA ( input [3:0] x,y,z, output [4:0] s, output cout ); wire [3:0] c1,s1,c2; fulladder fa_inst10(x[0],y[0],z[0],s1[0],c1[0]); fulladder fa_inst11(x[1],y[1],z[1],s1[1],c1[1]); fulladder fa_inst12(x[2],y[2],z[2],s1[2],c1[2]); fulladder fa_inst13(x[3],y[3],z[3],s1[3],c1[3]); fulladder fa_inst20(s1[1],c1[0],1'b0,s[1],c2[1]); fulladder fa_inst21(s1[2],c1[1],c2[1],s[2],c2[2]); fulladder fa_inst22(s1[3],c1[2],c2[2],s[3],c2[3]); fulladder fa_inst23(1'b0,c1[3],c2[3],s[4],cout); assign s[0] = s1[0]; module tb_adder; reg [3:0] x,y,z; wire [4:0] s; wire cout; integer i,j,k,error; // Instantiate the Unit Under Test (UUT) CSA uut (x,y,z,s,cout); //Stimulus block - all the input combinations are tested here. //the number of errors are recorded in the signal named "error". initial begin // Initialize Inputs x = 0; y = 0; z = 0; error = 0; //three for loops to test all input combinations. for(i=0;i<16;i=i+1) begin for(j=0;j<16;j=j+1) begin for(k=0;k<16;k=k+1) begin x = i; y = j; z = k; #10; if({cout,s} != (i+j+k)) error <= error + 1; end end end end endmodule

https://codestall.org/2017/06/26/carry-save-adder/ module csad( output [7:0]s, output co, input [7:0]a,b, input ci); wire c1,c2,c3; wire [3:0]s1,s2; rcad u1(s[3:0], c1, a[3:0], b[3:0], ci); rcad u2(s1, c2, a[7:4], b[7:4], 1’b0); rcad u3(s2, c3, a[7:4], b[7:4], 1’b1); mux4b_2_1 u4(s[7:4], s1, s2, c1); mux2_1 u5(co, c2, c3, c1); endmodule module mux4b_2_1( output [3:0]y, input [3:0]a,b, input c); assign y = c ? b : a; endmodule module mux2_1( output y, input a, b, c); assign y = c ? b : a; endmodule

looking at the diagram from ep08-21.pdf on page 36 (figure 3.15) * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1 (top of p35), in binary -1 = 0b10 [edit: 0b11??], 0 = 0b00, 1 = 0b01 * |q[j+1]| probably means "absolute value" of q[j+1], in other words when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1) [edit: 0b11??), return 0b01 (+1). * r is the "radix", confirmed (inferred) from the top paragraph of page 33 * the key equations to produce Q(next) and QM(next) are given in (3.21) and (3.22): if q[j+1] >= 0: Q(next) = Q << 1 | q[j+1] else: Q(next) = QM << 1 | (2-abs(q[j+1])) if q[j+1] > 0: QM(next) = QM << 1 | (q[j+1]-1) else: QM(next) = Q << 1 | (1-abs(q[j+1])) those sums, "q[j+1]-1" "2-abs(q[j+1])" and "1-abs(q[j+1])" are in 2-bit signed binary, so should be trivial to produce logic for.

(In reply to Luke Kenneth Casson Leighton from comment #14) > looking at the diagram from ep08-21.pdf on page 36 (figure 3.15) > > * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1 > (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01 shouldn't that be -1 = 0b11 assuming it's in 2s complement? > > * |q[j+1]| probably means "absolute value" of q[j+1], in other words > when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1), > return 0b10 (+1). I think you meant 0b11 (-1), return 0b01 (+1). Even if they're not 2s complement, 1 and -1 shouldn't be identical. I could be wrong, since I didn't actually check the paper.

(In reply to Jacob Lifshay from comment #15) > (In reply to Luke Kenneth Casson Leighton from comment #14) > > looking at the diagram from ep08-21.pdf on page 36 (figure 3.15) > > > > * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1 > > (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01 > shouldn't that be -1 = 0b11 assuming it's in 2s complement? err... the paper does say -1, 0 or +1 (as does Tanako), this will need to be checked. > > > > * |q[j+1]| probably means "absolute value" of q[j+1], in other words > > when q[j+1] == 0b00, return 0b00, and when q[j+1] == 0b10 (-1), > > return 0b10 (+1). > I think you meant 0b11 (-1), return 0b01 (+1). > Even if they're not 2s complement, 1 and -1 shouldn't be identical. yep got it, well spotted.

(In reply to Luke Kenneth Casson Leighton from comment #16) > (In reply to Jacob Lifshay from comment #15) > > (In reply to Luke Kenneth Casson Leighton from comment #14) > > > looking at the diagram from ep08-21.pdf on page 36 (figure 3.15) > > > > > > * q[j+1] in radix 2 is a 2-bit signed binary number, -1 0 or +1 > > > (top of p35), in binary -1 = 0b10, 0 = 0b00, 1 = 0b01 > > shouldn't that be -1 = 0b11 assuming it's in 2s complement? > > err... the paper does say -1, 0 or +1 (as does Tanako), this will need > to be checked. yeah, it is -1, 0, or 1. I had meant that -1 in 2s complement is 0b11, not 0b10.

(In reply to Jacob Lifshay from comment #17) > > err... the paper does say -1, 0 or +1 (as does Tanako), this will need > > to be checked. > yeah, it is -1, 0, or 1. I had meant that -1 in 2s complement is 0b11, not > 0b10. yes, very strange. will need to see what happens on implementation.

> Could it be one's complemet? Does anyone still use that for anything? > -- hendrik quite possibly, because this is carry-save representation. honestly though i am not so certain, i can't find anything that says "CSA is done using 1's complement".

Created attachment 14 [details] 4-2 (and 3-2?) carry-save adder CSAW2 and CSAW1 seem to be very simple straightforward 4-in 2-out and 3-in 2-out carry-save adders. attached diagram shows a 3-2 adder that can be chained to create a 4-2.

i found an expired patent https://patents.google.com/patent/US7620677B2/en which at the very least contains links to prior art that describe a 4-2 CSA and a 3-2 CSA adder. if that's correct, it just leaves the "DS" function - "Digit Select" which is described as "a 6-bit carry-propagate adder and simple constant value comparators". so the input would appear to be in a redundant (carry-save) form, that has to be added up in order to make the comparison to see if it's within 2 specified ranges.

CSA 3-2: ------- SUM = A XOR B XOR C CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C)) CSA 4-2: ------- S0, Cout = CSA32(A,B,C) SUM, CRY = CSA32(S0, D, Cin) that err seems to be it, where normally there would be 5 inputs and 3 outputs (Cin ripples through to Cout). so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?) CSA 4-2 would presumably be much simpler?

(In reply to Luke Kenneth Casson Leighton from comment #22) > CSA 3-2: > ------- > > SUM = A XOR B XOR C > CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C)) > > CSA 4-2: > ------- > > S0, Cout = CSA32(A,B,C) > SUM, CRY = CSA32(S0, D, Cin) > > that err seems to be it, where normally there would be 5 inputs > and 3 outputs (Cin ripples through to Cout). > > so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?) > > CSA 4-2 would presumably be much simpler? it's csa 4-2. notice that there isn't a path from cin to cout. I personally prefer csa 3-2 since it takes half as much logic but adds more than half as many inputs for the same number of outputs, so it's more efficient.

(In reply to Jacob Lifshay from comment #23) > (In reply to Luke Kenneth Casson Leighton from comment #22) > > CSA 3-2: > > ------- > > > > SUM = A XOR B XOR C > > CRY = NAND(NAND (A,B), NAND(B,C), NAND(A,C)) > > > > CSA 4-2: > > ------- > > > > S0, Cout = CSA32(A,B,C) > > SUM, CRY = CSA32(S0, D, Cin) > > > > that err seems to be it, where normally there would be 5 inputs > > and 3 outputs (Cin ripples through to Cout). > > > > so err it's not CSA 4-2 at all, it's CSA 5-3 (oink?) > > > > CSA 4-2 would presumably be much simpler? > > it's csa 4-2. notice that there isn't a path from cin to cout. Am assuming it is for use in chaining. > I personally prefer csa 3-2 since it takes half as much logic but adds more > than half as many inputs for the same number of outputs, so it's more > efficient. The Tanako paper specifies an algorithm that requires both. A ripple carry adder is also mentioned. That will be the next one to look up.

Created attachment 15 [details] 6 bit ripple add 6 bit ripple adder

(In reply to Luke Kenneth Casson Leighton from comment #25) > Created attachment 15 [details] > 6 bit ripple add > > 6 bit ripple adder the attachment is just a 3-bit + 3-bit -> 4-bit adder. a = Signal(3) b = Signal(3) y = Cat(a, C(0, 1)) + Cat(b, C(0, 1))

Ok found a 6 bit ripple carry adder, It produces a 4 bit answer, however only the top 2 bits and sign matter. According to the Tanako paper DS needs a number that is signed and is between3 thresholds -0.25 0 0.5 something like that. This would be in 2 bits plus the sign. This gives the 1 bit qj+1 which must be either -1 0 or +1 The algorithm can be checked against sqrt because apparently the OTFC stage produces both Isqrt and Sqrt.

(In reply to Jacob Lifshay from comment #26) > (In reply to Luke Kenneth Casson Leighton from comment #25) > > Created attachment 15 [details] > > 6 bit ripple add > > > > 6 bit ripple adder > > the attachment is just a 3-bit + 3-bit -> 4-bit adder. > a = Signal(3) > b = Signal(3) > y = Cat(a, C(0, 1)) + Cat(b, C(0, 1)) Ah ok. https://www.researchgate.net/publication/234151562_An_FPGA_Based_Generic_Framework_for_High_Speed_Sum_of_Absolute_Difference_Implementation/download Apparently Tanako may mean just 2 Full Adders chained together. 6 bits in, 3 bits out. Might need 2 Full Adders and 1 Half. Arg! :) Just the worst when academics assume entire areas of knowledge by readers.

writing this down here so I don't forget: rsqrt can be calculated by a binary search for 1 by adding powers of 2 (such as 0.5, 0.25, 0.125) to x in the equation 1 = a * x^2 where a is the input to rsqrt, x starts at 1.0 to evaluate quickly, the next value of a * x^2 can be calculated by addition from the previous a * x^2 by replacing x with x + 2^b then using the identity: a * (x + 2^b)^2 = a * x^2 + a * x * 2^(b+1) + a * 2^(2*b) where b is usually a negative integer that is constant for each pipeline stage. one benefit of this method is that all the arithmetic can be done exactly and as a result you know for sure how to round the result.

(In reply to Jacob Lifshay from comment #29) > writing this down here so I don't forget: > rsqrt can be calculated by a binary search for 1 by adding powers of 2 (such > as 0.5, 0.25, 0.125) to x in the equation 1 = a * x^2 where a is the input > to rsqrt, x starts at 1.0 ooo it's a variant on that sqrt algorithm, the one that moves values from a to b in the factorisation of a^2 + 2ab + b^2 [a(a + 2ab) + b^2]. nice.

If no one else is working on it, I'll start working on implementing the div/mod/fdiv/fsqrt/frsqrt pipeline.

(In reply to Jacob Lifshay from comment #31) > If no one else is working on it, I'll start working on implementing the > div/mod/fdiv/fsqrt/frsqrt pipeline. keep it short, keep in touch continously (every few days), and use .pyi files, do *not* put type information into the .py files, and for FP, follow the format of the pipeline code rather than create an entirely new pipeline design (use FPMUL and FPADD as cookie-cut templates). otherwise we waste time and money reworking two sets of completely disparate code. also keep the integer and FP pipelines *separate* for now: the FPU code is seriously complicated already, and in actual usage, int would hold up FP and FP would hold up int. i'm looking to put in another funding application to cover a "comprehensive" version of the isqrt/fsqrt/fdiv code, including formal verification, for which programmer-mathematicians could be attracted to help out. that means keeping the timescales on FPDIV/ISQRT/SQRT *real* short.

(In reply to Luke Kenneth Casson Leighton from comment #32) > (In reply to Jacob Lifshay from comment #31) > > If no one else is working on it, I'll start working on implementing the > > div/mod/fdiv/fsqrt/frsqrt pipeline. > > keep it short, keep in touch continously (every few days), and use .pyi > files, > do *not* put type information into the .py files, and for FP, follow the > format of the pipeline code rather than create an entirely new pipeline > design > (use FPMUL and FPADD as cookie-cut templates). otherwise we waste time > and money reworking two sets of completely disparate code. I was planning on creating an arithmetic pipeline similar to the multiplier in that operations just travel in a simple manner from start to finish and we can add on all the pipeline control circuitry externally (since we still haven't fully resolved how pipeline control will work). > also keep the integer and FP pipelines *separate* for now: the FPU code > is seriously complicated already, and in actual usage, int would hold up FP > and FP would hold up int. Actually, the fpdiv uses an integer div+mod internally, so having separate fp and integer pipelines here actually makes it more complicated. The only int operation supported is div+mod. > i'm looking to put in another funding application to cover a "comprehensive" > version of the isqrt/fsqrt/fdiv code, including formal verification, for > which programmer-mathematicians could be attracted to help out. I'm planning on implementing the algorithms such that they create an exact result at all times, the binary result and a remainder (or the equivalent for sqrt/rsqrt). those can then be converted to the mantissa and guard/round/sticky bits and relatively simply proven to be correct for all inputs. > > that means keeping the timescales on FPDIV/ISQRT/SQRT *real* short. Since the algorithms are relatively simple, I'm expecting the fixed-point core to be done in 1-2 days and then the FP unpacking/normalization/special-case-handling can be added on the front and back.

(In reply to Jacob Lifshay from comment #33) > I was planning on creating an arithmetic pipeline similar to the multiplier > in that operations just travel in a simple manner from start to finish and > we can add on all the pipeline control circuitry externally (since we still > haven't fully resolved how pipeline control will work). it turns out that scoreboard-style designs don't need stallable pipelines. i'd like to keep using the same FP code style (same class structure, same base classes) so that any code-morphing can be done to *all* FP* units. > > also keep the integer and FP pipelines *separate* for now: the FPU code > > is seriously complicated already, and in actual usage, int would hold up FP > > and FP would hold up int. > Actually, the fpdiv uses an integer div+mod internally, so having separate > fp and integer pipelines here actually makes it more complicated. The only > int operation supported is div+mod. i mean: creating a pipeline-stage bypass system where either INT or FP data can be put into the same unit, and the INT version skips the de/-normalisation, that's doable. creating an infrastructure that *shares* the integer pipeline stages: that's way more complex. however *for now*, please don't do either of those things: the FP code is complex enough as it is. see below. > Since the algorithms are relatively simple, I'm expecting the fixed-point > core to be done in 1-2 days and then the FP > unpacking/normalization/special-case-handling can be added on the front and > back. that's a good strategy. you saw the ep08_15.pdf thesis i found, which contains OTFC radix-2/4/8 INT-DIV? can you also, for now, create _separate_ fixed-point pipeline classes, (each stage separated out into its own class), put them together to create a fixed-point div/mod, with unit tests, then move on to use the *same* classes within the FP code. you'll find that there already exist unpacking, normalisation and de-normalisation classes, and pipelines that can be used. the only major missing class that needs writing is the "FP DIV special cases code" - i can do that, based on Jon Dawson's FPU state-machine. also, please don't put the classes in the simple-barrel-processor repository, put them in the FPU repository where they belong. or, we create a separate "INT ALU" repository or they go directly in the soc repository.

https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=ca5cce9412df04c5887387c9efe7b5f87082fce0 done. copied FPMUL's specialcases.py, sat it side-by-side with ieee754/fpdiv/nmigen_div_experiment.py. untested: it doesn't have syntax errors. btw there's no version of ieee754/fpmul/test/test_fpmul_pipe.py yet, because there's only the FSM-based (jon dawson) variant of FPDIV, not a corresponding pipelined version. test_fpmul_pipe.py is the template to use to start creating an FPDIV... actually, it's such an easy cookie-cut i might as well do it: that will be ieee754/fpdiv/pipeline.py then instead of from .mulstages import FPMulStages that would be: from .divstages import FPDivStages and that's literally the only class that's needed to get an operational FPDIV multi-in, multi-out Reservations-Style pipeline up and running.

ok that's basically it: the infrastructure's in place. it should be clear that: * FPDivStage0Mod is the first class that converts incoming (normalised) data into Q+R (whatever: the stuff that eventually becomes the div and rem) * FPDivStage1Mod is part of the huuuuge chain that will take Q+R (whatever) and output Q+R (whatever) * FPDivStage2Mod is the *last* one that takes Q+R (whatever) as input and drops it into the mantissa. given that div *may* need the mantissa normalising to within a range 0.5 to 1.0 (or somesuch), we *may* need an *extra* pre-normalising stage. it should be pretty clear from divstages.py how that can be added. index 94c8a78..e1f8b11 100644 (file) --- a/src/ieee754/fpdiv/divstages.py +++ b/src/ieee754/fpdiv/divstages.py @@ -16,6 +16,7 @@ from ieee754.fpcommon.postcalc import FPAddStage1Data # TODO: write these from .div0 import FPDivStage0Mod from .div1 import FPDivStage1Mod +from .div2 import FPDivStage2Mod class FPDivStages(FPState, SimpleHandshake): @@ -40,10 +41,12 @@ class FPDivStages(FPState, SimpleHandshake): # TODO. clearly, this would be a for-loop, here, creating # a huge number of stages (if radix-2 is used). interestingly # the number of stages will be data-dependent. - m0mod = FPDivStage0Mod(self.width, self.id_wid) - m1mod = FPDivStage1Mod(self.width, self.id_wid) + divstages = [FPDivStage0Mod(self.width, self.id_wid)] + for i in range(self.width): # XXX TODO: work out actual number needed + divstages.append(FPDivStage1Mod(self.width, self.id_wid)) + divstages.append(FPDivStage2Mod(self.width, self.id_wid)) - chain = StageChain([m0mod, m1mod]) + chain = StageChain(divstages) chain.setup(m, i) self.o = m1mod.o

see isa-dev proposal: https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SLprSJzpgCQ

(In reply to Jacob Lifshay from comment #37) > see isa-dev proposal: > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/ > SLprSJzpgCQ recorded this under bug #110

context: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/002161.html On Thu, Jul 25, 2019 at 7:52 AM Jacob Lifshay <programmerjake@gmail.com> wrote: > I'm reverting the change to DivPipeCore while I fix the problems. that only stops fpsqrt working. > I'm > going to also try to convert the division tests to use unittest to > make it compatible with pytest-xdist. ah cool - they're runnable as "python3 xxx/xxx/test/test_xxx.py" still. > I also added a unittest.skip > annotation to the divpipecore tests wider than 8 bits, since 8 bits is > enough to catch almost all the bugs and it completes in 60s instead of > a really long time. great. i vaguely recall that nosetests3 has a way to run "longer" tests.

See http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/002176.html

don't mark fixed yet, need to add more tests

Would like to close this one, leave FP64 RSQRT testing for a different bugreport bug #122, what do you think?

(In reply to Luke Kenneth Casson Leighton from comment #43) > Would like to close this one, not yet, we don't have any special case tests for fsqrt or frsqrt. > leave FP64 RSQRT testing for a different > bugreport bug #122, what do you think? yeah, we should have fp64 frsqrt testing be a different bug.

(In reply to Jacob Lifshay from comment #44) > (In reply to Luke Kenneth Casson Leighton from comment #43) > > Would like to close this one, > not yet, we don't have any special case tests for fsqrt or frsqrt. Yepyep, must cut/paste eg the fcvt unit test for those.

(In reply to Luke Kenneth Casson Leighton from comment #45) > (In reply to Jacob Lifshay from comment #44) > > (In reply to Luke Kenneth Casson Leighton from comment #43) > > > Would like to close this one, > > not yet, we don't have any special case tests for fsqrt or frsqrt. > > Yepyep, must cut/paste eg the fcvt unit test for those. added, run all of the fsqrt and frsqrt ones @ 16 and 32, test_fpsqrt_pipe_64.py still running.