a pipelined FPDIV unit is needed (as opposed to a blocking, FSM-based version)
https://github.com/wmy367/Radix-2-division/blob/master/Radix_2_div.v - an example radix-2 divider which could probably be pipelined.
Created attachment 17 [details] thesis with some pipelined divide and other algorithms
pipelined divide can probably share most of the logic with pipelined sqrt/rsqrt
https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=2789fb65d1d70e70d45c0e2506dfc18d13939bb1

src/ieee754/fpdiv/divstages.py - done
test_fpdiv_pipe.py - done
pipeline.py - done

all cookie-cut. div0.py is an "example" class, again cookie-cut.
FPDIVStage0Data is where the incoming data (from pre-normalisation) gets
turned into "first-stage stuff that contains intermediary results such as
Q and R".
btw you'll see these puzzling things in each stage:

    with m.If(~self.i.out_do_z):

these are activated back in the special-cases:

    # if b is zero return Inf
    with m.Elif(b1.is_zero):
        m.d.comb += self.o.out_do_z.eq(1)
        m.d.comb += self.o.z.inf(sabx)

such that the result (which is already valid, as a special case) can
propagate down the (entire) pipeline, unmodified. this is because i haven't
worked out early-out bypassing, yet. hmm, must write a separate bugreport,
as we need cancellation (go_die).
we need to keep the FPDIV pipeline stage count as short as possible:
RADIX-8 would be the preferred option. this is to reduce the number of
Reservation Stations (Function Units) needed to keep a Concurrent
Computation Unit (aka "pipeline") fully occupied without freezing the
entire issue stage. see here for details:
http://bugs.libre-riscv.org/show_bug.cgi?id=101#c2

basically, if we have a 24-stage pipeline (not unreasonable for radix-2
divide) we need a staggering *TWENTY FOUR* Function Units (src/result
latch sets, plus 24 rows in the Dependency Matrices), just to keep track
of all the (potential) results.

bottom line: saying that Radix-8 is high priority would be an
understatement :) FP64 is just not a high priority (no need to be
concerned about stalling there: we can even use the FSM version); it's
FP32 that's the biggest issue, due to needing one FPDIV per pixel on 3D.
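the arithmetic behind the FU count can be sketched in a few lines: one
in-flight result per pipeline stage means the number of Function Units
needed for full occupancy tracks the stage count (back-of-envelope only,
for a 24-bit FP32 mantissa):

```python
# rough Function Unit budget: one in-flight result per pipeline stage,
# so FUs needed for 100% occupancy ~= number of stages.
import math

def stages(result_bits, radix):
    bits_per_stage = int(math.log2(radix))  # quotient bits retired per stage
    return math.ceil(result_bits / bits_per_stage)

for radix in (2, 4, 8):
    print(radix, stages(24, radix))  # 2 -> 24, 4 -> 12, 8 -> 8
```

radix-2 gives exactly the 24-stage (24-FU) case above; radix-8 cuts it to 8.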
All the short variable names need at least a comment at the definition with the full name, otherwise it's quite hard to understand at first. For instance, in FPRoundData, self.oz is very difficult to guess at first. In my opinion, short variable names (fragments of words) are only appropriate for widely accepted names such as i, j, k for iteration, or widely known abbreviations, such as num instead of number, or for cases where the definition is immediately adjacent to all uses. out_do_z should be out_do_zero, or, better yet, result_is_zero. FPRoundData.z is totally unintelligible (it's not zero), I would name it base or have FPRoundData extend FPNumBaseRecord. There comes a certain point where the additional readability from the reduction in lines of code is entirely negated by the lack of comprehensible names.
http://bugs.libre-riscv.org/attachment.cgi?id=17

radix-8 is on page 40, section 3.3.4, figure 3.20 on p41. what's nice is:
he really does a good job of explaining. figure 3.21 explains really
clearly how a 3:2 CSA is implemented. figure 3.15 (p36) explains how OTFC
(on-the-fly conversion) works. r, btw, is the radix (base) - eq 3.15,
3.16: took a looong time to work this out. also, the VHDL is in the
appendix.
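the recurrence behind those figures can be sketched with the simplest
member of the digit-recurrence family: radix-2 non-restoring division with
quotient digits {-1, +1}, plus OTFC keeping Q and QM (Q minus one) so the
signed-digit quotient becomes binary as each digit arrives. this is a
hedged reconstruction, not the thesis code; the radix-8 version retires 3
bits per step via a CSA tree instead of 1:

```python
# radix-2 non-restoring division (digits {-1,+1}) with on-the-fly
# conversion (OTFC): Q and QM (= Q-1) are maintained so no final
# carry-propagate add is needed to convert the signed-digit quotient.
def nonrestoring_div(n, d, bits):
    r = 0           # partial remainder
    q = qm = 0      # OTFC registers: Q and Q-1
    for i in reversed(range(bits)):
        r = 2 * r + ((n >> i) & 1)  # shift in next dividend bit
        if r >= 0:                  # quotient digit +1
            r -= d
            q, qm = (q << 1) | 1, q << 1
        else:                       # quotient digit -1
            r += d
            q, qm = (qm << 1) | 1, qm << 1
    if r < 0:                       # final remainder correction
        r += d
        q = qm
    return q, r

print(nonrestoring_div(100, 7, 8))  # (14, 2)
```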
(In reply to Jacob Lifshay from comment #7) > There comes a certain point where the additional readability from the > reduction in lines of code is entirely negated by the lack of comprehensible > names. yehh i took shortcuts, and having five 80x65 xterms to jam as much information on-screen as possible is an absolute necessity for a multi-class modular hierarchy: far more important than name length. long names results in extension of functions beyond reasonable limits because arguments end up having to span multiple lines. that in turn often pushes a function well beyond what can fit (*vertically*) onto a single page, requiring multiple page-up page-down page-up page-down page-up page-down actions in *multiple windows* in order to work out what the hell's calling what calling what. plus, the original code from jon dawson was so short that it was pretty obvious. a, b, z. a is operand1, b is operand2, z is clearly the result. functions (actually, properties) in FPBase definitely got named fully after their purpose. so, z is often actually a module (an Elaboratable), which then has properties "is_zero", "is_nan" and so on. basically, i'm aware there's a lot of clean-up and code-morphing that still needs to be done. the challenge was in keeping the unit tests running at all times, as debugging is, to be frank, hell. introducing even a single error is a bitch to track down: there's one now which was detected from FP16 MUL. oh, btw, FP16 DIV is actually a higher priority for dev / debugging purposes than FP32. the reason is that the code-coverage from random result generation hits corner-cases far more often. the rounding error introduced last week into FPMUL would have been flat-out statistically impossible to encounter on FP32 because there was no way that random result generation would have created the case. 
in fact there is no reason why, with FP16, we should not consider doing a full (complete) coverage of 0x0000-0xffff for both src1 and src2, particularly on FPGAs or when converted to iverilog.
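a sketch of what that enumeration looks like from the software side: every
16-bit pattern decodes via Python struct's half-precision 'e' format (3.6+).
a real run would loop both src1 and src2 (2**32 cases) through the unit
under test; here we just classify the NaN patterns as a demonstration:

```python
# exhaustive FP16 sweep skeleton: decode all 2**16 bit patterns.
import struct

def fp16_decode(bits):
    """decode a 16-bit IEEE754 pattern to a Python float."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

nans = [b for b in range(0x10000)
        if fp16_decode(b) != fp16_decode(b)]  # NaN compares unequal to itself
print(hex(nans[0]), len(nans))  # 0x7c01 2046
```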
https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=4d0caba0e95751f05690323fbe25fd0286cea40f

+        # NOTE: difference between z and oz is that oz is created by
+        # special-cases module(s) and will propagate, along with its
+        # "bypass" signal out_do_z, through the pipeline, *disabling*
+        # all processing of all subsequent stages.
+        self.a = FPNumBaseRecord(width, m_extra)   # operand a
+        self.b = FPNumBaseRecord(width, m_extra)   # operand b
+        self.z = FPNumBaseRecord(width, False)     # denormed result
+        self.oz = Signal(width, reset_less=True)   # "finished" (bypass) result
+        self.out_do_z = Signal(reset_less=True)    # "bypass" enabled
+        self.mid = Signal(id_wid, reset_less=True) # multiplexer ID

diff --git a/src/ieee754/fpcommon/roundz.py b/src/ieee754/fpcommon/roundz.py
index db4482b..36253d3 100644 (file)
--- a/src/ieee754/fpcommon/roundz.py
+++ b/src/ieee754/fpcommon/roundz.py
@@ -14,9 +14,10 @@ class FPRoundData:

     def __init__(self, width, id_wid):
         self.z = FPNumBaseRecord(width, False)
+        self.mid = Signal(id_wid, reset_less=True)  # multiplex ID
+        # pipeline bypass [data comes from specialcases]
         self.out_do_z = Signal(reset_less=True)
         self.oz = Signal(width, reset_less=True)
-        self.mid = Signal(id_wid, reset_less=True)

     def eq(self, i):
         return [self.z.eq(i.z), self.out_do_z.eq(i.out_do_z), self.oz.eq(i.oz),
@@ -47,7 +48,7 @@ class FPRoundMod(Elaboratable):

     def elaborate(self, platform):
         m = Module()
         m.d.comb += self.out_z.eq(self.i)  # copies mid, z, out_do_z
-        with m.If(~self.i.out_do_z):
+        with m.If(~self.i.out_do_z):  # bypass wasn't enabled
             with m.If(self.i.roundz):
                 m.d.comb += self.out_z.z.m.eq(self.i.z.m + 1)  # mantissa up
                 with m.If(self.i.z.m == self.i.z.m1s):  # all 1s
radix-8 3-2 CSA: figure 3.21 from p42:

    def csa(input, mux):
        # full adder (3:2 compressor) from two half-adders:
        # sum is the second HA's sum, carry is the OR of both HA carries
        x, c = halfadd(input[0], input[1])
        x1, c1 = halfadd(x, mux)
        return [x1, c | c1]

    def csa32(input, mux):  # 2-bit input, 2-bit mux
        x = csa(input, mux[0])
        return csa(x, mux[1])
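a runnable version of that snippet, with halfadd filled in and the
standard full-adder combination (the figure transcription above looked
garbled, so this is the textbook construction, checked against plain
addition):

```python
# 3:2 carry-save compressor (a full adder) from two half-adders.
def halfadd(a, b):
    return a ^ b, a & b  # (sum, carry)

def csa(bits, cin):
    s0, c0 = halfadd(bits[0], bits[1])
    s1, c1 = halfadd(s0, cin)
    return [s1, c0 | c1]  # [sum, carry]

# exhaustive truth-table check against plain addition
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cy = csa([a, b], c)
            assert (cy << 1) | s == a + b + c
print("3:2 CSA truth-table ok")
```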
figure 3.18 p40 ep08_15

TODO: edit/sort this, can't work out mux31 from figure 3.13, p34

    # input = [a, b, c], sel = [10, 00, 01] to represent {-d, 0, +d}
    def mux31(input, sel):
        if sel == 10:
            return input[0]  # -d
        ...

    def mux51(input, sel):  # input = [a, b, c, d, e]
        x = mux31(input[:3], sel[0])
        return mux31([x] + input[3:], sel[1])
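a hedged reconstruction of the cascaded mux structure, runnable for
experimentation. the 2-bit select encodings (0b10/0b00/0b01 for
{-d, 0, +d}) are assumed from the inline comment only, not confirmed
against figure 3.13:

```python
# 3:1 mux with an assumed 2-bit select encoding, and a 5:1 mux built
# from two of them (cascaded as in the TODO snippet above).
def mux31(inputs, sel):
    return {0b10: inputs[0],       # -d
            0b00: inputs[1],       # 0
            0b01: inputs[2]}[sel]  # +d

def mux51(inputs, sel):
    # two cascaded 3:1 muxes cover five inputs
    x = mux31(inputs[:3], sel[0])
    return mux31([x] + inputs[3:], sel[1])

print(mux51(['a', 'b', 'c', 'd', 'e'], [0b00, 0b10]))  # b
```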
Just realised that FPDIVStages is a single combinatorial stage: can't do
that.

Tomorrow I will do pipeline.py pipe1,2,3 as pipesc, pipe1,2,3,4, pipepost.
The first one, pipe1, will need a tiny bit of conversion at the front; the
last one, pipe4, again conversion at the end.

All of pipe1 to 6 will need to be radix-8 (three bits at a time), with 2
StageChained radix-8 blocks to give 6 bits at a time. That gives only 4
stages, and we stand a chance of the pipeline not being insanely long.

6 stages is still one hell of a lot, because it will need 6 FUs at the
Matrix to keep 100% throughput. With 128 registers and likely something
around 30 FUs we could be looking at a quarter of a million gates just for
the Dependency Matrices. Getting that number down is really critical, so
the fewer stages in FPDIV FP32 the better.
I did say I would work on the div/rem/fdiv/fsqrt/frsqrt pipeline...
(In reply to Luke Kenneth Casson Leighton from comment #13)
> Just realised that FPDIVStages is a single combinatorial stage, can't do
> that.
>
> Tomorrow I will do pipeline.py pipe1,2,3 as pipesc, pipe1,2,3,4, pipepost
>
> The first one pipe1 will need a tiny bit of conversion at the front.
>
> The last one, pipe4, again conversion at the end.
>
> All of pipe1 to 6 will need to be radix8 (three bits at a time) and 2
> StageChained radix8 to give 6 bits at a time.
>
> That gives only 4 stages, and we stand a chance of the pipeline not being
> insanely long.
>
> 6 stages is still one hell of a lot because it will need 6 FUs at the Matrix
> to keep 100% throughput.
>
> With 128 registers and likely something around 30 FUs we could be looking at
> a quarter million gates just for the Dependency Matrices.

with 128 registers, we really should binary encode the register portion of
the dependency matrix, since if we can take less than 256 inverters (128
AND gates) per FU, then we save gates, and that allows us to have more FUs.

> Getting that number down is really critical.
>
> So the less stages in FPDIV FP32 the better.

it would probably be 11 or 12 stages, since a 32-bit integer division
takes ceil(32/3) == 11 stages for radix-8 division. 32-bit fp operations
would be able to early-exit 1-2 stages earlier, since they only need 27
bits of result (24-bit mantissa + guard/round/sticky).

it might be possible to switch to radix 16 (allowing 8 or 9 stages), but
I'd be worried about excessive area (more than the full 4x32 multiplier
alu) and/or excessive per-stage delay, reducing max clock rate.

I'm thinking it'll be a good idea to allow the pipeline to compute
2x16-bit, 1x16-bit + 2x8-bit, and 4x8-bit operations per clock. I was also
thinking widening the pipeline to 64-bit would allow 64-bit operations to
be computed in 2 passes through the pipeline, 2x32-bit operations in 1
pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned combinations like
the multiplier.
Widening would double the area, but wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k gates for the 64-bit wide version, taking about the same area as the 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.
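the stage-count arithmetic above can be double-checked directly (the
27-bit FP32 result width is taken from the comment itself):

```python
# radix-8 stage counts: full 32-bit integer result vs FP32 early-exit.
import math

int_stages = math.ceil(32 / 3)       # 32-bit integer division
fp_stages = math.ceil((24 + 3) / 3)  # 24-bit mantissa + guard/round/sticky
print(int_stages, fp_stages)  # 11 9
```

so FP32 can indeed exit 2 stages earlier than the full integer case.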
one other thing: the pipeline code needs to support one class having
multiple stages inside it, since the multiplier will go inside one class.
additionally, parts of the multiplier will be used in the divider pipeline.
(In reply to Jacob Lifshay from comment #16)
> one other thing, the pipeline code needs to support one class having
> multiple stages inside it,

if you can wrap them with a StageChain (creating a separate class that has
multiple stages, because *StageChain* separates the multiple stages), that
would be better. however it is not strictly necessary to do that: there's
nothing to stop an individual stage having 2+ clock latency. the only
thing is: anything that is not fully combinatorial *cannot* then later be
wrapped with StageChain; it *has* to be embedded in a full pipe class.
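a toy model of the StageChain idea (names assumed, not the real
nmigen-based API): purely combinatorial stages behave like functions from
ispec to ospec, so chaining them is just function composition, which is
exactly why a stage with internal clock latency can't be wrapped this way:

```python
# StageChain as function composition (hypothetical mini-model).
from functools import reduce

def stage_chain(stages):
    return lambda x: reduce(lambda v, stage: stage(v), stages, x)

double = lambda x: x * 2
inc = lambda x: x + 1
chained = stage_chain([double, inc, double])
print(chained(3))  # (3*2 + 1) * 2 = 14
```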
(In reply to Jacob Lifshay from comment #14) > I did say I would work on the div/rem/fdiv/fsqrt/frsqrt pipeline... i know - i'm pointing you in the direction of the correct use of the pipeline API, to save time.
(In reply to Jacob Lifshay from comment #15)
> > With 128 registers and likely something around 30 FUs we could be looking at
> > a quarter million gates just for the Dependency Matrices.
> with 128 registers, we really should binary encode the register portion of
> the dependency matrix,

deep breath: we can't. the implication of moving to a binary encoding is
that transitive register dependency relationship expression, absolutely
critical for simple multi-issue, would be destroyed.

mitch alsup explained that in a multi-issue system, all that is needed is
to *accumulate* the read/write dependency relationships from the previous
instruction. this *requires* a unary register format, because now, rather
than just one bit being set, it is *multiple* bits being set.

there is another way: a lookup table that goes between the "real"
registers and the DM. the only problem is, stall/freeze is needed on issue
if the numbers are too small.

> > Getting that number down is really critical.
> >
> > So the less stages in FPDIV FP32 the better.
>
> it would probably be 11 or 12 stages, since a 32-bit integer division takes
> ceil(32/3) == 11 stages for radix 8 division.

i cannot emphasise enough how insane it is to have 12 Function Units
dedicated to one task (even if they're capable of doing more than one
job). do we *really* need 32-bit integer DIV as a high priority? i.e. is
32-bit DIV used anywhere in the FPU code?

> 32-bit fp operations would be able to early exit 1-2 stages earlier, since
> they only need 27 bits of result (24-bit mantissa + guard/round/sticky)

this is an optimisation. can we please take sharing of INT/FP pipeline
capabilities off the table for this budget-limited task, and move it to
the (entirely separate) proposal, which will have a lot more money (and
time) to consider serious optimisations.
> it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd
> be worried about excessive area (more than the full 4x32 multiplier alu)
> and/or excessive per-stage delay, reducing max clock rate.

it's why i suggested two combinatorially-linked (StageChained) radix-8
blocks.

> I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit,
> 1x16-bit + 2x8-bit, and 4x8-bit operations per clock.

if it's as complex and as time-consuming as the MUL to develop, please
create a separate bugreport, mark it as "TODO later", and we can put it
into the "FP optimisation and formal proof" proposal. it's really
important to keep this *real* simple, get a "first version" done, and
*schedule* optimisations for later. planning ahead is fine (creating
classes that can be morphed or used later), but not if doing so interferes
drastically with getting this done *quickly*.

> I was also thinking widening the pipeline to 64-bit wide would allow 64-bit
> operations to be computed in 2 passes through the pipeline, 2x32-bit
> operations in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned
> combinations like the multiplier. Widening would double the area, but
> wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k
> gates for the 64-bit wide version, taking about the same area as the
> 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.

if the classes can be done parameterised so that something may be
constructed later, under a separate funding proposal, great.
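the unary-accumulation argument above can be sketched in a few lines
(hypothetical mini-model, not the actual Dependency Matrix code): with one
bit per register, the read/write relationships of several in-flight
instructions OR together into one mask, which a binary register number
(able to name only one register at a time) cannot express:

```python
# unary register masks: one bit per register, OR-accumulated per issue.
def unary(reg):
    return 1 << reg  # 128-bit mask for 128 registers

# tiny instruction stream: (dest, srcs).  the second instruction reads
# r3, which the first writes: a read-after-write dependency.
insns = [(3, [1, 2]), (5, [3, 4])]

write_accum = 0   # accumulated write bits from earlier instructions
hazards = []
for dest, srcs in insns:
    read_mask = 0
    for s in srcs:
        read_mask |= unary(s)
    hazards.append(bool(read_mask & write_accum))  # overlap = hazard
    write_accum |= unary(dest)

print(hazards)  # [False, True]
```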
this is what the stack looks like. it looks complicated: it's not. the
work's "already done". the entry-points for the INT-based DIV unit are in
div0.py, div1.py and div2.py. the only reason for the existence of those
three classes is purely to deal with the fact that the data has to be
converted on exit from the pre-normalisation phase into the Q/REM
DIV-chain, and also converted on entry to the *post*-normalisation phase.

div0.py deals with the "setup"
div1.py deals with one Q/Rem "step"
div2.py deals with the "exit"

if you can get an INT DIV radix-2 (or 4, or 8) unit up and running in the
next day or so, with appropriate parameterised classes that do setup, step
and exit, i can shoe-horn them in and we can see how it goes.

    scnorm   - FPDIVSpecialCasesDeNorm
               ispec FPADDBaseData
               ospec FPSCData
               StageChain: FPDIVSpecialCasesMod, FPAddDeNormMod

    pipediv0 - FPDivStages(start=True)
               ispec FPSCData
               ospec FPDivStage0Data
               StageChain: FPDivStage0Mod, FPDivStage1Mod, ... FPDivStage1Mod

    pipediv1 - FPDivStages()
               ispec FPDivStage0Data
               ospec FPDivStage0Data
               StageChain: FPDivStage1Mod, ... FPDivStage1Mod
    ...
    ...
    pipediv5 - FPDivStages(end=True)
               ispec FPDivStage0Data
               ospec FPAddStage1Data
               StageChain: FPDivStage1Mod, ... FPDivStage1Mod, FPDivStage2Mod

    normpack - FPNormToPack
               ispec FPAddStage1Data
               ospec FPPackData
               StageChain: Norm1ModSingle, RoundMod, CorrectionsMod, PackMod
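a hedged plain-Python model of that setup/step/exit split (the function
names here are illustrative; only the roles of div0/div1/div2 come from
the comment above). radix-2 for clarity: each step retires one quotient
bit, where a real radix-8 stage would retire three:

```python
def div_setup(n, d, bits):      # div0's role: set up the Q/REM chain state
    return {'q': 0, 'r': 0, 'n': n, 'd': d, 'i': bits - 1}

def div_step(s):                # div1's role: one Q/Rem "step"
    r = (s['r'] << 1) | ((s['n'] >> s['i']) & 1)  # bring down next bit
    q = s['q'] << 1
    if r >= s['d']:             # trial subtraction (restoring divide)
        r -= s['d']
        q |= 1
    return {'q': q, 'r': r, 'n': s['n'], 'd': s['d'], 'i': s['i'] - 1}

def div_exit(s):                # div2's role: unpack quotient/remainder
    return s['q'], s['r']

state = div_setup(100, 7, 8)
for _ in range(8):              # the chain of div1 "stages"
    state = div_step(state)
print(div_exit(state))  # (14, 2)
```

each step is a pure input-to-output function, which is what lets the steps
be StageChained into pipeline stages in whatever grouping suits the radix.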
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/001948.html
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/002154.html

almost there; tests currently running. FP64 is alarmingly large and will
need to be split (2 phases). however, FP16/32/64 all "work".