99 – IEEE754 *pipelined* FPDIV unit needed

Bug 99 - IEEE754 *pipelined* FPDIV unit needed

Summary: IEEE754 *pipelined* FPDIV unit needed

Status:	PAYMENTPENDING FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:

Depends on:
Blocks:	48
	Show dependency tree / graph

Reported:	2019-06-24 14:13 BST by Luke Kenneth Casson Leighton
Modified:	2020-09-21 02:59 BST (History)
CC List:	2 users (show)

See Also:
NLnet milestone:	NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks:	1000
budget (EUR) for this task, excluding subtasks' budget:	1000
parent task for budget allocation:	48
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	# guessed date for lkcl lkcl={amount=400,paid=2019-08-10} programmerjake={amount=600,paid=2019-08-16}

Attachments
thesis with some pipelined divide and other algorithms (768.06 KB, application/pdf) 2019-06-24 14:14 BST, Luke Kenneth Casson Leighton	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2019-06-24 14:13:11 BST

a pipelined FPDIV unit is needed (as opposed to a blocking, FSM
version)

Comment 1 Luke Kenneth Casson Leighton 2019-06-24 14:13:59 BST

https://github.com/wmy367/Radix-2-division/blob/master/Radix_2_div.v

example radix 2 divide which could probably be pipelined.

Comment 2 Luke Kenneth Casson Leighton 2019-06-24 14:14:57 BST

Created attachment 17 [details]
thesis with some pipelined divide and other algorithms

Comment 3 Jacob Lifshay 2019-06-24 14:17:13 BST

pipelined divide can probably share most of the logic with pipelined sqrt/rsqrt

Comment 4 Luke Kenneth Casson Leighton 2019-06-28 06:48:30 BST

https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=2789fb65d1d70e70d45c0e2506dfc18d13939bb1

src/ieee754/fpdiv/divstages.py done
test_fpdiv_pipe.py done
pipeline.py done

all cookie-cut.

div0.py is an "example" class, again cookie-cut.  FPDIVStage0Data
is where the incoming data (from pre-normalisation) gets turned
into "first stage stuff that contains intermediary results
such as Q and R".

Comment 5 Luke Kenneth Casson Leighton 2019-06-28 07:20:23 BST

btw you'll see these puzzling things in each stage:

   with m.If(~self.i.out_do_z):

these are activated back in the special-cases:

        # if b is zero return Inf
        with m.Elif(b1.is_zero):
            m.d.comb += self.o.out_do_z.eq(1)
            m.d.comb += self.o.z.inf(sabx)

such that the result (which is already valid, as a special-case)
can propagate down the (entire) pipeline, unmodified.

this because i haven't worked out early-out bypassing, yet.

hmm must write a separate bugreport as we need cancellation (go_die).

Comment 6 Luke Kenneth Casson Leighton 2019-06-28 08:00:39 BST

we need to keep the FPDIV pipeline stage as short as possible:
RADIX-8 would be the preferred option.

this to reduce the number of Reservation Stations (Function Units)
needed to keep a Concurrent Computation Unit (aka "pipeline") fully
occupied and not cause freezing of the entire issue stage.  see here
for details:

http://bugs.libre-riscv.org/show_bug.cgi?id=101#c2

basically if we have a 24-stage pipeline (not unreasonable for
radix-2 divide) we need a staggering *TWENTY FOUR* Function Units
(src/result latch sets, plus 24 rows in the Dependency Matrices),
just to keep track of all the (potential) results.

bottom line: Radix-8 is, well, pretty high priority would be an
understatement :)

FP64 is just not a high priority (no need to be concerned about stalling,
there, we can even use the FSM version), it's FP32 that's the biggest issue
due to needing one FPDIV per pixel on 3D.

Comment 7 Jacob Lifshay 2019-06-28 08:20:04 BST

All the short variable names need at least a comment at the definition with the full name, otherwise it's quite hard to understand at first. For instance, in FPRoundData, self.oz is very difficult to guess at first.
In my opinion, short variable names (fragments of words) are only appropriate for widely accepted names such as i, j, k for iteration, or widely known abbreviations, such as num instead of number, or for cases where the definition is immediately adjacent to all uses.
out_do_z should be out_do_zero, or, better yet, result_is_zero. FPRoundData.z is totally unintelligible (it's not zero), I would name it base or have FPRoundData extend FPNumBaseRecord.

There comes a certain point where the additional readability from the reduction in lines of code is entirely negated by the lack of comprehensible names.

Comment 8 Luke Kenneth Casson Leighton 2019-06-28 08:23:17 BST

http://bugs.libre-riscv.org/attachment.cgi?id=17

radix-8 is on page 40, section 3.3.4, figure 3.20 on p41.

what's nice is: he really does a good job explaining.
figure 3.21 explains really clearly how a 3:2 CSA
is implemented.  figure 3.15 (p36) explains how
OTFC works.

r btw is the radix (base) - eq 3.15, 3.16.  took a
looong time to work this out.  

also, the VHDL is in the appendix.

Comment 9 Luke Kenneth Casson Leighton 2019-06-28 08:37:40 BST

(In reply to Jacob Lifshay from comment #7)

> There comes a certain point where the additional readability from the
> reduction in lines of code is entirely negated by the lack of comprehensible
> names.

yehh i took shortcuts, and having five 80x65 xterms to jam as much information
on-screen as possible is an absolute necessity for a multi-class modular
hierarchy: far more important than name length.  long names results in
extension of functions beyond reasonable limits because arguments end up
having to span multiple lines.  that in turn often pushes a function well beyond
what can fit (*vertically*) onto a single page, requiring multiple page-up
page-down page-up page-down page-up page-down actions in *multiple windows*
in order to work out what the hell's calling what calling what.

plus, the original code from jon dawson was so short that it was pretty obvious.  
a, b, z.  a is operand1, b is operand2, z is clearly the result.

functions (actually, properties) in FPBase definitely got named fully
after their purpose.

so, z is often actually a module (an Elaboratable), which then has properties
"is_zero", "is_nan" and so on.

basically, i'm aware there's a lot of clean-up and code-morphing that still
needs to be done.  the challenge was in keeping the unit tests running at
all times, as debugging is, to be frank, hell.  introducing even a single
error is a bitch to track down: there's one now which was detected from
FP16 MUL.


oh, btw, FP16 DIV is actually a higher priority for dev / debugging purposes
than FP32.  the reason is that the code-coverage from random result generation
hits corner-cases far more often.

the rounding error introduced last week into FPMUL would have been flat-out
statistically impossible to encounter on FP32 because there was no way that
random result generation would have created the case.

in fact there is no reason why, with FP16, we should not consider doing
a full (complete) coverage of 0x0000-0xffff for both src1 and src2,
particularly on FPGAs or when converted to iverilog.

Comment 10 Luke Kenneth Casson Leighton 2019-06-28 08:44:46 BST

https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=4d0caba0e95751f05690323fbe25fd0286cea40f

+        # NOTE: difference between z and oz is that oz is created by
+        # special-cases module(s) and will propagate, along with its
+        # "bypass" signal out_do_z, through the pipeline, *disabling*
+        # all processing of all subsequent stages.
+        self.a = FPNumBaseRecord(width, m_extra)   # operand a
+        self.b = FPNumBaseRecord(width, m_extra)   # operand b
+        self.z = FPNumBaseRecord(width, False)     # denormed result 
+        self.oz = Signal(width, reset_less=True)   # "finished" (bypass) result
+        self.out_do_z = Signal(reset_less=True)    # "bypass" enabled
+        self.mid = Signal(id_wid, reset_less=True) # multiplexer ID


diff --git a/src/ieee754/fpcommon/roundz.py b/src/ieee754/fpcommon/roundz.py
index db4482b..36253d3 100644 (file)
--- a/src/ieee754/fpcommon/roundz.py
+++ b/src/ieee754/fpcommon/roundz.py
@@ -14,9 +14,10 @@ class FPRoundData:
 
     def __init__(self, width, id_wid):
         self.z = FPNumBaseRecord(width, False)
+        self.mid = Signal(id_wid, reset_less=True) # multiplex ID
+        # pipeline bypass [data comes from specialcases]
         self.out_do_z = Signal(reset_less=True)
         self.oz = Signal(width, reset_less=True)
-        self.mid = Signal(id_wid, reset_less=True)
 
     def eq(self, i):
         return [self.z.eq(i.z), self.out_do_z.eq(i.out_do_z), self.oz.eq(i.oz),
@@ -47,7 +48,7 @@ class FPRoundMod(Elaboratable):
     def elaborate(self, platform):
         m = Module()
         m.d.comb += self.out_z.eq(self.i) # copies mid, z, out_do_z
-        with m.If(~self.i.out_do_z):
+        with m.If(~self.i.out_do_z): # bypass wasn't enabled
             with m.If(self.i.roundz):
                 m.d.comb += self.out_z.z.m.eq(self.i.z.m + 1) # mantissa up
                 with m.If(self.i.z.m == self.i.z.m1s): # all 1s

Comment 11 Luke Kenneth Casson Leighton 2019-06-28 17:02:24 BST

radix-8 3-2 CSA:

figure 3.21 from p42:

def csa(input, mux):
  x, c = halfadd(input[0], input[1])
  x1, c1 = halfadd(c, mux)
  return [x & x1, c]

def csa32(input, mux): # 2 bit input, 2 bit mux)
  x = csa(input, mux[0])
  return csa(x, mux[1])

Comment 12 Luke Kenneth Casson Leighton 2019-06-28 17:22:46 BST

figure 3.18 p40 ep08_15
TODO: edit/sort this, can't work out mux31 from figure 3.13, p34

# input = [a, b, c], sel = [10, 00, 01] to represent {-d, 0, +d}
def mux31(input, sel):
   if sel == 10:
       return a

def mux51(input, sel): # input = [a,b,c,d,e]
   x = mux31(input[:3], sel[0])
   return mux31([x] + input[3:], sel[1])

Comment 13 Luke Kenneth Casson Leighton 2019-06-28 21:20:27 BST

Just realised that FPDIVStages is a single combinatorial stage, can't do that.

Tomorrow I will do pipeline.py pipe1,2,3 as pipesc, pipe1,2,3,4, pipepost

The first one pipe1 will need a tiny bit of conversion at the front.

The last one, pipe4, again conversion at the end.

All of pipe1 to 6 will need to be radix8 (three bits at a time) and 2 StageChained radix8 to give 6 bits at a time.

That gives only 4 stages, and we stand a chance of the pipeline not being insanely long.

6 stages is still one hell of a lot because it will need 6 FUs at the Matrix to keep 100% throughput.

With 128 registers and likely something around 30 FUs we could be looking at a quarter million gates just for the Dependency Matrices.

Getting that number down is really critical.

So the less stages in FPDIV FP32 the better.

Comment 14 Jacob Lifshay 2019-06-28 22:07:08 BST

I did say I would work on the div/rem/fdiv/fsqrt/frsqrt pipeline...

Comment 15 Jacob Lifshay 2019-06-28 22:43:48 BST

(In reply to Luke Kenneth Casson Leighton from comment #13)
> Just realised that FPDIVStages is a single combinatorial stage, can't do
> that.
> 
> Tomorrow I will do pipeline.py pipe1,2,3 as pipesc, pipe1,2,3,4, pipepost
> 
> The first one pipe1 will need a tiny bit of conversion at the front.
> 
> The last one, pipe4, again conversion at the end.
> 
> All of pipe1 to 6 will need to be radix8 (three bits at a time) and 2
> StageChained radix8 to give 6 bits at a time.
> 
> That gives only 4 stages, and we stand a chance of the pipeline not being
> insanely long.
> 
> 6 stages is still one hell of a lot because it will need 6 FUs at the Matrix
> to keep 100% throughput.
> 
> With 128 registers and likely something around 30 FUs we could be looking at
> a quarter million gates just for the Dependency Matrices.
with 128 registers, we really should binary encode the register portion of the dependency matrix, since if we can take less than 256 inverters (128 and gates) per FU, than we save gates and that allows us to have more FUs.
> 
> Getting that number down is really critical.
> 
> So the less stages in FPDIV FP32 the better.

it would probably be 11 or 12 stages, since a 32-bit integer division takes ceil(32/3) == 11 stages for radix 8 division.

32-bit fp operations would be able to early exit 1-2 stages earlier, since they only need 27 bits of result (24-bit mantissa + guard/round/sticky)

it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd be worried about excessive area (more than the full 4x32 multiplier alu) and/or excessive per-stage delay, reducing max clock rate.

I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit, 1x16-bit + 2x8-bit, and 4x8-bit operations per clock.

I was also thinking widening the pipeline to 64-bit wide would allow 64-bit operations to be computed in 2 passes through the pipeline, 2x32-bit operations in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned combinations like the multiplier. Widening would double the area, but wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k gates for the 64-bit wide version, taking about the same area as the 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.

Comment 16 Jacob Lifshay 2019-06-28 22:48:37 BST

one other thing, the pipeline code needs to support one class having multiple stages inside it, since the multiplier will go inside one class. additionally parts of the multiplier will be used in the divider pipeline.

Comment 17 Luke Kenneth Casson Leighton 2019-06-29 06:32:42 BST

(In reply to Jacob Lifshay from comment #16)
> one other thing, the pipeline code needs to support one class having
> multiple stages inside it,

 if you can wrap them with a StageChain and create a separate class
 that has multiple stages because *StageChain* separates the multiple
 stages, that would be better.

 however it is not strictly necessary to do that: there's nothing
 to stop an individual stage having 2+ clock latency.

 the only thing is: anything that is not full combinatorial *cannot*
 then later be wrapped with StageChain, it *has* to be embedded in
 a full pipe class.

Comment 18 Luke Kenneth Casson Leighton 2019-06-29 06:34:21 BST

(In reply to Jacob Lifshay from comment #14)
> I did say I would work on the div/rem/fdiv/fsqrt/frsqrt pipeline...

i know - i'm pointing you in the direction of the correct use of
the pipeline API, to save time.

Comment 19 Luke Kenneth Casson Leighton 2019-06-29 07:10:58 BST

(In reply to Jacob Lifshay from comment #15)

> > With 128 registers and likely something around 30 FUs we could be looking at
> > a quarter million gates just for the Dependency Matrices.
> with 128 registers, we really should binary encode the register portion of
> the dependency matrix, 

 deep breath: we can't.  the implications of moving to a binary encoding:
 transitive register dependency relationship expression, absolutely critical
 for simple multi-issue, would be destroyed.

 mitch alsup explained that in a multi-issue system, all that is needed is
 to *accumulate* the read/write dependency relationships from the previous
 instruction.

 this *requires* an unary register format because now rather than just one
 bit being set, it is now *multiple* bits being set.

 there is another way: a lookup table that goes between the "real"
 registers and the DM.  the only problem is, stall/freeze is needed
 on issue if the numbers are too small.


> > 
> > Getting that number down is really critical.
> > 
> > So the less stages in FPDIV FP32 the better.
> 
> it would probably be 11 or 12 stages, since a 32-bit integer division takes
> ceil(32/3) == 11 stages for radix 8 division.

 i cannot emphasise enough how insane it is to have 12 Function Units
 dedicated to one task (even if they're capable of doing more than one
 job).

 do we *really* need 32-bit integer DIV as a high priority? i.e. is 32-bit
 DIV used anywhere in the FPU code?

> 32-bit fp operations would be able to early exit 1-2 stages earlier, since
> they only need 27 bits of result (24-bit mantissa + guard/round/sticky)

 this is an optimisation.

 can we please take sharing of INT/FP pipeline capabilities off the table
 for this budget-limited task and move it to the (entirely separate)
 proposal, which will have a lot more money (and time) to consider
 serious optimisations.

> it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd
> be worried about excessive area (more than the full 4x32 multiplier alu)
> and/or excessive per-stage delay, reducing max clock rate.

 it's why i suggested two combinatorially-linked (StageChained) radix-8
 blocks.
 
> I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit,
> 1x16-bit + 2x8-bit, and 4x8-bit operations per clock.

 if it's as complex and as time-consuming as the MUL to develop, please
 create a separate bugreport, mark it as "TODO later", and we can put
 it into the "FP optimisation and formal proof" proposal.

 it's really important to keep this *real* simple, get a "first version"
 done and *schedule* optimisations for later.

 planning ahead is fine, creating classes that can be morphed or used later,
 but not if doing so interferes drastically with getting this done *quickly*.


> I was also thinking widening the pipeline to 64-bit wide would allow 64-bit
> operations to be computed in 2 passes through the pipeline, 2x32-bit
> operations in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned
> combinations like the multiplier. Widening would double the area, but
> wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k
> gates for the 64-bit wide version, taking about the same area as the
> 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.

 if the classes can be done parameterised so that something may be
 constructed later, under a separate funding proposal, great.

Comment 20 Luke Kenneth Casson Leighton 2019-06-29 10:38:01 BST

this is what the stack looks like.  it looks complicated: it's not.
the work's "already done".

the entry-points for the INT-based DIV unit are in div0.py, div1.py and div2.py

the only reason for the existence of those three classes is purely to
deal with the fact that the data has to be converted on exit from the
pre-normalisation phase into the Q/REM DIV-chain, and also converted
on entry to the *post* normalisation phase.

div0.py deals with the "setup"
div1.py deals with one Q/Rem "step"
div2.py deals with the "exit"

if you can get an INT DIV radix-2 (or 4, or 8) unit up and running
in the next day or so, with appropriate parameterised classes that
do setup, step and exit, i can shoe-horn them in and we can see how
it goes.


scnorm   - FPDIVSpecialCasesDeNorm ispec FPADDBaseData  ospec FPSCData
            StageChain: FPDIVSpecialCasesMod,
                        FPAddDeNormMod

pipediv0 - FPDivStages(start=true) ispec FPSCData       ospec FPDivStage0Data
            StageChain: FPDivStage0Mod,
                        FPDivStage1Mod,
                        ...
                        FPDivStage1Mod

pipediv1 - FPDivStages()           ispec FPDivStage0Data ospec FPDivStage0Data
            StageChain: FPDivStage1Mod,
                        ...
                        FPDivStage1Mod
...
...

pipediv5 - FPDivStages(end=true    ispec FPDivStage0Data ospec FPAddStage1Data
            StageChain: FPDivStage1Mod,
                        ...
                        FPDivStage1Mod,
                        FPDivStage2Mod

normpack - FPNormToPack            ispec FPAddStage1Data ospec FPPackData
            StageChain: Norm1ModSingle,
                        RoundMod,
                        CorrectionsMod,
                        PackMod

Comment 21 Luke Kenneth Casson Leighton 2019-07-02 09:49:46 BST

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/001948.html

Comment 22 Luke Kenneth Casson Leighton 2019-07-23 11:13:46 BST

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-July/002154.html

almost there.  tests currently running.  FP64 is alarmingly large, will
need to be split (2 phases).  however, FP16/32/64 all "work".