the "trial" blocks used in the div pipelines create 2.1 megabyte VHDL files each, during layout, and there are 270 of them. this results in a jump in the size of the BLIF file from 10 megabytes to 100 megabytes, and a jump from 85,000 blocks (which coriolis2 can cope with) to 500,000 (which it cannot).
(In reply to Luke Kenneth Casson Leighton from comment #0)
> the "trial" blocks used in the div pipelines create 2.1 megabyte VHDL
> files each, during layout, and there are 270 of them. this results
> in a jump in the size of the BLIF file from 10 megabytes to 100 megabytes,
> and a jump from 85,000 blocks (which coriolis2 can cope with)
> to 500,000 (which it cannot).

yeah, I'm aware the div pipeline is currently too big. I'm planning on
rewriting it to shrink it by a large factor. For now, we can change the
number of bits calculated per pipeline stage to 1, which should reduce it
to 3/8 the current size. The downside is that it takes 64 compute stages
(32 pipeline stages), so, for now we'll just have to live with not being
able to keep the div pipeline full due to limited fu count.

To do that, set log2_radix=1 here:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/pipe_data.py;h=035290ab6dc50c99d36039b966e65bb048d9677b;hb=HEAD#l26
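for reference, a minimal sketch of what that change amounts to (the exact
names used in pipe_data.py are from memory and may not match, so treat this
as illustrative, not a verified diff):

    # hypothetical: retire 1 quotient bit per compute stage instead of 3
    core_config = DivPipeCoreConfig(bit_width=64,
                                    fract_width=64,
                                    log2_radix=1)  # was 3

with log2_radix=1 each compute stage retires a single quotient bit, hence
the 64 compute stages, but each stage needs far fewer trial blocks, which
is where the 3/8 estimate comes from.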
jacob can you please investigate this as high priority, with some urgency.

it could be as simple as not having so many Const << Const shifts and
instead actually computing them. also, where the shifts are constant, using
sig[N:] instead. however that seems unlikely, given that just running
"proc trial3" and "opt trial3" followed by "show trial3" shows that it does
pretty much exactly that conversion.

we need to find out what's going on here, and get the size of DIV down by a
factor of 10.
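for clarity, the kind of rewrite i mean (untested standalone nmigen snippet,
illustrative only):

    # a constant right-shift can be written as a slice, and a constant
    # left-shift as a concatenation with zeros: both are pure wiring, no
    # shifter logic involved (yosys normally folds the << form anyway)
    from nmigen import Module, Signal, Cat, Const

    m = Module()
    a = Signal(64)
    b = Signal(64)
    c = Signal(69)
    m.d.comb += b.eq(a[5:])                  # same as b.eq(a >> 5)
    m.d.comb += c.eq(Cat(Const(0, 5), a))    # same as c.eq(a << 5)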
(In reply to Jacob Lifshay from comment #1)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > the "trial" blocks used in the div pipelines create 2.1 megabyte VHDL
> > files each, during layout, and there are 270 of them. this results
> > in a jump in the size of the BLIF file from 10 megabytes to 100 megabytes,
> > and a jump from 85,000 blocks (which coriolis2 can cope with)
> > to 500,000 (which it cannot).
>
> yeah, I'm aware the div pipeline is currently too big. I'm planning on
> rewriting it to shrink it by a large factor. For now, we can change the
> number of bits calculated per pipeline stage to 1, which should reduce it to
> 3/8 the current size.

it really needs to be a factor of 10 reduction. i'll cancel the current
build and see what happens when radix=1.

> The downside is that takes 64 compute stages (32
> pipeline stages), so, for now we'll just have to live with not being able to
> keep the div pipeline full due to limited fu count.

that's fine - it was never going to be full anyway until we have 8-12
parallel Function Units dedicated to DIV.
(In reply to Luke Kenneth Casson Leighton from comment #0)
> the "trial" blocks used in the div pipelines create 2.1 megabyte VHDL
> files each, during layout, and there are 270 of them. this results
> in a jump in the size of the BLIF file from 10 megabytes to 100 megabytes,
> and a jump from 85,000 blocks (which coriolis2 can cope with)
> to 500,000 (which it cannot).

one additional way that the size might be unnecessarily large is if yosys
for some reason doesn't do the almost-trivial constant propagation on the
DivPipeCoreOperation, since I'm relying on that to discard the dead code for
the sqrt/rsqrt portions of the div pipe. I didn't check if yosys discards
that. It probably needs to be flattened for yosys to optimize properly.
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > the "trial" blocks used in the div pipelines create 2.1 megabyte VHDL
> > files each, during layout, and there are 270 of them. this results
> > in a jump in the size of the BLIF file from 10 megabytes to 100 megabytes,
> > and a jump from 85,000 blocks (which coriolis2 can cope with)
> > to 500,000 (which it cannot).
>
> one additional way that the size might be unnecessarily large is if yosys
> for some reason doesn't do the almost-trivial constant propagation on the
> DivPipeCoreOperation, since I'm relying on that to discard the dead code for
> the sqrt/rsqrt portions of the div pipe.

inside a module that's fine. in *between* modules it is not fine, because
yosys cannot detect those cases without flattening. so for example passing
a Const as an input to a module: that's guaranteed not to be detected.

> I didn't check if yosys discards
> that. It probably needs to be flattened for yosys to optimize properly.

which will need a customised script, as it will be the only module that
needs custom flattening.

this is the size of the trial blocks now:

    876K Jul 2 19:31 trial1_311.vst

that's down from 2 megabytes (each). the reduction has got things from
500,000 blocks down to 250,000 blocks, which is still 2.5 times larger than
anything that anyone has done with coriolis2 before.
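for the custom flattening, something along these lines would be the shape of
it (the pass names are standard yosys commands; the file and top-module
names here are made up for illustration):

    # flatten just the div core so that const-propagation can cross module
    # boundaries, then re-optimise before writing out the BLIF
    read_ilang div_core.il
    hierarchy -top div_core
    proc
    flatten
    opt
    write_blif div_core.blif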
looking at Trial i think what i will do tomorrow is make it dynamically
possible to set which DivCore ops Trial supports. i'd love to create some
classes or a table of metafunctions that split out each of the three types
of operations separately, however that could actually interfere with the
flow, if you know what i mean.

by cutting out SQRT and RSQRT for now, by dynamically dropping the switch
statements, it should be possible to cut the number of gates by a factor
of 3.

also i think having 3 sets of signals, each of which is 200 wires and only
one of which is ever in use at any one time, is excessive and wasteful: the
same signals should be used, just given different python names (for
convenience and readability)
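an untested sketch of the kind of option i mean (the "supported" parameter
and the operation enum value are assumptions, not the actual code):

    # hypothetical: DIV-only config for the tapeout; when SQRT/RSQRT are not
    # in the supported set, their switch cases are simply never generated
    config = DivPipeCoreConfig(bit_width=64, fract_width=64, log2_radix=1,
                               supported={DivPipeCoreOperation.UDivRem})

    # inside Trial / the calculate stage, only emit logic for supported ops
    for op in config.supported:
        ...  # generate the trial logic for this operation only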
>>> 434*434
188356

down from 500,000. it is going to be several hours on placement alone.

each core section also looks too large, containing an rr multiply that
is not needed. will try cutting that.
(In reply to Luke Kenneth Casson Leighton from comment #7)
> >>> 434*434
> 188356
>
> down from 500,000 it is going to be several hours on placement alone.
>
> each core section also looks too large, containing rr multiply that
> is not needed. will try cutting that.

all the multiplies should be multiplying by small constants, which should
convert to a few adds. if the div pipe is flattened, there is probably a lot
more that can be shared between all the different parts, such as every stage
multiplying the divisor by the same constants.
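to illustrate the constant-multiply point (standalone nmigen snippet, not
from the codebase):

    from nmigen import Module, Signal

    m = Module()
    divisor = Signal(64)
    d3 = Signal(66)
    d5 = Signal(67)
    # a constant multiply lowers to shift-and-add: no multiplier cell needed
    m.d.comb += d3.eq((divisor << 1) + divisor)   # divisor * 3
    m.d.comb += d5.eq((divisor << 2) + divisor)   # divisor * 5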
nope. far too big. i think we need to cut back to only using pipelines for
32 bit div operations (which will not only cut the number of stages in half,
it will cut the width of the intermediates in half as well). use a FSM for
64-bit ones (which means 128 bit intermediates).
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #7)
> > >>> 434*434
> > 188356
> >
> > down from 500,000 it is going to be several hours on placement alone.
> >
> > each core section also looks too large, containing rr multiply that
> > is not needed. will try cutting that.
>
> all the multiplies should be multiplying by small constants, which should
> convert to a few adds.

adds that are 192 bits long. this results in absolutely massive adders
by the time it is converted to gates. likewise for the trial_comparison
(the greater-than-or-equal): this results in a 450k VST file, because it is
literally around 2,000 cells to do the compare at 192 bits long.

if one of those compares can be cut out (because the PriorityEncoder
will always select at least the lowest flag) then that literally
halves the number of gates when radix=1.

can you help investigate, by using yosys, installing coriolis2 and
compiling the code, so that you can see what is going on.

you need to understand exactly what is going on, otherwise guessing what
*might* work is going to be a waste of time, and we do not have time to
waste.

you need the feedback loop which you are entirely missing at the moment
by not running the unit tests and not running the tools. this in turn
means that i have to be the one to do it, and that's simply not ok because
there is too much else to do.

> if the div pipe is flattened, their is probably a lot more that can be
> shared between all the different parts, such as every stage multiplying the
> divisor by the same constants.

constants are simply converted to pulling locally to VSS or VDD at the
point they are needed: they take up no space at all.
>>> 378*379
143262

by simply cutting out one of the comparisons, all 64 of the mid-core stages
have been cut in half: the VST file size is 250k for each of the QTY 64
core_NNN, which has saved another 50,000 cells (which is i think somewhere
around 250,000 gates).

with trial_compare_rhs being three times larger than the 64 bit numbers it
is working from, this is why the number of gates is so massive. we seriously
need to rethink this.
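for the record, the cut amounts to roughly this (illustrative nmigen only;
the signal names do not match the real code): with radix=1 there are only
two trial flags, and the PriorityEncoder always falls back to the lowest
one, so its full-width compare can simply be tied high:

    from nmigen import Module, Signal

    m = Module()
    remainder         = Signal(192)
    trial_compare_rhs = Signal(192)
    trial_flags       = Signal(2)

    # flag 0 is the guaranteed fallback: no 192-bit >= compare needed
    m.d.comb += trial_flags[0].eq(1)
    m.d.comb += trial_flags[1].eq(remainder >= trial_compare_rhs)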
I am planning on rewriting DivPipeCore to allow statically disabling some
ops (sqrt, rsqrt), it will also support looping the pipeline back around on
itself to support 32-bit ops by running the op through the pipeline twice
and 64-bit ops by running through the pipeline 4 times. It will also support
8 and 16-bit ops run through once.

That will make scheduling a little harder to implement, but should still be
quite trivial, since all that's needed is to delay issuing ops for each
cycle that something is looping back.

I think I should probably also add support for SIMD partitioning (also
statically disable-able), since it won't make it much more complex and would
easily do 2x 32-bit div every 2 cycles or 4x 16-bit div every cycle or 8x
8-bit div every cycle. The SIMD functionality would be more complex to
schedule, so would be statically disabled for the oct tapeout, though we
would still want the ability to take half as many trips through the pipeline
for 32-bit div.

If there isn't enough time, I can write a simple radix 4 or 8 fsm-based div
in probably a day or two.
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #8)
> > (In reply to Luke Kenneth Casson Leighton from comment #7)
> > > >>> 434*434
> > > 188356
> > >
> > > down from 500,000 it is going to be several hours on placement alone.
> > >
> > > each core section also looks too large, containing rr multiply that
> > > is not needed. will try cutting that.
> >
> > all the multiplies should be multiplying by small constants, which should
> > convert to a few adds.
>
> adds that are 192 bits long. this results in absolutely massive adders
> by the time it is converted to gates. likewise for the trial_comparison
> (the greater-than-equal)
>
> this results in a 450k VST file because it is literally around 2,000
> cells to do the compare @ 192 bit long
>
> if one of those compares can be cut out (because the PriorityEncoder
> will always select at least the lowest flag) then that literally
> halves the number of gates when radix=1.
>
> can you help investigate by using yosys and installing coriolis2 and
> compiling
> the code so that you can see what is going on.
>
> you need to understand exactly what is going on otherwise guessing what
> *might* work is going to be a waste of time and we do not have time to
> waste.
>
> you need the feedback loop which you are entirely missing at the moment
> by not running the unit tests

That's simply because I didn't yet get around to working on the unit tests,
I've been distracted by improving power-instruction-analyzer to allow using
the tested-to-be-correct Rust instruction models directly from our python
unit tests by adding Python bindings. I didn't push yet because I'm in the
middle of a refactor and the code doesn't yet compile.

> and not running the tools.

I've been assuming that yosys synth is pretty representative, since it
converts to all the gates that are used in the final circuit. If wiring is
taking up much more space than would be expected from gate count, I can
figure out how to install coriolis2.

> > if the div pipe is flattened, their is probably a lot more that can be
> > shared between all the different parts, such as every stage multiplying the
> > divisor by the same constants.
>
> constants are simply converted to pulling locally to VSS or VDD at the
> point they are needed: they take up no space at all.

true, except that each stage has its own instance of `divisor * N` for
example, which gets converted to some adders, rather than a multiplier and a
constant (assuming yosys isn't stupid). If that's replaced with propagating
the pre-multiplied values through each stage, it would increase the d-ff
count but reduce the adder count.

Additionally, if the wreduce yosys pass is run, it reduces the width of
arithmetic operations when it can prove that a smaller bit-width will
suffice.

DivPipeCore is really designed assuming it will be flattened and yosys will
then have a chance to const-propagate past pipeline registers and convert
multiple identical ops to use a single op.
(In reply to Jacob Lifshay from comment #12)
> I am planning on rewriting DivPipeCore to allow statically disabling some
> ops (sqrt, rsqrt),

i started on that last night: added an option to DivPipeCoreConfig to select
a subset of operations. nothing sophisticated, however it cut the gates
needed in half.

> it will also support looping the pipeline back around on
> itself to support 32-bit ops by running the op through the pipeline twice
> and 64-bit ops by running through the pipeline 4 times.

yes i like this idea. i did actually create a pipeline "loop" example,
however i ran into difficulties synchronising the incoming and outgoing
data: it ran into a combinatorial loop, and so i stopped work on it at the
time. it was designed for exactly these scenarios: feeding data back for
further processing.

the issue is that you need not just the main pipeline, you need a fan-in and
a fan-out, where the fan-out will check the loop count and send back if
non-zero, and send forward if zero. the loopback fan-in needs to be
*absolute* top priority, blocking incoming data, because otherwise the
entire pipeline must stall... or throw away the data. both are bad options,
so the loopback fan-in has to have top priority.

> It will also support
> 8 and 16-bit ops run through once. That will make scheduling a little harder
> to implement, but should still be quite trivial, since all that's needed is
> to delay issuing ops for each cycle that something is looping back.

yes, it needs proper synchronisation: a 2-to-1 at the top and a 1-to-2 at
the bottom. my thoughts are that this should be done using the
ReservationStations class, because it already exists and i know it works,
whereas the fan-in fan-out experiment failed to work properly, particularly
when mask (cancellation) is involved.

i will draw it out, give me a couple of hours. basically you have *TWO*
context IDs (ctx.mux_id - see FPPipeContext), and double the number of
ReservationStations (quadruple if you need 4x recycling). the recycling
(looping) is done by:

* connecting the CompUnit inputs to RS #1
* letting RS #1 schedule a slot to the pipeline, with "ctx.mux_id=1"
* RS #1 receives the 1st "processed" data and its output is connected
  to *RS #2* input
* RS #2 now has ctx.mux_id=2
* RS #2 puts the data into the pipeline a 2nd time, passes to RS #3
...
...
* RS #4 receives the data processed the 4th time and this is connected
  to the CompUnit result latches.

now, if we need to do 8x ReservationStations (for the large GPU) then this
becomes *32* ReservationStations where the connectivity chain is:

* src-CompUnit 0 -> RS 0 -> RS 8 -> RS 16 -> RS 24 -> dest-CompUnit 0
* src-CompUnit 1 -> RS 1 -> RS 9 -> RS 17 -> RS 25 -> dest-CompUnit 1
...
....
* src-CompUnit 7 -> RS 7 -> RS 15 ..... dest-CompUnit 7

so it's "striped" connectivity. i can do a demo / example of this, based on
one of the existing ReservationStation unit tests, very quickly.

> I think I should probably also add support for SIMD partitioning (also
> statically disable-able), since it won't make it much more complex and would
> easily do 2x 32-bit div every 2 cycles or 4x 16-bit div every cycle or 8x
> 8-bit div every cycle. The SIMD functionality would be more complex to
> schedule, so would be statically disabled for the oct tapeout, though we
> would still want the ability to take half as many trips through the pipeline
> for 32-bit div.

we're incredibly short on time. if it can't be written in a few days, we
really have to rule it out. that October 2020 deadline is a *hard* immovable
deadline.
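(as a quick aside on the striping above, here is a trivial standalone python
illustration of the numbering - not code from the repo:)

    # 8 CompUnits, 4 trips through the pipeline => 32 ReservationStations
    n_cus, n_loops = 8, 4
    for cu in range(n_cus):
        chain = [cu + loop * n_cus for loop in range(n_loops)]
        print("src-CompUnit %d -> RS %s -> dest-CompUnit %d" % (cu, chain, cu))
    # prints e.g.:
    #   src-CompUnit 0 -> RS [0, 8, 16, 24] -> dest-CompUnit 0
    #   src-CompUnit 7 -> RS [7, 15, 23, 31] -> dest-CompUnit 7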
> If there isn't enough time, I can write a simple radix 4 or 8 fsm-based div
> in probably a day or two.

can you start on that straight away: you *should* - hypothetically - be able
to use the DivPipe combinatorial blocks without modifications, by simply
setting up a register "thing" that captures the ospec() and feeds it to the
ispec() on the next clock cycle. this was the basis of the FSM support i
added (except it turned out to be too awkward to keep actively maintained).

ok, apart from passing in the shift amounts as actual Signals rather than
Consts. which really might justify a from-scratch implementation, yeah.
this is very good, and really short:

https://github.com/antonblanchard/microwatt/blob/master/divider.vhdl

the FSM should easily fit into 100 lines of code.

by the time you've got that done i should have an example "striping"
ReservationStation working. which actually we can use for general
"micro-coding" purposes (including MUL). i wasn't planning on doing one,
but... *sigh*.
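something along these lines (an untested, standalone sketch of a
1-bit-per-clock restoring divider in nmigen, loosely following the microwatt
approach; none of the names come from the soc codebase):

    from nmigen import Module, Signal, Cat, Const, Elaboratable

    class DivFSM(Elaboratable):
        # unsigned restoring divider, one quotient bit per clock:
        # ~64 cycles for a 64-bit operand, tiny gate count
        def __init__(self, width=64):
            self.width = width
            self.dividend  = Signal(width)
            self.divisor   = Signal(width)   # assumed non-zero
            self.start     = Signal()        # pulse for one cycle to begin
            self.quotient  = Signal(width)
            self.remainder = Signal(width)
            self.done      = Signal()

        def elaborate(self, platform):
            m = Module()
            w = self.width
            rem   = Signal(w + 1)            # running remainder (one spare bit)
            quot  = Signal(w)                # dividend shifts out the top,
                                             # quotient bits shift in at the bottom
            count = Signal(range(w + 1))

            # candidate remainder: shift left, bringing in the next dividend bit
            shifted = Signal(w + 1)
            m.d.comb += shifted.eq(Cat(quot[w - 1], rem[:w]))

            with m.If(self.start):
                m.d.sync += [rem.eq(0), quot.eq(self.dividend),
                             count.eq(w), self.done.eq(0)]
            with m.Elif(count != 0):
                with m.If(shifted >= self.divisor):
                    m.d.sync += [rem.eq(shifted - self.divisor),
                                 quot.eq(Cat(Const(1, 1), quot[:w - 1]))]
                with m.Else():
                    m.d.sync += [rem.eq(shifted),
                                 quot.eq(Cat(Const(0, 1), quot[:w - 1]))]
                m.d.sync += count.eq(count - 1)
                with m.If(count == 1):
                    m.d.sync += self.done.eq(1)

            # outputs are valid once done is set
            m.d.comb += [self.quotient.eq(quot),
                         self.remainder.eq(rem[:w])]
            return m

a radix 4 or 8 version would retire 2 or 3 bits per clock at the cost of the
extra trial subtractions, same trade-off as in the pipelined version.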
i think i have a solution, and it is simpler than previously imagined.

remember: there are multiple ReservationStations but only one pipeline.
there is a Priority muxer that selects one and only one RS INPUT, allowing
it to send its data into the front of the pipeline. the muxid NORMALLY
remains the same throughout the entire pipeline so that at the output stage
the result may be dropped into the correct RS Output latch.

so the modifications are as follows:

* 4x the number of RSes are created. if 8 initially, 32 are created.
* RS output 8 thru 31 are wired *directly* to RS input 8 thru 31.
  8 to 8 etc.
* one of the pipeline stages *MODIFIES* ctx.mid (the RS muxid) *DIRECTLY*
  (by adding or subtracting multiples of 8 to it). when (or if) this is
  done does not matter.
* for 8/16/32 bit operations no modification to muxid is needed.
* for 64 bit operations on the first run through, 8 is added to the muxid.
  this causes the result to be fed into RS Output 8-15 instead of 0-7.
  also the operation is marked "Phase 2".
* likewise on the 2nd and 3rd.
* for the final phase, 24 is *SUBTRACTED* from the muxid by the last
  pipeline stage.

this last action will cause the data, now having been through 4 pipelines,
to drop into RS 0-7, which is the output to the CompUnits 0-7.

so the only modifications needed to DIV are to be able to specify a range of
shift amounts rather than one static one. each time through, the pipelines
will select one of those shift amounts. this is where it gets slightly
complex, because the last thing we want is a full dynamic shifter. instead,
what might be possible is to use a shift cascade, by analysing and
subtracting the amounts between the static shifts and activating them
combinatorially depending on the loop number:

    shift1 = 5   # constant
    shift2 = 20  # constant

    therefore:

    if pipelooprun == 1 or == 2:
        out1 = in << 5
    if pipelooprun == 2:
        out2 = out1 << 15  # 20-5

etc. this assumes that the shift amounts are incrementing on each loop.
given that it is just wires (muxes) it should not cause too much gate delay,
however if we try to do radix greater than 8 it may be a problem.
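in nmigen terms, roughly (untested standalone sketch; names and widths are
made up, not from the codebase):

    from nmigen import Module, Signal, Mux

    m = Module()
    value    = Signal(64)
    loop_num = Signal(2)         # which trip through the pipeline this is
    shift1, shift2 = 5, 20       # example static shift amounts

    stage1 = Signal(64 + shift1)
    stage2 = Signal(64 + shift2)
    # cascade: each step only adds the *difference* to the previous shift,
    # so there is no full dynamic shifter, just muxes selecting wiring
    m.d.comb += stage1.eq(Mux(loop_num >= 1, value << shift1, value))
    m.d.comb += stage2.eq(Mux(loop_num >= 2, stage1 << (shift2 - shift1), stage1))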
(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > (In reply to Jacob Lifshay from comment #8)
> > > (In reply to Luke Kenneth Casson Leighton from comment #7)
> > > > >>> 434*434
> > > > 188356
> > > >
> > > > down from 500,000 it is going to be several hours on placement alone.
> > > >
> > > > each core section also looks too large, containing rr multiply that
> > > > is not needed. will try cutting that.
> > >
> > > all the multiplies should be multiplying by small constants, which should
> > > convert to a few adds.
> >
> > adds that are 192 bits long. this results in absolutely massive adders
> > by the time it is converted to gates. likewise for the trial_comparison
> > (the greater-than-equal)
> >
> > this results in a 450k VST file because it is literally around 2,000
> > cells to do the compare @ 192 bit long
> >
> > if one of those compares can be cut out (because the PriorityEncoder
> > will always select at least the lowest flag) then that literally
> > halves the number of gates when radix=1.
> >
> > can you help investigate by using yosys and installing coriolis2 and
> > compiling
> > the code so that you can see what is going on.
> >
> > you need to understand exactly what is going on otherwise guessing what
> > *might* work is going to be a waste of time and we do not have time to
> > waste.
> >
> > you need the feedback loop which you are entirely missing at the moment
> > by not running the unit tests
>
> That's simply because I didn't yet get around to working on the unit tests,

sorry, i'm a bit stressed. we still have mul to do.

> I've been distracted by improving power-instruction-analyzer to allow using
> the tested-to-be-correct Rust instruction models directly from our python
> unit tests by adding Python bindings. I didn't push yet because I'm in the
> middle of a refactor and the code doesn't yet compile.

ah ok. do keep us informed how that goes.

> I've been assuming that yosys synth is pretty representative, since it
> converts to all the gates that are used in the final circuit. If wiring is
> taking up much more space than would be expected from gate count, I can
> figure out how to install coriolis2.

it's that things are not obvious from yosys. even without letting it proceed
to routing, just the creation of the VST files (alliance takes the BLIF and
translates modules into VHDL, using a subset of its syntax) means that the
size of those VST files is a more accurate indicator of area.

> > > if the div pipe is flattened, their is probably a lot more that can be
> > > shared between all the different parts, such as every stage multiplying the
> > > divisor by the same constants.
> >
> > constants are simply converted to pulling locally to VSS or VDD at the
> > point they are needed: they take up no space at all.
>
> true, except that each stage has its own instance of `divisor * N` for
> example, which gets converted to some adders, rather than a multiplier and a
> constant (assuming yosys isn't stupid).

it's good but not perfect. it also tends to optimise for reducing latency
rather than gate count. which is why ARM has always done 2 numbers for its
cores: size optimised and speed optimised.

> If that's replaced with propagating
> the pre-multiplied values through each stage, it would increase the d-ff
> count but reduce the adder count.

with 192 bits in rr and so on, that's probably not a good idea.
> Additionally, if the wreduce yosys pass is run, it reduces the width of
> arithmetic operations when it can prove that a smaller bit-width will
> suffice.

interesting.

> DivPipeCore is really designed assuming it will be flattened and yosys will
> then have a chance to const-propagate past pipeline registers and convert
> multiple identical ops to use a single op.

yeah, we can't make that assumption. it may be necessary to do partial
layout of subblocks (each pipeline stage separately) but the way things are
done right now there is no infrastructure in place for partial flattening.

i have asked whitequark about a possible nmigen const propagation pass,
which would achieve the same thing.

jean-paul has had to put some hacks in place because there are quite a few
dangling signals, both in and out, on modules.
your pipeline loopback solution is waay more complex than I was imagining:

The pipeline would have either 7 stages (for 1x radix 3 compute stage per
pipeline stage and 1 extra stage at each end for leftovers) or about 3 or 4
(for 2x radix 3 compute stages per pipeline stage).

I think we should only have 10 or so reservation stations for the non-SIMD
case, since we don't need much more than the number of instructions that can
be simultaneously executing. 20 or 30 is excessive.

There would be a 2-bit Signal tracking the loop number -- totally separate
from anything in ctx. fiddling with the id seems like a horrible mess that
can be easily avoided. We would build custom loop header and footer pipeline
control stages, constructed together so they can send signals from footer to
header for the loop backedge. This would also be useful for fsm pipelines,
since we could implement those using only a combinatorial stage in between
the loop header and footer.

So, the pipeline would look kinda like:

RS0  RS1 ... RS9
 |    v       |
 |  +----+----+ |
 +>+Prio. Sel+<+
    +----+----+
         |
         v
    +----+----+
    |  setup  |
    +----+----+
         |
         v
    +----+----+
 +->+ loop hdr|
 |  +----+----+
 |       |
 |       v
 |  +----+----+
 |  | compute0|
 |  +----+----+
 |       |
 |       v
 |  +----+----+
 |  | compute1|
 |  +----+----+
 |       |
 |      ...
 |       |
 |       v
 |  +----+----+
 |  | compute5|
 |  +----+----+
 |       |
 |       v
 |  +----+----+
 +--+ loop ftr|
    +----+----+
         |
         v
    +----+----+
    | finish  |
    +----+----+
         |
         v
   back to RSes
(In reply to Jacob Lifshay from comment #17)
> your pipeline loopback solution is waay more complex then I was imagining:

i've been thinking about how to do this for some considerable time,
attempted what you suggest below, and found that when trying to add mask
cancellation it became so complex that i couldn't actually work out how to
do it.

mask cancellation is absolutely essential because it's how the speculative
execution gets.. well... cancelled.

also, you have to have stall propagation in the solution that you outline
below, i'll illustrate where. also, *because* of that stall propagation,
and how it interacts at the merge-points, the pipeline utilisation is
significantly sub-par.

> The pipeline would have either 7 stages (for 1x radix 3 compute stage per
> pipeline stage and 1 extra stage at each end for leftovers) or about 3 or 4
> (for 2x radix 3 compute stages per pipeline stage).
> I think we should only have 10 or so reservation stations for the non-SIMD
> case, since we don't need much more than the number of instructions that can
> be simultaneously executing. 20 or 30 is excessive.

it really isn't. remember: only 10 of those are connected to (10)
ComputationUnits, whilst the remainder *store the partial results*. this
solves the problem of stalling.

> There would be a 2-bit Signal tracking the loop number -- totally separate
> from anything in ctx. fiddling with the id seems like a horrible mess that
> can be easily avoided.

you're missing the bit where i tried already to do this (exactly as outlined
below), and tried adding in the necessary mask-cancellation, and that *was*
an awful mess. by contrast, modifying the ctx.mux_id is a very simple
solution that i will have completed in about another 30-40 minutes.

> We would build a custom loop header and footer
> pipeline control stages, where they are constructed together so can send
> signals from footer to header for the loop backedge.

yes. this is what i tried. the code already exists: MultiOutControlBase and
MultiInControlBase. or, more to the point, CombMultiOutPipeline and
CombMultiInPipeline.

> This would also be
> useful for fsm pipelines since we could implement that using only a
> combinatorial stage in between the loop header and footer.
>
> So, the pipeline would look kinda like:

nice ASCII art, btw :)

>        +----+----+
>        |  setup  |
>        +----+----+
>             |
>             v
>        +----+----+
>     +->+ loop hdr|
>     |  +----+----+

this is where the problems start - right here.

when both "setup" and "loop ftr" have data, *and* compute5 has data:

1) you can't stop compute5 from sending its data (pausing) because it's a
   pipeline. it *HAS* to send its data to "loop ftr" on the next cycle.
2) therefore you MUST - without fail - prioritise "loop ftr" on the MUX-IN
3) therefore you must THROW AWAY the data from "setup".

you see how it doesn't work? .... unless we have stall propagation right the
way through the entire pipeline.

which now, if you could *store* the partial result, somehow, in a buffer,
there would be no problem, would there? and how many would actually be
needed?

the number needed can easily be computed: it's the number of current
ReservationStations (N) because regardless of what's in-flight in the
pipelines, the RSes will *only* stop trying to push further data into the
pipeline when all ReservationStations have outstanding data.
with that being the condition when we can *guarantee* that further data will
be pushed into the pipe, that in turn means that we *need* exactly N more
buffers to store the data.

what name could we give those buffers... i know, let's call them
ReservationStationN*2!

it turns out then that the exact same purpose for which ReservationStations
0 to N-1 are used (to prevent stalling and to keep in-flight data) applies
equally well to the partial results.

however: here is another solution:

         |
         v
    +----+----+
 +->+ loop hdr|
 |  +----+----+
 |       |
FIFO     v
 |  +----+----+
 |  | compute0|
 |  +----+----+

where the size of that FIFO is capable of storing a *full* batch of partial
results. in the case where we have 10 Reservation Stations and the plan is
to have a 4-way "loop", that FIFO *must* have 30 (40 - 10) entries. anything
less than that and it is *guaranteed* that data will be lost (or stalling
will be needed).

and those extra 30 FIFO entries? they serve the exact same purpose as the
extra 30 RSes, with the advantage being that i'm almost done modifying the
code to include the feedback loop (it's taking approximately 30 lines of
code), whereas the solution using a FIFO is a lot more code (a lot more
objects).

but there's more. here's the thing, jacob: both solutions *appear* directly
equivalent to each other... but they're actually not.

the problem is that the muxid must be globally unique. we absolutely cannot
have multiple in-flight results with the exact same muxid. if we did, then
it is *guaranteed* that data corruption would occur.

and if there are say only 10 RSes, and only 10 muxids, but there are 40
possible partial in-flight pieces of data, there will be 4 with muxid=0,
4 with muxid=1 .... 4 with muxid=9. this becomes... worrying.

however what's different about those 4 with the same muxid? they have
*different* 2-bit Signal tracking IDs. you see how that's exactly the same
as adding an extra 2 bits on the muxid?

now, it doesn't actually matter if Signal tracking IDs are used or if there
are 2 extra bits on the muxid. it's perfectly fine to have both. but...
there's *still* more!

when it comes to mask cancellation, the data that's in the FIFO would need
to be destroyed. that means *intrusively* getting inside the FIFO, and
interfering with the data structure, which is *specifically* designed not to
be cancellable (it's a round-robin SRAM).

overall, then, when everything is taken into consideration, the proposal to
alter the muxid is way *way* simpler and more effective:

* pipeline: no stalling
* no stalling needed
* no interaction at the mux-in/out
* very little code needs to be written
* mask cancellation is not adversely impacted (it's already functional)
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Jacob Lifshay from comment #17)
> > your pipeline loopback solution is waay more complex then I was imagining:
>
> i've been thinking about how to do this for some considerable time, attempted
> what you suggest below, and found that when trying to add mask cancellation
> it became so complex that i couldn't actually work out how to do it.

I still think you're waay overcomplicating it. I'll post a more detailed
explanation a little later when I have time.

> mask cancellation is absolutely essential because it's how the speculative
> execution gets.. well... cancelled.

aren't the mask cancellation signals just broadcast to all pipeline stages?
all that's needed is to just mark the instruction as canceled and have the
loop footer stop looping early.

Note that the loop header and footer are special blocks (not just a mux)
that always prioritize non-canceled instructions that are looping back over
new instructions, stalling all stages before the pipeline header when an
instruction is looping back. That way, no FIFOs or any horribly complex
stuff are needed. The loop header block would have a pipeline register as
part of it.

Additional advantages of the loop header/footer are that they are
composable, you can easily build a doubly-nested pipeline loop if we ever
need that. You can think of them as the pipeline equivalent of a do-while
loop.
(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > (In reply to Jacob Lifshay from comment #17)
> > > your pipeline loopback solution is waay more complex then I was imagining:
> >
> > i've been thinking about how to do this for some considerable time, attempted
> > what you suggest below, and found that when trying to add mask cancellation
> > it became so complex that i couldn't actually work out how to do it.
>
> I still think you're waay overcomplicating it.

i'm really not, and it's really not that complicated. 5 extra main lines of
code in the base class of ReservationStations and one extra constructor
parameter.

> I'll post a more detailed
> explanation a little later when I have time.

there's no need: i completely get it, because i already implemented it last
year, and i implemented it exactly - and i mean literally exactly - as you
described in the ASCII art diagram.

it *didn't work*, jacob. a combinatorial lockup occurred under certain
conditions that i had to stop investigating because it was taking up far too
much time.

this has me seriously concerned that you will also become embroiled in
trying to work out a solution and become similarly trapped in the task...

.... when i *already have a simple solution* that only needs around 30 extra
lines of code to adapt *existing* classes for this purpose.

> > mask cancellation is absolutely essential because it's how the speculative
> > execution gets.. well... cancelled.
>
> aren't the mask cancellation signals just broadcast to all pipeline stages?

yes. except you can't broadcast data-erasure to a nmigen FIFO.

> all that's needed is to just mark the instruction as canceled

no, that does not work, if by that you mean that it should continue to
propagate through the pipelines. it *has* to be actually entirely erased
from existence throughout *all* data structures, because if that is not
done, that muxid cannot and must not be used. the confusion between a
cancelled still-in-flight muxid and a new one with the same muxid will
result in data corruption.

if by "mark" you mean "erase from existence" then yes. and that means all
register latches, all stages, all buffers, everything. it has to be _gone_,
as if it never existed, was never issued.

> and have the
> loop footer stop looping early.

i don't understand this, sorry.

> Note that the loop header and footer are special blocks (not just a mux)
> that always prioritize non-canceled instructions that are looping back over
> new instructions, stalling all stages before the pipeline header when an
> instruction is looping back.

stalling at the header is not the main problem. it _is_ a problem, but not
the main one. think it through on the case where all - and i mean all -
stages have data in them.

stalling interferes with the propagation of data at the split and join
points in ways that cause the entire pipeline to become completely
ineffective: 50% utilisation when there is 2x loopback, 33% when there is
3x loopback, 25% utilisation when there is 4x data loopback.

think of it like a traffic jam. the mux-in junctions are incapable of
allowing the flow of twice the number of "cars", because the output path is
only "1 wide". meaning it is *guaranteed* a 50% flow rate through either of
the mux-in paths. and one of those is *connected to the mux-out*. this
*guarantees* that the mux-out (loopback) path can only take data 50% of the
time. in turn that *guarantees* that the pipeline *has* to stall 50% of the
time.
it's gridlock, basically, and that i think was why the code i wrote went
into a combinatorial lock.

remember, jacob, i have *already tried implementing* EXACTLY what you are
proposing, and found that it *does not work*.

> That way, no FIFOs or any horribly complex
> stuff are needed.

i did not advocate the use of a FIFO. i only described its need to
illustrate overcoming the problem that you have not yet understood exists.

> The loop header block would have a pipeline register as
> part of it.

that just delays the inevitable stalling by one cycle only, and is
effectively equivalent to a FIFO of depth 1. a FIFO of depth 1 is not
adequate to stop the bottleneck problem when all stages have data in them.

> Additional advantages of the loop header/footer are that they are
> composable, you can easily build a doubly-nested pipeline loop if we ever
> need that.

do you mean twin parallel pipelines, such that the system can cope with
twice the throughput? or do you mean loopback three, four or eight times?
or, do you mean more complex micro-coding (such as using the int div
pipeline shared with FP)?

> You can think of them as the pipeline equivalent of a do-while loop.

sort-of, except it is more like a roundabout with cars on it, with one road
in and one road out (very close to each other but with the exit road *after*
the entry road), and every car must go round the roundabout by 360 (actually
appx 375) degrees.

in this situation, traffic flow unfortunately becomes completely ineffective
during peak hours (when the roundabout is entirely full of cars) and can
even achieve gridlock, which a roundabout is not supposed to do!
had a little more time to think.

* explicit branching will work (comment #17) and i *think* the performance
  will be the same (as RSx4) except:
  a) masking and the corresponding "stop" mask, both being a combinatorial
     broadcast signal, will result in a combinatorial loop if propagated in
     the current way
  b) the current pipelines do *not* need stalling (at all) and this would be
     the first time it's needed. that increases the register latches which
     given the width of the inputs and outputs (multiple 192 bit) is not
     insignificant

* RSx4 will also work except:
  a) when we do the "full" version it will involve 10x4 RSes which will need
     an Array selector (pmux) of several hundred wires... *times 40*. which
     is absolutely enormous. 12,000 to 16,000. this is... impractical.
  b) with the output feeding back to the input, the ispec and ospec need to
     be compatible. i.e. they need to be the same (all fields in all specs)

either way is a LOT of work and i think for now we should focus on a FSM
which can be written in a couple of days. this one is pretty obvious and
really simple:

https://github.com/antonblanchard/microwatt/blob/master/divider.vhdl

we basically are running out of time and need to focus on tasks that only
take hours or days. we can revisit this and focus on the underlying
infrastructure later.

hilariously however we may be able to use the FSM method, simply lay down
dozens of them and achieve the performance needed that way.
i was wrong about the roundabout analogy on the entry and exit point: they
do not cross. entry is *after* exit (350 degrees not 370), so 4 loops will
be 3x360 + 350.

anything incoming will be blocked by things already on the roundabout. once
one free "slot" (car) exits the roundabout, one of the incoming can use it.

we need the FSM first, however, so that we at least have something simple to
go with, and if there is time, revisit this.
darn it, i just realised that even attempting to reuse any combinatorial
stages means fully ripping out the entire pipeline infrastructure and
replacing it with a FSM skeleton. luckily i did this once already for the
jondawson IEEE754 FP nmigen conversion. that involved manually connecting up
stages by way of local instances as registers, using ispec and ospec to
create and assign them between each clock.

as the entire chain is fully defined by those ispecs and ospecs in a general
purpose opaque fashion (managed by ControlBase), it will be possible to do
an incremental conversion *of the existing code* using a straight for-loop,
then go from there.

however given the size of the DIV pipeline my feeling is it is probably best
to try MUL first. this will give an opportunity to reduce the MUL block down
from 64x64 to 32x32 and do 4 loops to create a 128 bit answer. currently
that 128x128 mul is a whopping 1.8 mm^2 so it is worth the effort to reduce.
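for reference, the arithmetic behind "32x32 and do 4 loops for a 128 bit
answer" is just the schoolbook split (quick standalone python check, not
code from the repo):

    #  a*b = (aH*2^32 + aL) * (bH*2^32 + bL)
    #      = (aH*bH << 64) + (aH*bL << 32) + (aL*bH << 32) + (aL*bL)
    # i.e. four 32x32->64 partial products, one per loop, accumulated with
    # shifts into a 128-bit result
    import random
    a, b = random.getrandbits(64), random.getrandbits(64)
    aH, aL = a >> 32, a & 0xffffffff
    bH, bL = b >> 32, b & 0xffffffff
    acc = 0
    for part, shift in ((aL * bL, 0), (aL * bH, 32), (aH * bL, 32), (aH * bH, 64)):
        acc += part << shift
    assert acc == a * b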