Bug 296 - idea: cyclic buffer between FUs and register file
Summary: idea: cyclic buffer between FUs and register file
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Mac OS
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-05-01 03:09 BST by Luke Kenneth Casson Leighton
Modified: 2020-05-03 15:07 BST
CC: 3 users

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Description Luke Kenneth Casson Leighton 2020-05-01 03:09:45 BST
the issue that we have is that multiplexing between a dozen FUs, each with multiple
read/write ports, and the available regfile read/write ports does not quite
match up.  full crossbars are completely impractical.

we therefore need something that allows FUs to receive / send values, without
a complete overload of wiring or muxes.
Comment 1 Luke Kenneth Casson Leighton 2020-05-01 03:10:01 BST
an idea for a cyclic hopper:

here is an excerpt from the comp.arch discussion:

https://groups.google.com/d/msg/comp.arch/qeMsE7UxvlI/6nvrtmBoAQAJ

> Since you separated the signals, can I suggest placing a hopper between 
> reading register specifiers out of FUs and the RF ports. Each cycle a 
> number of specifiers are dropped into the hopper, and each cycle as many 
> RF read ports as you have are read and then delivered to instructions 
> at the calculation units. Then each calculation unit picks instructions 
> into execution. 

*sigh* unfortunately we are a little bit hamstrung by having lost the register 
binary indices, using instead the Dependency Matrix unary encoding to 
directly enable the regfile row.  i mean, we _could_ use the unary encoding 
as the register specifier... 

by "hopper" do i take it that you mean a cyclic shift register where each 
bucket contains the register specifier, type of operation (R/W), and, say, the FU 
number? 

or, actually, if the register specifier is in unary... 

oo i have an idea.  it may well be the same idea. 

* for all registers being requested by all FUs, merge the unary register indices 
   into a single "these registers need to be read" vector.  i think, in fact, that 
   the Dependency Matrices might actually have this already (Read Vector 
   and Write Vector Pending) 

* for reads, throw that vector at a regfile "fetcher", which, using the read vector, 
  simply fetches as many registers as there are available regfile ports 

  let's say that there are 4R regfile ports 

* the 4 read results, along with a *single* bit unary encoding of their corresponding 
  reg number, are dropped into a cyclic 4-wide buffer (cyclic 4-wide shift reg) 

* all FUs' "GoRead1" are connected to port 1 of the cyclic buffer 
* all FUs' "GoRead2" are connected to port 2 
* all FUs' "GoRead3" are connected to port 3, and so on 

* entries *remain* in the cyclic buffer for as long as the corresponding bit in the 
   Read Vector remains set, continuing to cycle through, moving from port 1 to 
   2, 2 to 3, 3 to 4, 4 back to 1, on each clock 

* on any given clock cycle, *all* FUs may read that "broadcasted" entry from their 
   corresponding cyclic buffer "hopper", as long as the unary reg index matches up 

* when the reg index matches up, GO_RD1/2/3 is fired at that FU. 

* when the FU has read from that port, it may drop the REQ_RD1/2/3 

* this results in the (usual) dropping of the unary bit from the Read Vector back at 
   the Dependency Matrix 

* this in turn also triggers the cyclic buffer to drop the reg value, freeing up that 
   port for a refill opportunity. 
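
a quick plain-python behavioural sketch of the read-side hopper described above
(illustrative only: the ReadHopper name, the dict-based regfile and the set-based
Read Vector are stand-ins, not the actual codebase):

    # behavioural model of the 4-wide cyclic read buffer ("hopper").
    # regfile is modelled as {regnum: value}; read_vec as a set of
    # regnums still pending (the unary Read Vector).
    class ReadHopper:
        def __init__(self, regfile, num_ports=4):
            self.regfile = regfile
            self.slots = [None] * num_ports   # each slot: (regnum, value) or None

        def step(self, read_vec, fu_requests):
            """one clock cycle.
            fu_requests: list of (fu, port, regnum) REQ_RD requests.
            returns the (fu, port, regnum, value) GO_RD deliveries."""
            # entries whose Read Vector bit has cleared are dropped,
            # freeing that port up for a refill opportunity
            self.slots = [s if (s and s[0] in read_vec) else None
                          for s in self.slots]
            # the "fetcher": fill empty slots, one regfile read per free port
            in_buf = [s[0] for s in self.slots if s]
            wanted = [r for r in sorted(read_vec) if r not in in_buf]
            for i, s in enumerate(self.slots):
                if s is None and wanted:
                    r = wanted.pop(0)
                    self.slots[i] = (r, self.regfile[r])
            # broadcast: every FU's REQ_RDn watches port n; a reg-number
            # match fires GO_RDn (multiple FUs may pick up the same value)
            delivered = []
            for fu, port, regnum in fu_requests:
                s = self.slots[port]
                if s and s[0] == regnum:
                    delivered.append((fu, port, regnum, s[1]))
            # cyclic shift: port 1 -> 2 -> 3 -> 4 -> back to 1
            self.slots = [self.slots[-1]] + self.slots[:-1]
            return delivered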

it's by no means perfect: the cyclic buffer moving only one value at a time means 
that if an FU has 3 GO_RD/REQ_RD lines, it could have to wait for up to 3 cycles 
for the data to cycle round the shift-buffer... *per read-register*. 

if however a concerted effort is made to ensure that a REQ_RD1 always tries to drop 
the read-request into cyclic-buffer entry 1, REQ_RD2 always tries to drop into entry 2 
and so on, then the reads should just pass straight through. 

this does have the advantage that if there are multiple FUs waiting for the same 
register, then as a "broadcast" bus system, they can in fact get the same value 
simultaneously. 

writes are very similar, and there is the additional advantage that the written value 
can be read from the cyclic buffer by any FU that "sees" it, prior to the value landing 
at the actual regfile. 

thus the cyclic buffer also happens to serve double-duty as a forwarding bus.  or, 
more to the point, the role and distinction between "register bus" and "forwarding 
bus" is now decoupled / moot. 

was that sort-of what you meant by "hopper"? :) 

l.
Comment 2 Jacob Lifshay 2020-05-01 04:02:55 BST
Seems like a good idea; however, out-of-order (and in-order) processors depend on single-cycle forwarding between the result of one operation and the input of the next for most of their performance, so the forwarding network would need to keep that property (I haven't fully thought through whether this idea keeps it).
Comment 3 Yehowshua 2020-05-01 06:03:14 BST
We don't need a full crossbar. We can use a strong-sense non-blocking Benes network with a reconfiguration lookup table.

I can whip one up next week if you want. They use fewer gates than full crossbars...
Comment 4 Luke Kenneth Casson Leighton 2020-05-01 19:19:03 BST
i believe it may be possible to use existing components for this: the
mask-cancellable Pipeline class (where the "mask" is the GLOBAL_READ_PENDING_VEC
or GLOBAL_WRITE_PENDING_VEC) plus the multi-input pipe:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/multipipe.py;h=95279250d005ecda96545c9ac3d5aec74ef4d082;hb=22bb7e4a3ec7d5f6139ed0bd427609af33ddafb3#l255

this allows input from multiple sources, and one of those input sources would
be the *other end* of the pipeline, thus creating a cyclic ring buffer that
has two key additional abilities:

1) the ability to "pause" the data being passed down the pipeline (without
   losing any data).  this is because each stage is protected by ready/valid
   signals

2) the ability to "accept" extra data, from (multiple) other sources.

examples of (2) include an operand-forwarding capability.  this can be
implemented very simply: the output from the *write* cyclic buffer is connected
also to one of the multi-input ports on the *read* cyclic buffer!


  +------------------------------<-----------<-------+
  v                                                  ^
  |                                                  |
  +-- RD1 --> RD2 --> RD3 ---+    +-- WR1 --> WR2 ---+
  |                          |    |                  |
  ^                          v    ^                  v
  +------<-----------<-------+    +------<------<----+
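
purely to illustrate properties (1) and (2) - this is *not* the actual nmutil
multipipe API, just a hypothetical plain-python model of one ring, where the head
stage doubles as the multi-input point accepting a forwarded value from the other
ring:

    # hypothetical model of a ready/valid-protected cyclic ring.
    # each stage holds a value or None; stall=True models the whole ring
    # "pausing" without losing data; forward_in models the extra
    # multi-input port fed from the *write* ring's output.
    class Ring:
        def __init__(self, nstages):
            self.stages = [None] * nstages

        def step(self, forward_in=None, stall=False):
            """returns True if forward_in was accepted this cycle."""
            if stall:
                return False               # paused: nothing moves, nothing lost
            tail = self.stages[-1]
            self.stages = [None] + self.stages[:-1]
            if tail is not None:
                self.stages[0] = tail      # recirculate the ring's own data
                return False               # head not ready for the forwarded value
            if forward_in is not None:
                self.stages[0] = forward_in  # extra data accepted from other ring
                return True
            return False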

if we have *multiple* of these cyclic buffers, split to deal with 32-bit
data in 4 banks (HI-32 ODD, LO-32 ODD, HI-32 EVEN, LO-32 EVEN) we can have
the appearance of up to 16R4W for 32-bit vector operations.

not only that, but "crossing over" between these 4 "lanes" is a simple
matter of putting additional inputs onto the front of the buffer.

i'm also going to recommend that we have at least two of these cyclic
buffers per regfile read/write bank.  this is because, note:

* RD1 is *only* connected to the broadcast bus linked to all Function Unit's
  Reg-Read1.
* RD2 is *only* connected likewise to all FU's Reg-Read2
* etc.

and if a register was read by RD1 but is needed by RD2, we *have* to pass
it over to RD2 on the next clock cycle... but in doing so we *cannot do a
read on the regfile Port 2* on that next clock cycle, because the Cell for
RD2 is *already going to have data in it* (received from the cyclic shift
of data previously in RD1).

therefore, to solve this, i think we should have two rows.  this gives an
opportunity to always hit the regfile with multiple RDs in every clock cycle,
whilst still also being able to pass data over the broadcast bus(es) to any
listening FunctionUnits.

this does however mean that the columns need to collaborate.  there will be
*two* RD1 Cells vying for access to the Reg-Read1 Broadcast Bus.  there will
be *two* RD2 Cells vying for access to the Reg-Read2 Broadcast Bus.
fortunately, this just needs OR-gates (a 2-in 1-out MUX).
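
a sketch of that merge (assuming the two rows are arbitrated so that at most one
asserts valid onto a given Broadcast Bus in any cycle - otherwise the OR would
corrupt the data):

    # valid-gated OR merge of the two RDn Cells onto the one Reg-Readn
    # Broadcast Bus (the "2-in 1-out MUX" mentioned above)
    def merge_onto_bus(row0_valid, row0_data, row1_valid, row1_data):
        bus_valid = row0_valid or row1_valid
        bus_data = (row0_data if row0_valid else 0) | \
                   (row1_data if row1_valid else 0)
        return bus_valid, bus_data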
Comment 5 Luke Kenneth Casson Leighton 2020-05-01 19:33:05 BST
(In reply to Yehowshua from comment #3)

> We don't need a full crossbar. We can use a strong-sense non-blocking benes
> network with a reconfiguration lookup table.
> 
> I can whip one up next week if you want. They use less gates than full
> crossbars...

http://homepages.inf.ed.ac.uk/cgi/rni/comp-arch.pl?Networks/benes.html,Networks/benes-f.html,Networks/menu-dyn.html

oh _that_!  yeah i did one for my 3rd year project in 1991 - actually two of them
back-to-back, termed a "butterfly" network - and demonstrated that a permutation could
be solved in O(log N + N) time with 100% throughput.  it did require that the inputs
to outputs *be* a permutation, otherwise it would not complete the routing.  i ran out
of time to investigate further.

Lambda Networks - which is what the Imperial College team working on the ALICE
Transputer Network in conjunction with Plessey used - and which now go by the name
"Benes Networks" - have the unfortunate property of reducing throughput in an
exponential fashion when presented with arbitrarily-random source-to-destination
requests.

* a 16-to-16 Lambda / Benes Network will have a random-to-random maximum throughput
  of 50%
* a 64-to-64 Lambda / Benes Network will have a random-to-random maximum throughput
  of 25%

and so on.

unless the data is particularly uniform (strictly controlled), it actually
gets pretty bad.

hence the reason why i investigated back-to-back Lambda / Benes Networks, which
fascinatingly have the property of providing *two* possible paths to each
destination, per bit-per-layer.  i used this property to cascade-solve a permutation
and got 100% guaranteed routing.

however... it requires (A) a permutation and (B) close to double the amount of
actual gates and (C) a delay whilst solving the permutation.

yes, yehowshua, i've done a hell of a lot of stuff, in a lot of different areas :)

wait... no, just looking at Figure 2: Figure 2 *is* what i came up with.  sorry:
what Imperial College did in the ALICE Project was a *half* Benes Network (cut
that network in half down the middle).

Lambda Networks have exponential drop-off throughput characteristics (each
layer blocks other layers, in a cascade).  Benes Networks require double the
hardware of Lambda Networks, but the routing decision-making is... quite a
bit more complex, as there are always 2 paths.
Comment 6 Luke Kenneth Casson Leighton 2020-05-01 19:54:22 BST
(In reply to Jacob Lifshay from comment #2)
> Seems like a good idea, however out-of-order (and in-order) processors
> depend on single-cycle forwarding between the results of one operation and
> the input of the next for most of their performance, the forwarding network
> would need to keep that property (haven't fully thought through if this idea
> keeps that property).

luckily, in a Dependency-Matrix-based system, nothing is timing-dependent.
the DMs (the combination of the FU-Regs and FU-FU matrices) preserve a
Directed Acyclic Graph of the relationships between all in-flight operations,
based on the register numbers, *not* based on the time taken to completion
*of* each operation, in any way.  FSMs, pipelines, async FUs, variable-pipelines,
it's all the same to the 6600-DMs.

therefore (the point of explaining that is:) it doesn't _actually_ matter if
the forwarding between result latches at the output side of the Function Units
is delayed even 5-20 cycles in reaching the input latches of the FUs that need
that result.

btw in the original 6600, regfile read was done on the leading edge, regfile write
was done on the falling edge.  this had the fascinating property that the
data coming in on a write could be available on the *same* clock cycle as a
read, effectively making the register file its own "Forwarding Bus".
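
a tiny behavioural model of that 6600 trick (hypothetical, just to show the
write-through effect):

    # read-on-leading-edge, write-on-falling-edge: a read in the same
    # cycle as a write to the same register sees the newly-written value
    class WriteThroughRegfile:
        def __init__(self, nregs):
            self.regs = [0] * nregs

        def cycle(self, writes, reads):
            """writes: {regnum: value}; reads: list of regnums."""
            result = [writes.get(r, self.regs[r]) for r in reads]
            for r, v in writes.items():
                self.regs[r] = v
            return result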

anyway: unlike the 6600, we have a bit of a problem in that we cannot have massive
register file porting.  the 6600's "A" regfile had *FIVE* read ports and *TWO*
write ports!

we are going to be under-ported because it's not really practical to design a
multi-ported (multi-write) SRAM in the time that we have (and their gate count
and power consumption is pretty high).  so, therefore, there will be quite a few
times where data is generated faster than it can be written to the regfile.

therefore, operand-forwarding - even if it's a couple of cycles delayed - is
actually going to be really important.

in the LOAD-STORE Computation Units, a LOAD in UPDATE mode can generate *two*
register writes.  fortunately, the ADDR-GEN side may be available one clock
cycle *before* the actual LOAD result.  however we need to plan for the worst
case: that the two are available simultaneously.

what we do *not* want to happen is that, just because there is only 1W on the
regfile, the two writes from LOAD-UPDATE mode cause a bottleneck through the
regfile when trying to get either of those two writes to the FunctionUnits
waiting for them.

therefore, if the cyclic buffer can forward at least one of those results
into the *read* cyclic buffer, then:

* one of the writes will go through the regfile and (if we design it
  right, a la 6600 regfile) *back* out through one of the read ports
  *on the same clock cycle*

* the other write will go through the end of the write cyclic buffer into
  the beginning of the read cyclic buffer, and if the Function Unit that
  needs that register happens to be RD1 (REQ_RD1 is asserted) it *will*
  get that result on the same clock cycle.

  if not, it will get it on the next clock (when shift-passed to RD2).
Comment 7 Jacob Lifshay 2020-05-01 20:12:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #2)
> > Seems like a good idea, however out-of-order (and in-order) processors
> > depend on single-cycle forwarding between the results of one operation and
> > the input of the next for most of their performance, the forwarding network
> > would need to keep that property (haven't fully thought through if this idea
> > keeps that property).
> 
> luckily, in a Dependency-Matrix-based system, nothing is timing-dependent.
> the DMs (the combination of the FU-Regs and FU-FU matrices) preserve a
> Directed Acyclic Graph of the relationships between all in-flight operations,
> based on the register numbers, *not* based on the time taken to completion
> *of* each operation, in any way.  FSMs, pipelines, async FUs,
> variable-pipelines,
> it's all the same to the 6600-DMs.
> 
> therefore (the point of explaining that is:) it doesn't _actually_ matter if
> the forwarding between result latches at the output side of the Function
> Units
> is delayed even 5-20 cycles in reaching the input latches of the FUs that
> need
> that result.

That's true for correctness, but not for performance. Delaying 20 cycles on every forwarding operation would seriously reduce the performance, probably by at least 10x, so, obviously, that should be avoided.
Comment 8 Luke Kenneth Casson Leighton 2020-05-01 21:23:37 BST
(In reply to Jacob Lifshay from comment #7)

> That's true for correctness, but not for performance. Delaying 20 cycles on
> every forwarding operation would seriously reduce the performance, probably
> by at least 10x, so, obviously, that should be avoided.

sorry, i meant, "delayed not deliberately but because the total bandwidth
(paths) between FU-result latches and FU-input latches is limited".

whether that path is via the regfile or via a forwarding bus doesn't matter:
if the forwarding bus isn't available for 5-10 cycles, there's always the
write-port on the regfile.

Mitch's analysis was, i believe, that we should expect a ratio of... what
was it... 1.3 reg reads to writes?  i.e. each result produced by one FU is
"consumed" by an average of 1.3 other FUs.

he noted that, therefore, even if the forwarding bus is used just once,
forwarding to just one other FU before the value is dropped on the floor,
in a cascade it actually saves a lot of regfile activity.

unfortunately, reducing this regfile pressure assumes that you can detect
write-after-write and also "discards".  that *is* possible, however the
discussion was now over a year ago, and we simply haven't got time in
this round to make the necessary DM modifications.

so what we will gain with a forwarding bus is: decreased latency.  this because
with the register file being a bottleneck, even one extra bus (fwd) means that
some of the operations that would otherwise need to have waited for the write
(and followup read) through the regfile, instead may start that much earlier.

yes, ultimately: if access to that forwarding bus (which _will_ itself be
resource-contended) can be done early (same clock cycle, even), obviously that
latency is decreased much more.

summary: for a 6600-based DM system it's strongly desirable for the fwd
bus to be same (or 1 clock) cycle: it's not however a make-or-break mandatory
requirement.  for an in-order system, on the other hand, it's not only
far more difficult to *add* a forwarding bus, it is, if i recall correctly,
timing-critical.  i.e. the consequences of missing the slot are that the
entire pipeline (right the way back to issue) is forced to stall.
Comment 9 Luke Kenneth Casson Leighton 2020-05-02 03:01:39 BST
https://groups.google.com/d/msg/comp.arch/qeMsE7UxvlI/t6QtE4C2AQAJ

idea: one cyclic buffer *at every FunctionUnit* and turn the buses into general data buses.
Comment 10 Luke Kenneth Casson Leighton 2020-05-02 14:15:16 BST
as can be seen from the comp.arch discussion, a cyclic buffer on *each Function Unit*
seems to have some advantages over just having a global cyclic buffer.

in particular, we discussed 18 months ago the merits of an idea by Mitch Alsup
to use *only* 1R-or-1W SRAMs on the regfile, with serial reads followed by
FMAC (the example) pipeline stages and a single write.  it gives extremely
efficient resource utilisation on parallel workloads.

the problem with that idea is that it absolutely sucks for scalar code.

turns out that if we have a cyclic buffer and turn the "dedicated" RD1/2/3 WR1/2
Broadcast Buses into "general-purpose read / write buses", we get the best
of both worlds:

* in the scalar case, a 3-operand FunctionUnit may use *three* Broadcast Buses
  to receive operand1, operand2 and operand3 in a single cycle

* in the parallel case (or during heavy loading) the lack of availability of
  all but one of the available buses does *not* stop a Function Unit from
  receiving its operands.  even if it takes 3 cycles to receive operands over
  just the one Bus, the fact that it will *also* take 3 cycles to rotate those
  values into the correct Operand Latches *does not matter*

(Yehowshua, this is partly where / why the extra gates of a Benes / Butterfly
Network are overkill: *all* operands need to be delivered before the operation
can proceed, which leaves time to shuffle things into place).

what does need to be thought through is how to preserve the read/write Dependencies.
the crucial one is not reads: those can drop their dependency the moment that
the operand is received (even into the cyclic buffer).  writes, however,
definitely cannot.

whilst they can be _delivered_ via the forwarding bus, the Function Unit *must*
remain busy (not be sent a "Go_Write" signal which will tell it to drop the
dependency) until its result has reached the regfile.

and "in any of the cyclic buffers" is *not* the regfile.

so that needs a little more thought.
Comment 11 Luke Kenneth Casson Leighton 2020-05-02 16:02:55 BST
hmm hmm, the other big concern: just as with a Tomasulo Algorithm, the
Register "ID" (or... what was it... the CAM Row Number or something)
has to be:

(a) Broadcast onto each Common Data Bus along with the value
(b) stored in each and every Function Unit (which means getting it
    *to* that FU as well as having a comparator)

now, the trade-off between making that ID a unary number or a binary number is:

* we will have around 20 Function Units
* there will be 5 ports (and therefore 5 CDBs: 3-read, 2-write)

therefore in binary:

* there will be 500 wires for a 5-bit RegisterID coming *in* to Function
  Units
* there will be 500 wires going *out* of Function Units onto CDBs
* 500 XOR gates will be needed to perform comparisons, and that's a power
  hit that we get on EVERY clock cycle (!)

in unary:

* there will be a whopping THREE THOUSAND wires coming in for a 32-bit
  unary RegisterID
* there will be three thousand going out onto the CDB (!!)
* there would be 3,000 AND gates needed, however the power hit will *only*
  be from a maximum of 5x20=100 of those going active in any one clock cycle
  because they're unary-encoded, and only 1/32 of the 3,200 bits is ever
  active, rather than *all* (5) bits active in the binary case.
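
writing the two options out as arithmetic (20 FUs, 5 CDB ports, 5-bit binary vs
32-bit unary RegisterID):

    # back-of-envelope check of the wire / gate counts quoted above
    NUM_FUS, NUM_PORTS = 20, 5
    BINARY_BITS, UNARY_BITS = 5, 32

    binary_wires = NUM_FUS * NUM_PORTS * BINARY_BITS  # 500 in (and 500 out)
    binary_xors  = NUM_FUS * NUM_PORTS * BINARY_BITS  # 500 XORs, all toggling
    unary_wires  = NUM_FUS * NUM_PORTS * UNARY_BITS   # 3200 ("three thousand")
    unary_ands   = NUM_FUS * NUM_PORTS * UNARY_BITS   # 3200 ANDs, but at most
                                                      # 20*5 = 100 active per cycle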

to be honest, neither of these is particularly attractive! :)

compare and contrast this with the way that the 6600 works:

* The Register Port Buses, although global in nature, are direct-connected
  from Function Unit Operands to *specific* Regfile ports.
* A-Units (address Units) have 2-in, 1-out and those are wired to Regfile
  RD ports 1,2 and Regfile WR port 1
* B-Units (algorithmic units) i think likewise have 2-in, 1-out, and
  go to RD ports 3,4 and Regfile WR port 2
* X-Units have 1-in, 1-out and go to RD port 5 and WR port 3.

therefore:

* the FU-Regs Dependency Matrix captures the information about which regs
  *on which port* each FU shall read (or write) from
* this is done in an IMPLICIT fashion such that there is NO possibility for the
  value being broadcast to be picked up by a second Function Unit

i.e. the Register ID itself is *NOT* actually transmitted over the Bus,
at all.  it's just down to "information capture", back at the FU-Regs
Dependency Matrix.


i wonder... i wonder if there's a way to compute the amount of "shifting"
that would be required, by each FunctionUnit, and have *that* transmitted
to the FU, instead?  this would only be a 2-bit value (assuming a maximum
of 4 read-ports).

it goes like this:

* each row of the FU-Regs Dependency Matrix has captured (already, this
  is part of the job of the FU-Regs DM) which registers the FU requires.
  this *already* encodes which FU Operand Ports it needs
* when a CDB is available, we know its index ID.
* at the time that the CDB is available, we also know, back in the DM,
  which FU Operand Port index requires that value
* the DIFFERENCE between these two indices as binary values becomes EXACTLY
  the amount of shifting required, should the value be transmitted over
  that available CDB.

it's not a vast amount of gates (a 2-bit subtractor per FU per port) and
it's only 2 bits of information to be sent to each Function Unit.  note
however that each FU needs a *different* shift-diff value to be transmitted
to it, for each broadcast value on that CDB!
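
as a sketch (hypothetical helper, assuming a maximum of 4 read ports so the value
fits in 2 bits):

    # the 2-bit "shift-diff": how far the FU's local cyclic buffer must
    # rotate a value arriving on cdb_index to land it on fu_port_index
    def shift_diff(fu_port_index, cdb_index, num_ports=4):
        return (fu_port_index - cdb_index) % num_ports

    # e.g. a value broadcast on CDB 1 but needed on FU operand port 3
    # is accompanied by shift_diff(3, 1) == 2: rotate twice.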

so if a 2-bit subtractor is... ermm... 10 gates(?) then that's:

* 10 (appx) gates for the subtractor (it doesn't need carry)
* times 20 for the number of Function Units
* times 5 for the 3 RD operands and 2 WR operands

1000 gates, as an adjunct to the FU-Regs Dependency Matrix.

however the number of wires is:

* 2 for the shift-diff value
* times 20 for FUs
* times 5 for the operands

a total of 200 wires and *that's* tolerable.

compare this to XOR being four gates, where in the binary-broadcast we'd have
5x20x5 wires (500) but we'd have two THOUSAND gates.
Comment 12 Luke Kenneth Casson Leighton 2020-05-03 12:58:32 BST
hmm because we are aiming for a minimum 4x FP32 FMACs per cycle,
the datapath situation is much "worse" than i described earlier.
(meaning: the pressure to reduce the amount of control and information
wires is much higher).

* 4x FP32 FMACs is 2x 64-bit results
* however it's also 4x3 *read* registers @ FP32 which is 6x 64-bit
  reads.

having 6R2W on the regfile is completely impractical.

this is why we decided to go for the HI/LO-32-ODD/EVEN regfile split
(and to synchro-"pair" 32-bit operations if 64-bit calculations are
required, using PartitionedSignal).

by subdividing, we effectively end up with *four* separate isolated
32-bit-wide register files.
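
a hypothetical helper, just to pin down the 4-way split (the ordering of the
banks here is arbitrary, purely illustrative):

    # map a 32-bit half of a 64-bit register onto one of the four banks:
    # HI-32 ODD, LO-32 ODD, HI-32 EVEN, LO-32 EVEN
    def bank_index(regnum, hi_half):
        odd = regnum & 1
        return (odd << 1) | (1 if hi_half else 0)   # 0..3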

thus there will be *four* separate Common Data Buses, each serving
32-bit data to a *quarter* of the Function Units, each.

i'll draw it out.
Comment 13 Luke Kenneth Casson Leighton 2020-05-03 14:23:01 BST
https://libre-soc.org/3d_gpu/architecture/regfile/

created a new page (for regfiles in general).  it's a near-duplicate
of a very early diagram written out last year, except this time
the PartitionedSignal-based Pipelines and Function Units, as well as
the cyclic buffers, are now included.
Comment 14 Luke Kenneth Casson Leighton 2020-05-03 15:07:06 BST
i've updated the descriptive text at the page:
https://libre-soc.org/3d_gpu/architecture/regfile/

a review and commentary would be appreciated, so that architecturally this is
understood, and any potential errors and omissions are detected.