the issue that we have is that multiplexing between a dozen FUs with multiple
read/write ports, and the available regfile read/write ports, does not quite
match up. full crossbars are completely impractical.
we therefore need something that allows FUs to receive / send values, without
a complete overload of wiring or muxes.
an idea for a cyclic hopper - here is the suggestion, from the comp.arch
discussion, that prompted it:
> Since you separated the signals, can I suggest placing a hopper between
> reading register specifiers out of FUs and the RF ports. Each cycle a
> number of specifiers are dropped into the hopper, and each cycle as many
> RF read ports as you have are read and then delivered to instructions
> at the calculation units. Then each calculation unit picks instructions
> into execution.
*sigh* unfortunately we are a little bit hamstrung by having lost the register
binary indices, using instead the Dependency Matrix unary encoding to
directly enable the regfile row. i mean, we _could_ use the unary encoding
as the register specifier...
by "hopper" do i take it that you mean a cyclic shift register where each
bucket contains the register specifier, type of operation (R/W), and, say, the FU
number?
or, actually, if the register specifier is in unary...
oo i have an idea. it may well be the same idea.
* for all registers being requested by all FUs, merge the unary register indices
into a single "these registers need to be read" vector. i think, in fact, that
the Dependency Matrices might actually have this already (Read Vector
and Write Vector Pending)
* for reads, throw that vector at a regfile "fetcher", which, using the read vector,
simply fetches as many registers as there are available regfile ports
let's say that there are 4R regfile ports
* the 4 read results, along with a *single* bit unary encoding of their corresponding
reg number, are dropped into a cyclic 4-wide buffer (cyclic 4-wide shift reg)
* all FUs' "GoRead1" are connected to port 1 of the cyclic buffer
* all FUs' "GoRead2" are connected to port 2
* all FUs' "GoRead3" are connected to port 3
* entries *remain* in the cyclic buffer for as long as the corresponding bit in the
Read Vector remains set, continuing to cycle through, moving from port 1 to
2, 2 to 3, 3 to 4, 4 back to 1, on each clock
* on any given clock cycle, *all* FUs may read that "broadcasted" entry from their
corresponding cyclic buffer "hopper", as long as the unary reg index matches up
* when the reg index matches up, GO_RD1/2/3 is fired at that FU.
* when the FU has read from that port, it may drop the REQ_RD1/2/3
* this results in the (usual) dropping of the unary bit from the Read Vector back at
the Dependency Matrix
* this in turn also triggers the cyclic buffer to drop the reg value, freeing up that
port for a refill opportunity.
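the sequence above can be sketched behaviourally in a few lines of plain
Python. illustrative model only: Hopper, Slot, FUPort etc. are invented names,
not the real HDL classes, and it simplifies to one outstanding reader per
Read Vector bit:

```python
from typing import Optional

class Slot:
    """one hopper entry: register value plus one-hot (unary) reg index."""
    def __init__(self, unary=0, value=0):
        self.unary = unary    # one-hot register index (0 = empty slot)
        self.value = value

class FUPort:
    """one FU read-operand port, wired to exactly one broadcast port."""
    def __init__(self, req_unary):
        self.req_unary = req_unary       # one-hot reg index of REQ_RD
        self.got: Optional[int] = None   # set when GO_RD fires

class Hopper:
    """4-wide cyclic shift register of regfile read results."""
    def __init__(self, width=4):
        self.slots = [Slot() for _ in range(width)]

    def load(self, port, unary, value):
        """the regfile 'fetcher' drops a read result into a slot."""
        self.slots[port] = Slot(unary, value)

    def step(self, fu_ports, read_vector):
        """one clock: broadcast each slot on its port, clear satisfied
        Read Vector bits, drop dead entries, rotate 1->2->3->4->1."""
        for p, slot in enumerate(self.slots):
            for fu in fu_ports.get(p, ()):   # all FUs on broadcast port p
                if fu.got is None and slot.unary and fu.req_unary == slot.unary:
                    fu.got = slot.value            # GO_RD fires at this FU
                    read_vector &= ~fu.req_unary   # REQ_RD dropped
        # entries whose Read Vector bit cleared are dropped (port freed)
        self.slots = [s if (s.unary & read_vector) else Slot()
                      for s in self.slots]
        # cyclic shift, one position per clock
        self.slots = [self.slots[-1]] + self.slots[:-1]
        return read_vector
```

note the broadcast property falls out for free: two FUs on the same port
waiting for the same register both pick the value up in the same cycle.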
it's by no means perfect: the cyclic buffer moving only one value at a time means
that if an FU has 3 GO_RD/REQ_RD lines, it could have to wait for up to 3 cycles
for the data to cycle round the shift-buffer... *per read-register*.
if however a concerted effort is made to ensure that a REQ_RD1 always tries to drop
the read-request into cyclic-buffer entry 1, REQ_RD2 always tries to drop into entry 2
and so on, then the reads should just pass straight through.
this does have the advantage that if there are multiple FUs waiting for the same
register, then as a "broadcast" bus system, they can in fact get the same value
writes are very similar, and there is the additional advantage that the written value
can be read from the cyclic buffer by any FU that "sees" it, prior to the value landing
at the actual regfile.
thus the cyclic buffer also happens to serve double-duty as a forwarding bus. or,
more to the point, the role and distinction between "register bus" and "forwarding
bus" is now decoupled / moot.
was that sort-of what you meant by "hopper"? :)
Seems like a good idea, however out-of-order (and in-order) processors depend on single-cycle forwarding between the results of one operation and the input of the next for most of their performance, the forwarding network would need to keep that property (haven't fully thought through if this idea keeps that property).
We don't need a full crossbar. We can use a strong-sense non-blocking benes network with a reconfiguration lookup table.
I can whip one up next week if you want. They use less gates than full crossbars...
i believe it may be possible to use existing components for this: the
mask-cancellable Pipeline class (where the "mask" is the GLOBAL_READ_PENDING_VEC
or GLOBAL_WRITE_PENDING_VEC) plus the multi-input pipe:
this allows input from multiple sources, and one of those input sources would
be the *other end* of the pipeline, thus creating a cyclic ring buffer that
has two key additional abilities:
1) the ability to "pause" the data being passed down the pipeline (without
losing any data). this because each stage is protected by ready/valid
2) the ability to "accept" extra data, from (multiple) other sources.
examples of (2) include an operand-forwarding capability. this can be
implemented very simply: the output from the *write* cyclic buffer is connected
also to one of the multi-input ports on the *read* cyclic buffer!
+--> RD1 --> RD2 --> RD3 --+        +--> WR1 --> WR2 --+
|                          |        |                  |
+-----------<--------------+        +--------<---------+
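a toy model of that hookup (plain Python, invented names): the entry leaving
the tail of the write ring is offered both to the regfile write port and to
the read ring's multi-input head - that single connection is the whole of the
"forwarding bus":

```python
class Ring:
    """a cyclic buffer modelled as a list that shifts one slot per clock."""
    def __init__(self, n):
        self.slots = [None] * n

    def rotate(self, incoming=None):
        """shift all slots one place; return the entry falling off the
        tail, placing `incoming` (if any) into the freed head slot."""
        out = self.slots[-1]
        self.slots = [incoming] + self.slots[:-1]
        return out

rd = Ring(3)   # RD1 -> RD2 -> RD3
wr = Ring(2)   # WR1 -> WR2

def clock(write_in=None):
    """one cycle: the entry leaving WR2 goes BOTH to the regfile write
    port (the return value) and into the RD ring's extra input.
    (the real read ring would also recirculate its own tail - omitted
    here for brevity)"""
    forwarded = wr.rotate(write_in)
    rd.rotate(forwarded)       # forwarded result now visible to readers
    return forwarded           # ...and also lands at the regfile
```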
if we have *multiple* of these cyclic buffers, split to deal with 32-bit
data in 4 banks (HI-32 ODD, LO-32 ODD, HI-32 EVEN, LO-32 EVEN) we can have
the appearance of up to 16R4W for 32-bit vector operations.
not only that, but "crossing over" between these 4 "lanes" is a simple
matter of putting additional inputs onto the front of the buffer.
i'm also going to recommend that we have at least two of these cyclic
buffers per regfile read/write bank. this because, note:
* RD1 is *only* connected to the broadcast bus linked to all Function Units'
Reg-Read1
* RD2 is *only* connected likewise to all FUs' Reg-Read2
and if a register was read by RD1 but is needed by RD2, we *have* to pass
it over to RD2 on the next clock cycle... but in doing so we *cannot do a
read on the regfile Port 2* on that next clock cycle, because the Cell for
RD2 is *already going to have data in it* (received from the cyclic shift
of data previously in RD1).
therefore, to solve this, i think we should have two rows. this gives an
opportunity to always hit the regfile with multiple RDs in every clock cycle,
whilst still also being able to pass data over the broadcast bus(es) to any
FU that needs the data.
this does however mean that the columns need to collaborate. there will be
*two* RD1 Cells vying for access to the Reg-Read1 Broadcast Bus. there will
be *two* RD2 Cells vying for access to the Reg-Read2 Broadcast Bus.
fortunately, this just needs OR-gates (a 2-in 1-out MUX).
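a tiny sketch of that OR-merge (plain Python, invented names): a cell is
modelled as a (valid, unary-index, value) tuple, and an inactive cell drives
all-zeros, which is exactly why plain ORs suffice:

```python
def or_merge(cell_a, cell_b):
    """merge two RD Cells onto one Broadcast Bus. because the two rows
    arbitrate such that at most ONE cell drives in any cycle, the 2-in
    1-out 'mux' degenerates to an OR gate per bit."""
    va, ia, da = cell_a
    vb, ib, db = cell_b
    assert not (va and vb), "rows must arbitrate: only one may drive"
    return (va | vb, ia | ib, da | db)
```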
(In reply to Yehowshua from comment #3)
> We don't need a full crossbar. We can use a strong-sense non-blocking benes
> network with a reconfiguration lookup table.
> I can whip one up next week if you want. They use less gates than full
> crossbars...
oh _that_! yeah i did one for my 3rd year project in 1991 - actually two back-to-back,
termed a "butterfly" network - and demonstrated that a permutation could be solved
in O(log N + N) time with 100% throughput. it did require that the inputs to outputs
*be* a permutation, otherwise it would not complete the routing. i ran out of time
to investigate further.
Lambda Networks - which the Imperial College team working on the ALICE Transputer
Network (in conjunction with Plessey) used, and which now go by the name "Benes
Networks" - have the unfortunate property of reducing throughput in an exponential
fashion when presented with arbitrarily-random source-to-destination requests.
* a 16-to-16 Lambda / Benes Network will have a random-to-random maximum throughput
well below 100%
* a 64-to-64 Lambda / Benes Network will have a random-to-random maximum throughput
that is lower still
and so on.
unless the data is particularly uniform (strictly controlled), it actually
gets pretty bad.
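a quick Monte-Carlo sketch of that drop-off. the model is an assumption, not
a measurement of any specific design: a self-routing N x N omega (Lambda)
network of 2x2 switches, one packet per input, uniformly-random destinations,
packets accepted greedily in source order, rejected on any wire conflict:

```python
import random

def throughput(N, trials=300, seed=1):
    """fraction of N random-destination packets routable through an
    N x N self-routing omega network (N = 2**n, n stages)."""
    rng = random.Random(seed)
    n = N.bit_length() - 1              # number of switch stages
    delivered = 0
    for _ in range(trials):
        dests = [rng.randrange(N) for _ in range(N)]
        used = [set() for _ in range(n)]    # occupied wires, per stage
        for s, d in enumerate(dests):
            # shuffle-exchange labelling: after stage i a packet s->d
            # occupies wire (s shifted up, top i+1 bits of d shifted in)
            path = [((s << (i + 1)) | (d >> (n - 1 - i))) & (N - 1)
                    for i in range(n)]
            if all(w not in used[i] for i, w in enumerate(path)):
                for i, w in enumerate(path):
                    used[i].add(w)
                delivered += 1
    return delivered / (trials * N)
```

running this shows the cascade-blocking effect: the bigger the network, the
lower the fraction of random requests that get through in one pass.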
hence the reason why i investigated back-to-back Lambda / Benes Networks, which
fascinatingly have the property of providing *two* possible paths to each
destination, per bit-per-layer. i used this property to cascade-solve a permutation
and got 100% guaranteed routing.
however... it requires (A) a permutation and (B) close to double the amount of
actual gates and (C) a delay whilst solving the permutation.
yes, yehowshua, i've done a hell of a lot of stuff, in a lot of different areas :)
wait... no, just looking at Figure 2: Figure 2 *is* what i came up with. sorry:
what Imperial College did in the ALICE Project was a *half* Benes Network (cut
that network in half down the middle).
Lambda Networks have exponential drop-off throughput characteristics (each
layer blocks other layers, in a cascade). Benes Networks require double the
hardware of Lambda Networks, but the routing decision-making is... quite a
bit more complex, as there are always 2 paths.
(In reply to Jacob Lifshay from comment #2)
> Seems like a good idea, however out-of-order (and in-order) processors
> depend on single-cycle forwarding between the results of one operation and
> the input of the next for most of their performance, the forwarding network
> would need to keep that property (haven't fully thought through if this idea
> keeps that property).
luckily, in a Dependency-Matrix-based system, nothing is timing-dependent.
the DMs (the combination of the FU-Regs and FU-FU matrices) preserve a
Directed Acyclic Graph of the relationships between all in-flight operations,
based on the register numbers, *not* based on the time taken to completion
*of* each operation, in any way. FSMs, pipelines, async FUs, variable-pipelines,
it's all the same to the 6600-DMs.
therefore (the point of explaining that is:) it doesn't _actually_ matter if
the forwarding between result latches at the output side of the Function Units
is delayed even 5-20 cycles in reaching the input latches of the FUs that need
that result.
btw in the original 6600, regfile read was done on the leading edge, regfile write
was done on the falling edge. this had the fascinating property that the
data coming in on a write could be available on the *same* clock cycle as a
read, effectively making the register file its own "Forwarding Bus".
anyway: unlike the 6600, we have a bit of a problem in that we cannot have massive
register file porting. the 6600's "A" regfile had *FIVE* read ports and *TWO*
write ports.
we are going to be under-ported because it's not really practical to design a
multi-ported (multi-write) SRAM in the time that we have (and their gate count
and power consumption is pretty high). so, therefore, there will be quite a few
times where data is generated faster than it can be written to the regfile.
therefore, operand-forwarding - even if it's a couple of cycles delayed - is
actually going to be really important.
in the LOAD-STORE Computation Units, a LOAD in UPDATE mode can generate *two*
register writes. fortunately, the ADDR-GEN side may be available one clock
cycle *before* the actual LOAD result. however we need to plan for worst-case:
that the two are available simultaneously.
what we do *not* want to happen is that just because there is only 1W on the
regfile, the two writes from LOAD-UPDATEmode causes a bottleneck through the
regfile, when trying to get either of those two writes to the FunctionUnits
waiting for them.
therefore, if the cyclic buffer can forward at least one of those results
into the *read* cyclic buffer, then:
* one of the writes will go through the regfile and (if we design it
right, a la 6600 regfile) *back* out through one of the read ports
*on the same clock cycle*
* the other write will go through the end of the write cyclic buffer into
the beginning of the read cyclic buffer, and if the Function Unit that
needs that register happens to be RD1 (REQ_RD1 is asserted) it *will*
get that result on the same clock cycle.
if not, it will get it on the next clock (when shift-passed to RD2).
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #2)
> > Seems like a good idea, however out-of-order (and in-order) processors
> > depend on single-cycle forwarding between the results of one operation and
> > the input of the next for most of their performance, the forwarding network
> > would need to keep that property (haven't fully thought through if this idea
> > keeps that property).
> luckily, in a Dependency-Matrix-based system, nothing is timing-dependent.
> the DMs (the combination of the FU-Regs and FU-FU matrices) preserve a
> Directed Acyclic Graph of the relationships between all in-flight operations,
> based on the register numbers, *not* based on the time taken to completion
> *of* each operation, in any way. FSMs, pipelines, async FUs, variable-pipelines,
> it's all the same to the 6600-DMs.
> therefore (the point of explaining that is:) it doesn't _actually_ matter if
> the forwarding between result latches at the output side of the Function Units
> is delayed even 5-20 cycles in reaching the input latches of the FUs that need
> that result.
That's true for correctness, but not for performance. Delaying 20 cycles on every forwarding operation would seriously reduce the performance, probably by at least 10x, so, obviously, that should be avoided.
(In reply to Jacob Lifshay from comment #7)
> That's true for correctness, but not for performance. Delaying 20 cycles on
> every forwarding operation would seriously reduce the performance, probably
> by at least 10x, so, obviously, that should be avoided.
sorry, i meant, "delayed not deliberately but because the total bandwidth
(paths) between FU-result latches and FU-input latches is limited".
whether that path is via the regfile or via a forwarding bus doesn't matter:
if the forwarding bus isn't available for 5-10 cycles, there's always the
write-port on the regfile.
Mitch's analysis was, i believe, that we should expect a ratio of... what
was it... 1.3 reg reads to writes? i.e. each result produced by one FU is
"consumed" by an average of 1.3 other FUs.
he noted that, therefore, even if the forwarding bus is used just once,
forwarding to just one other FU before the value is dropped on the floor,
in a cascade it actually saves a lot of regfile activity.
unfortunately, to reduce this regfile pressure assumes that you can detect
write-after-write and also "discards". that *is* possible, however the
discussion was now over a year ago, and we simply haven't got time in
this round to make the necessary DM modifications.
so what we will gain with a forwarding bus is: decreased latency. this because
with the register file being a bottleneck, even one extra bus (fwd) means that
some of the operations that would otherwise need to have waited for the write
(and followup read) through the regfile, instead may start that much earlier.
yes, ultimately: if access to that forwarding bus (which _will_ itself be
resource-contended) can be done early (same clock cycle, even), obviously that
latency is decreased much more.
summary: for a 6600-based DM system it's strongly desirable for the fwd
bus to be same (or 1 clock) cycle: it's not however a make-or-break mandatory
requirement. for an in-order system, on the other hand, it's not only
far more difficult to *add* a forwarding bus, it is, if i recall correctly,
timing-critical. i.e. the consequences of missing the slot are that the
entire pipeline (right the way back to issue) is forced to stall.
idea: one cyclic buffer *at every FunctionUnit* and turn the buses into general data buses.
as can be seen from the comp.arch discussion, a cyclic buffer on *each Function Unit*
seems to have some advantages over just having a global cyclic buffer.
in particular, we discussed 18 months ago the merits of an idea by Mitch Alsup
to use *only* 1R-or-1W SRAMs on the regfile, with serial reads followed by
FMAC (the example) pipeline stages and a single write. it's extremely efficient
resource-utilisation on parallel workloads.
the problem with that idea is that it absolutely sucks for scalar code.
turns out that if we have a cyclic buffer and turn the "dedicated" RD1/2/3 WR1/2
Broadcast Buses into "general-purpose read / write buses", we get the best
of both worlds:
* in the scalar case, a 3-operand FunctionUnit may use *three* Broadcast Buses
to receive operand1, operand2 and operand3 in a single cycle
* in the parallel case (or during heavy loading) the lack of availability of
all but one of the available buses does *not* stop a Function Unit from
receiving its operands. even if it takes 3 cycles to receive operands over
just the one Bus, the fact that it will *also* take 3 cycles to rotate those
values into the correct Operand Latches *does not matter*
(Yehowshua, this is partly where / why the extra gates of a Benes / Butterfly
Network are overkill: *all* operands need to be delivered before the operation
can proceed, which leaves time to shuffle things into place).
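the "takes 3 cycles anyway" argument can be shown with a toy model (plain
Python, invented names): three operands delivered serially over ONE bus, one
per clock, into the head of a rotating 3-slot buffer, are spread across all
three operand latches by the time the last one arrives:

```python
def serial_fill(operands):
    """deliver operands one per clock into slot 0 of a per-FU 3-slot
    cyclic buffer, rotating once per clock; return the final latches.
    illustrative model only - the real cyclic buffer carries unary reg
    indices to steer each value to its correct Operand Latch."""
    slots = [None, None, None]
    for op in operands:
        slots[0] = op                      # single broadcast bus -> head
        slots = [slots[-1]] + slots[:-1]   # cyclic rotate each clock
    return slots
```

so receiving over one bus for 3 cycles costs nothing beyond the bus-limited
delivery itself: the rotation happens in the same cycles the FU would have
spent waiting for its remaining operands anyway.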
what does need to be thought through is how to preserve the read/write Dependencies.
the crucial one is not reads: those can drop their dependency the moment that
the operand is received (even into the cyclic buffer). writes, however, definitely
cannot: whilst they can be _delivered_ via the forwarding bus, the Function Unit
*must* remain busy (not be sent a "Go_Write" signal, which would tell it to drop
the dependency) until its result has reached the regfile.
and "in any of the cyclic buffers" is *not* the regfile.
so that needs a little more thought.
hmm hmm, the other big concern: just as with a Tomasulo Algorithm, the
Register "ID" (or... what was it... the CAM Row Number or something)
has to be:
(a) Broadcast onto each Common Data Bus along with the value
(b) stored in each and every Function Unit (which means getting it
*to* that FU as well as having a comparator)
now, the trade-off between making that ID an unary number or a binary number is:
* we will have around 20 Function Units
* there will be 5 ports (and therefore 5 CDBs: 3-read, 2-write)
therefore in binary:
* there will be 500 wires for a 5-bit RegisterID coming *in* to Function
Units
* there will be 500 wires going *out* of Function Units onto CDBs
* 500 XOR gates will be needed to perform comparisons, and that's a power
hit that we get on EVERY clock cycle (!)
* there will be a whopping THREE THOUSAND wires coming in for a 32-bit
unary RegisterID
* there will be three thousand going out onto the CDBs (!!)
* there would be 3,000 AND gates needed, however the power hit will *only*
be from a maximum of 5x20=100 of those going active in any one clock cycle
because they're unary-encoded, and only 1/32 of the 3,200 bits is ever
active, rather than *all* (5) bits active in the binary case.
to be honest, neither of these is particularly attractive! :)
compare and contrast this with the way that the 6600 works:
* The Register Port Buses, although global in nature, are direct-connected
from Function Unit Operands to *specific* Regfile ports.
* A-Units (address Units) have 2-in, 1-out and those are wired to Regfile
RD ports 1,2 and Regfile WR port 1
* B-Units (algorithmic units) i think likewise have 2-in, 1-out, and
go to RD ports 3,4 and Regfile WR port 2
* X-Units have 1-in, 1-out and go to RD port 5 and WR port 3.
* the FU-Regs Dependency Matrix captures the information about which regs
*on which port* each FU shall read (or write) from
* this in an IMPLICIT fashion such that there is NO possibility for the
value being broadcast to be picked up by a second Function Unit
i.e. the Register ID itself is *NOT* actually transmitted over the Bus,
at all. it's just down to "information capture", back at the FU-Regs
Dependency Matrix.
i wonder... i wonder if there's a way to compute the amount of "shifting"
that would be required, by each FunctionUnit, and have *that* transmitted
to the FU, instead? this would only be a 2-bit value (assuming a maximum
of 4 read-ports).
it goes like this:
* each row of the FU-Regs Dependency Matrix has captured (already, this
is part of the job of the FU-Regs DM) which registers the FU requires.
this *already* encodes which FU Operand Ports it needs
* when a CDB is available, we know its index ID.
* at the time that the CDB is available, we also know, back in the DM,
which FU Operand Port index requires that value
* the DIFFERENCE between these two indices as binary values becomes EXACTLY
the amount of shifting required, should the value be transmitted over
that available CDB.
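a minimal sketch of that computation (plain Python; the function name and
port count are illustrative): the rotation a value needs inside an FU's
operand ring is just the difference, mod the number of ports, between the
operand-port index recorded in the FU-Regs DM and the index of whichever
CDB the value happens to be broadcast on:

```python
def shift_diff(fu_port_idx, cdb_idx, nports=4):
    """number of cyclic-buffer rotations needed for a value arriving
    on CDB `cdb_idx` to land in FU operand latch `fu_port_idx`.
    for 4 ports this is a 2-bit value (a 2-bit subtractor, no carry)."""
    return (fu_port_idx - cdb_idx) % nports
```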
it's not a vast amount of gates (a 2-bit subtractor per FU per port) and
it's only 2 bits of information to be sent to each Function Unit. note
however that each FU needs a *different* shift-diff value to be transmitted
to it, for each broadcast value on that CDB!
so if a 2-bit subtractor is... ermm... 10 gates(?) then that's:
* 10 (appx) gates for the subtractor (it doesn't need carry)
* times 20 for the number of Function Units
* times 5 for 3RD Operands and 2WR operands
1000 gates, as an adjunct to the FU-Regs Dependency Matrix.
however the number of wires is:
* 2 for the shift-diff value
* times 20 for FUs
* times 5 for the operands
a total of 200 wires and *that's* tolerable.
compare this to the binary-broadcast case: an XOR is four gates, so although
we'd have only 5x20x5 wires (500), we'd need two THOUSAND gates.
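pulling the bookkeeping from the last few comments into one place (constant
names invented; "XOR ~= 4 gates" and "2-bit subtractor ~= 10 gates" are the
rough figures used above):

```python
FUS = 20        # approx. number of Function Units
PORTS = 5       # 3 read CDBs + 2 write CDBs

# binary 5-bit RegisterID broadcast
binary_wires = 5 * FUS * PORTS        # 500 wires in (and another 500 out)
binary_xor_gates = 4 * binary_wires   # XOR ~= 4 gates: toggles EVERY cycle

# unary 32-bit RegisterID broadcast
unary_wires = 32 * FUS * PORTS        # 3200 wires in (and 3200 out)

# shift-diff scheme: one 2-bit subtractor per FU per port
subtractor_gates = 10 * FUS * PORTS   # ~1000 gates, by the FU-Regs DM
shiftdiff_wires = 2 * FUS * PORTS     # 200 wires to the FUs: tolerable
```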
hmm because we are aiming for a minimum 4x FP32 FMACs per cycle,
the datapath situation is much "worse" than i described earlier.
(meaning: the pressure to reduce the amount of control and information
wires is much higher).
* 4x FP32 FMACs is 2x 64-bit results
* however it's also 4x3 *read* registers @ FP32 which is 6x 64-bit
having 6R2W on the regfile is completely impractical.
this is why we decided to go for the HI/LO-32-ODD/EVEN regfile split
(and to synchro-"pair" 32-bit operations if 64-bit calculations are
required, using PartitionedSignal).
by subdividing, we effectively end up with *four* separate isolated
32-bit-wide register files.
thus there will be *four* separate Common Data Buses, each serving
32-bit data to a *quarter* of the Function Units, each.
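the arithmetic behind the 16R4W "appearance" and the FMAC demand, written
out (assuming, per the earlier 4R example, that each 32-bit bank keeps 4
read ports and a single write port - that per-bank porting is an assumption
here, not a settled decision):

```python
BANKS = 4                     # HI-32 ODD, LO-32 ODD, HI-32 EVEN, LO-32 EVEN
READS_PER_BANK, WRITES_PER_BANK = 4, 1   # assumed 4R1W per 32-bit bank

total_reads = BANKS * READS_PER_BANK     # the "up to 16R..."
total_writes = BANKS * WRITES_PER_BANK   # "...4W" appearance

FMACS = 4                     # target: 4x FP32 FMAC per cycle
reads_needed = FMACS * 3      # 12x FP32 reads  (= 6x 64-bit)
writes_needed = FMACS * 1     # 4x FP32 writes  (= 2x 64-bit)
```

the point: the banked arrangement covers the FMAC demand (12 reads, 4 writes
per cycle) where a monolithic 6R2W 64-bit regfile could not be built.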
i'll draw it out.
created a new page (for regfiles in general), it's a near-duplicate
of a very early diagram written out last year, except this time
the PartitionedSignal-based Pipelines, and Function Units, as well as
the cyclic buffers, are now included.
i've updated the descriptive text at the page:
review and commentary would be appreciated, so that architecturally this is
understood, and any potential errors and omissions are detected.