I thought I'd write this down before I forget: I was thinking about a hypothetical cpu design that has register renaming and 6600-derived dependency matrices, and I realized that if, instead of having a separate register file, we simply have one register per FU (the register that FU's corresponding instruction writes to) -- so that the FUs *are* the register file -- it totally eliminates the need for the FU-Regs dependency matrix, leaving only the FU-FU matrix.

register renaming can take care of allocating FUs that are not in use, at the point where physical registers would normally be allocated when renaming newly decoded instructions.
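a minimal python sketch of what i mean (all names here are invented for illustration, this is not code from power-cpu-sim or soc.git): the rename map points ISA-level registers directly at FU output slots, so "allocating a physical register" and "allocating an FU" become the same operation:

# hypothetical sketch: FUs-as-register-file renaming.
# none of these names exist in soc.git; they are illustrative only.

class FU:
    def __init__(self, idx):
        self.idx = idx
        self.busy = False      # FU still computing
        self.result = None     # the FU's single output register

class RenameMap:
    def __init__(self, n_isa_regs, fus):
        self.fus = fus
        # ISA register -> FU whose output register holds its current value
        self.map = [None] * n_isa_regs

    def rename(self, insn):
        # read operands come straight from other FUs' output registers,
        # which is exactly what the FU-FU matrix would track
        srcs = [self.map[r] for r in insn.reads]
        # allocating the destination = picking a free FU: there is no
        # separate physical register file, hence no FU-Regs matrix needed
        fu = next((f for f in self.fus if not f.busy), None)
        if fu is None:
            return None            # stall: no free FU
        fu.busy = True
        self.map[insn.writes] = fu
        return fu, srcs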
two parts:

part 1
------

the architectural design for register renaming is already done, following a discussion last year with Mitch Alsup on comp.arch. this is the table for the port management: https://libre-soc.org/3d_gpu/isa_to_virtual_regs/

requests for access to read/write a register file port come in on the columns (bottom), and go out on the rows (left).

also required for Write-after-Write (which is basically what register renaming is) is a pair of FU-Regs and FU-FU Matrices, adapted to work "in reverse" (see comp.arch discussion at the url), with the names "FU" replaced by "VirtualReg" and "Regs" replaced by "Physical", i.e. the Matrices become named "VirtualReg-PhysicalReg" and "VirtualReg-VirtualReg".

the matrices are not quite exactly the same as FU-Regs and FU-FU: the driving force (decision logic) goes in the *opposite* direction. i don't recall the exact details of the comp.arch discussion, it was 14 months ago. looks like this was it: https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM/m/w8jAF56fBgAJ

ah, it was the Shadowing. the Shadowing operates in the opposite direction.

part 2
------

you *may* be referring to reducing the size of the FU-Regs Matrices. a technique there is, instead of one FU-Regs column per register, to map multiple registers down to one column. the most ridiculous but perfectly legitimate version of that is to map *all* registers down to a single hazard bit per read and per write port.

for the SPR regfile this may be directly necessary *right now*, because there are over 100 entries and it is flat-out completely impractical to have a 100-wide SPR regfile hazard bitvector. the upshot of such a drastic decision is that *only one instruction at a time may be in-flight* which wishes to write to the entire SPR regfile, and in a simple design that may be exactly what is needed when writing to the SPR regfile.

less draconian decisions can also be made by allowing certain SPRs to have their own dedicated Hazard column, such as for the Galois Field instruction that sets the modulo. this is already partly done in the form of splitting up the XER SPR, which has a very special 3-entry 2-bit regfile (SO, CA, OV, where the top - 2nd - bit of SO is ignored), and consequently it has its own 3-wide Hazard Protection.

bottom line is that you do not have to think of the mapping of hazard protection as being *exactly* one-to-one and onto registers: FU-Regs protects *contended resources*. defining what those "contended resources" are is entirely up to you. (a small illustrative sketch of such a mapping is at the end of this comment.)

part 3
------

combining those two: i believe what you are saying is that you are thinking in terms of a dedicated FU (or, more accurately, dedicated Reservation Station) per "Virtually-Allocated-Register". in other words, in OoO Tomasulo terminology, it would be the "Reorder Buffer" (ROB) row - the ROB in-flight row number.

this is definitely WaW and definitely requires a FU-Regs + FU-FU pair (renamed to VirtRegs-RealRegs / Virt-Virt, or better ROBInFlightRow-RealRegs / ROBInFlightRow-ROBInFlightRow) or, if either single-issue, or absolute hell-on-earth multi-ported lookup, or severe limitations on the routing are tolerable, a CAM.
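to make the part-2 column-sharing idea concrete, here is a minimal python sketch (illustrative only, nothing to do with the actual soc.git scoreboard code): a mapping function decides which hazard column a given (regfile, register) lands on, so the whole SPR regfile can collapse to a single column while a few special resources keep their own.

# hypothetical sketch of "contended resource" mapping for FU-Regs columns.
# names and column numbers are invented, not taken from soc.git.

# dedicated hazard columns for special-cased resources
DEDICATED = {
    ('XER', 'so'): 0,
    ('XER', 'ca'): 1,   # covers CA/CA32
    ('XER', 'ov'): 2,   # covers OV/OV32
}
SPR_SHARED_COLUMN = 3   # every other SPR shares this single column
INT_BASE = 4            # INT r0..r31 get one column each: 4..35

def hazard_column(regfile, reg):
    """map a (regfile, register) pair onto a FU-Regs hazard column."""
    if (regfile, reg) in DEDICATED:
        return DEDICATED[(regfile, reg)]
    if regfile == 'SPR':
        # all ~110 SPRs collapse to one bit: only one in-flight SPR write
        return SPR_SHARED_COLUMN
    if regfile == 'INT':
        return INT_BASE + reg   # reg is the GPR number here
    raise ValueError(regfile)

# e.g. hazard_column('SPR', 'DEC') == hazard_column('SPR', 'SRR0') == 3,
# so two in-flight SPR writers conflict (conservatively, but correctly).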
ok, i thought a bit more, and i don't believe this idea will be useful in exactly the way that it is originally envisaged (if i understand it correctly), and it's quite simple: it would only be useful for instructions that have one read (and no writes) or one write (and no reads). direct-and-only association of FU-with-a-reg is a severe restriction.

at present there are some FUs with *six* read registers. actually, seven:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/pipe_data.py;hb=HEAD

 10     regspec = [('INT', 'ra', '0:63'),      # 64 bit range
 11                ('INT', 'rb', '0:63'),      # 64 bit range
 12                ('CR', 'full_cr', '0:31'),  # 32 bit range
 13                ('CR', 'cr_a', '0:3'),      # 4 bit range
 14                ('CR', 'cr_b', '0:3'),      # 4 bit range
 15                ('CR', 'cr_c', '0:3')]      # 4 bit: for CR_OP partial update

(there are also 3 output regfile ports on CR: this allows transfer in from RA to CR and from CR to RT)

the SPR regspecs are six in and six out:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/spr/pipe_data.py;hb=HEAD

 19     regspec = [('INT', 'ra', '0:63'),       # RA
 20                ('SPR', 'spr1', '0:63'),     # SPR (slow)
 21                ('FAST', 'fast1', '0:63'),   # SPR (fast: LR, CTR etc)
 22                ('XER', 'xer_so', '32'),     # XER bit 32: SO
 23                ('XER', 'xer_ov', '33,44'),  # XER bit 34/45: CA/CA32
 24                ('XER', 'xer_ca', '34,45')]  # bit0: ov, bit1: ov32

the reason why there are six is because mtspr needs to construct (or change) the three XER registers.

there are *six* separate regfiles, a seventh is necessary for the FPR, and an eighth is under consideration (MSR bits, similar to XER bits) to speed up context-switching (writing MSR.pr as its own bit instead of to the entire MSR) and other operations.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD

  6     * INT regfile   - 32x 64-bit
  7     * SPR regfile   - 110x 64-bit
  8     * CR regfile    - CR0-7
  9     * XER regfile   - XER.so, XER.ca/ca32, XER.ov/ov32
 10     * FAST regfile  - CTR, LR, TAR, SRR1, SRR2
 11     * STATE regfile - PC, MSR, (SimpleV VL later)

given the heavy level of inter-connectedness across each pipeline between different regfile types, it is simply not practical to consider allocating one entire regfile for the entire lot, in order to have a single FU deal with a single register.

all the ALUs use (update) XER and CR. even LD/ST requires the addition of XER.SO and CR (which will be... complicated):

 23     regspec = [('INT', 'o', '0:63'),   # RT
 24                ('INT', 'o1', '0:63'),  # RA (effective address, update)
 25                # TODO, later ('CR', 'cr_a', '0:3'),
 26                # TODO, later ('XER', 'xer_so', '32')

this is to be able to support the arithmetic ld/st (and cix) instructions properly, which set CR0 and therefore also set XER.SO, which i can tell you will be a royal pain given that LDSTCompUnit is a special (non-standard) FU.

bottom line is that due to the supercomputing nature of the Power ISA, the idea of dedicating one single virtual in-flight regfile entry to one single Function Unit is a non-starter.
This idea is intended for a cpu where all micro-ops only write to one register each... it can be extended by having multiple output registers per FU (assuming all outputs are written simultaneously) -- equivalent to one wide output register with fields for several ISA-level registers' values -- e.g. add. would have an output field for RT and for CR0.

The idea uses a mostly traditional register renaming scheme where ISA-level registers are renamed into "physical registers" or "hardware registers" (traditional names -- these are associated 1:1 with FUs' output registers in this idea) as a pipeline stage in the fetch-decode pipe, immediately after decode and immediately before dispatching instructions to FUs.

For an example of what I mean by traditional register renaming, see: https://ftp.libre-soc.org/power-cpu-sim/ (requires javascript and a modern browser) but mentally add a dispatch stage after renaming. Code: https://salsa.debian.org/Kazan-team/power-cpu-sim/-/tree/10b113faab52890dd77809096d5a664ece6b069e

For cases such as:

add r0, ...
add r1, ...
add r2, ...
...
add r31, ...

where you likely need more physical registers than the number of FUs for any one ALU: we could add a separate ALU that just copies the input value into cold-storage (aka a move-only ALU) -- this'd be the only ALU that breaks the 1 FU / 1 physical register rule -- it would have enough registers for all ISA-level registers to fit, but only a few FUs.

the register rename stage would insert copy ops before reallocating a physical register/FU (belonging to a normal ALU pipe) if it was still used by any ISA-level registers -- the rename stage would stall until nothing was reading the to-be-reallocated FU's output and until that FU had computed its output (a stall happens anyway when all FUs for an ALU are still computing). a rough sketch of that rename/spill loop is below.
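rough python sketch of that rename-with-spill behaviour (purely illustrative: the names here are invented, this is not the power-cpu-sim code):

# hypothetical sketch: rename stage that spills to a move-only ALU before
# reusing an FU's output register. all names invented for illustration.

class FU:
    def __init__(self):
        self.busy = False     # still computing
        self.readers = 0      # in-flight ops reading this output register

STALL, OK = "stall", "ok"

def rename_and_dispatch(insn, rename_map, alu_fus, cold_storage, copy_queue):
    """insn: has .dest (ISA reg number).
    rename_map: ISA reg -> FU (or cold-storage slot) holding its value.
    cold_storage: the move-only ALU, one slot per ISA-level register."""
    fu = next((f for f in alu_fus if not f.busy), None)
    if fu is None:
        return STALL                    # all FUs for this ALU still computing

    # if any ISA-level register still lives in this FU's output register,
    # a copy op must be inserted first, moving it into cold storage
    for isa_reg, holder in list(rename_map.items()):
        if holder is fu:
            if fu.readers:
                return STALL            # someone still reads it: wait
            copy_queue.append((fu, cold_storage[isa_reg]))   # "mv" micro-op
            rename_map[isa_reg] = cold_storage[isa_reg]

    # the FU's output register is now genuinely free: it *is* the renamed dest
    rename_map[insn.dest] = fu
    fu.busy = True
    return OK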
(In reply to Luke Kenneth Casson Leighton from comment #2)
> ok, i thought a bit more, and i don't believe this idea will be useful
> in exactly the way that it is originally envisaged (if i understand it
> correctly), and it's quite simple: it would only be useful for instructions
> that have one read (and no writes) or one write (and no reads).

umm, can't a FU simultaneously depend on the outputs of several other FUs? you seem to have forgotten this...
(In reply to Jacob Lifshay from comment #3)
> This idea is intended for a cpu where all micro-ops only write to one
> register each...

that's six separate Function Units in some cases for the Power ISA. Load/Store would become five Function Units. ShiftRot would be three. Condition Register CR-ops would become three.

remember that if you *don't* allocate enough FUs, the only option is to stall. so although the CR0 FU could be shared between different FUs, there have to be enough to hold the entire set of in-flight Reservations expected.

a large high-end (3+ ghz) multi-issue (8-issue) system normally has a THOUSAND instructions in-flight at any one time. you're talking about splitting each one up into between three to *six* operations, which would be six **THOUSAND** Function Units with in-flight Reservations.

> e.g. add. would have an output field for RT and for CR0.

and another for XER.SO and another for XER.CA and another for XER.OV. that's five, not two.

yes, some instructions will not set XER.CA, or not set XER.OV, or not set XER.SO: this is determined by the output itself, by the pipeline itself. the Reservation unfortunately still has to be made, because the Function Unit *might* write.

soc/fu/alu/output_stage.py:

 30         comb += oe.eq(op.oe.oe & op.oe.ok)
 31         with m.If(oe):
 32             # XXX see https://bugs.libre-soc.org/show_bug.cgi?id=319#c5
 33             comb += xer_so_o.data.eq(xer_so_i[0] | xer_ov_i[0]) # SO
 34             comb += xer_so_o.ok.eq(1)

this logic - tiny as it is - would need to move to an entirely separate Function Unit. the subsequent lines to another separate Function Unit:

 35             comb += xer_ov_o.data.eq(xer_ov_i)
 36             comb += xer_ov_o.ok.eq(1) # OV/32 is to be set

that's for every ALU that has XER.SO/OV/CA, and there are several. i think you'll find that this results in an alarmingly-high number of Reservation Stations and consequently absolutely massive Dependency Matrices.

(In reply to Jacob Lifshay from comment #4)
> umm, can't a FU simultaneously depend on the outputs of several other FUs?
> you seem to have forgotten this...

through the FU-FU Matrix, yes. and it's multi-read as well as multi-write capable:

https://libre-soc.org/3d_gpu/fu_dep_cell_multi_6600.jpg

here's the source code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD

it outputs readable_o and writeable_o, both of which are true (on a per-FU basis) if there are no remaining write hazards (for readable_o) or no remaining read hazards (for writeable_o) respectively.

* read-hazards are tracked per src operand in the FU. for ALU that is:
  - RA
  - RB
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV

* write-hazards are tracked per dest operand in the FU. for ALU that is:
  - RT
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV

the latches go HI at Issue time and remain HI until the Great-Big-OR-Gate for Read-Reg-Deps and Write-Reg-Deps of the corresponding FU-Regs Row says that all Read-deps are cleared or all Write-deps are cleared... *on a per-port* basis.

you can see from the diagram that on the READ side there is an OR gate per FU-FU cell. only when every source register latch in the FU-FU cell is cleared will the WRITE-WAIT signal go HI, indicating that this FU no longer blocks any other FUs from WRITING.

likewise for the WRITE side there is a corresponding OR gate to create a READ-WAIT signal, which only goes HI when all dest SRC latches (RT, CR0, XER.SO, XER.CA, XER.OV) go LOW, indicating that this FU no longer blocks any other FUs from READING.
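a rough behavioural (software) python model of that per-FU readable_o / writeable_o OR-reduction - purely illustrative, this is *not* the nmigen code in fu_fu_matrix.py, and the names are invented:

# per-src latch goes HI while a prior FU still has to WRITE that operand;
# per-dest latch goes HI while another FU still has to READ that operand.

class FUHazards:
    def __init__(self, srcs, dests):
        self.write_hazard = {s: False for s in srcs}    # e.g. RA, RB, XER.SO...
        self.read_hazard = {d: False for d in dests}    # e.g. RT, CR0, XER.SO...

    def issue(self, pending_writers, pending_readers):
        # latches go HI at Issue time, per operand
        for s in pending_writers: self.write_hazard[s] = True
        for d in pending_readers: self.read_hazard[d] = True

    def writer_done(self, src):  self.write_hazard[src] = False
    def reader_done(self, dest): self.read_hazard[dest] = False

    @property
    def readable_o(self):
        # true when *no remaining write hazards* on any src operand:
        # the per-cell OR gate over the src latches, inverted
        return not any(self.write_hazard.values())

    @property
    def writeable_o(self):
        # true when *no remaining read hazards* on any dest operand
        return not any(self.read_hazard.values())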
reducing that down to one write per FU *does not* make the need to actually track that write go away. all it does is move that need somewhere else. so where at the moment the FU-FU DM can track N write-dependencies per FU, you are talking about having N times more FUs, each with only single write-dep tracking (read-src tracking is still required).

also it does not take away the need for the READ-side tracking, which must still be duplicated across all those FUs. and given that FU-FU is an O(N^2) resource, the effect on gate count could be catastrophically high (several million gates - rough illustrative numbers at the end of this comment).

there's something else that i can't quite put my finger on that's making me... nervous / twitchy. it could just be the numbers involved (the number of RSes). given that it took four *months* for me to implement Mitch Alsup's 2nd book chapter idea, and that when we went over it we found the idea of replacing the FU-FU Matrix with a bitvector was flawed, the idea of altering such a critical low-level algorithm - when even Mitch could have got it wrong, and given how long it took to find that out - makes me quite nervous, mainly because of the amount of time it takes to properly evaluate these things.
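to put rough numbers on that O(N^2) point (back-of-envelope only: the per-cell gate cost and FU counts below are invented placeholders, not measured figures from soc.git):

GATES_PER_CELL = 20          # assumed rough cost of one FU-FU cell

def fu_fu_gates(n_fus):
    # one cell per FU pair, so the matrix grows quadratically with FU count
    return n_fus * n_fus * GATES_PER_CELL

base = fu_fu_gates(100)      # e.g. 100 in-flight Reservations
split = fu_fu_gates(600)     # same workload split into ~6 single-write FUs

print(base, split, split / base)
# 200000 7200000 36.0  -> 6x more FUs is ~36x the FU-FU Matrix gates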
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #3)
> > This idea is intended for a cpu where all micro-ops only write to one
> > register each...

I think you're misunderstanding still: the output can have multiple fields in *one* FU's output register:

+--------------+
| FU           |
|              |
|  output reg  |
| +----+-----+ |
| | RT | CR0 | |
| +----+-----+ |
|              |
+--------------+

the register renaming unit would tell the FUs that read from that output register which field to read, if it isn't already obvious from the type of input for the instruction. e.g.:

addi. r1, r3, 45
sub r4, r5, r1

the sub is told by the register renamer that its RB input comes from the addi.'s RT output field in the addi.'s FU's output register (which holds both r1 and cr0).
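in (hypothetical, invented-names) python terms, the rename map just stores a (FU, field) pair per ISA-level register instead of a plain physical register number - this is not power-cpu-sim or soc.git code, just a sketch:

rename_map = {}   # ISA-level reg name -> (fu_id, field)

def rename_writes(insn_outputs, fu_id):
    # e.g. addi. writes RT and CR0: both live in the same FU output
    # register, just in different fields
    for isa_reg, field in insn_outputs:
        rename_map[isa_reg] = (fu_id, field)

def rename_reads(insn_inputs):
    # each source resolves to "read field F of FU N's output register";
    # regs not in the map come from the architectural/cold-storage file
    return [rename_map.get(r, ("archfile", r)) for r in insn_inputs]

# addi. r1, r3, 45   -> allocated to (say) FU 7
rename_writes([("r1", "RT"), ("cr0", "CR0")], fu_id=7)
# sub r4, r5, r1     -> its RB operand reads FU 7's RT field
print(rename_reads(["r5", "r1"]))   # [('archfile', 'r5'), (7, 'RT')]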
one part that is very useful (or required) is for the decoder to determine exactly which registers each instruction reads/writes, instead of what we do now (which I think is a major design flaw), where nothing knows exactly which registers will be written until the ALU finishes computing and sets the output valid flags.
(In reply to Jacob Lifshay from comment #6)
> I think you're misunderstanding still: the output can have multiple
> fields in *one* FU's output register:
>
> +--------------+
> | FU           |
> |              |
> |  output reg  |
> | +----+-----+ |
> | | RT | CR0 | |
> | +----+-----+ |
> |              |
> +--------------+

[that's five output registers, not two]

this creates false (unnecessary) dependencies that in turn put pressure on the read/write ports of the regfile and/or on the number of FUs required to hold in-flight data, due to missed write opportunities.

it is one of the (not many) flaws of the original 6600 design that there is only one GO_READ per Computation Unit. in the original 6600 design, because there is only one GO_READ flag, the Function Unit may *only* read from *all* regfile ports when *all* regfile ports are cleared.

in the design that i have implemented, there is one GO_READ per regfile port and one GO_WRITE per regfile port. NOT one GO_READ per *Function Unit* and one GO_WRITE per Function Unit.

without this, even if RT is free to write, the Function Unit is FORCED to wait until the Condition Register file is also free to write (and the other way round, as well). this again puts pressure on the number of Function Units required, because the in-flight data is now waiting around for much longer than is necessary.

[any time that ONLY INT is available for write is a missed opportunity. any time that ONLY CR is available for write is a missed opportunity. BOTH INT *AND* CR must be free and clear]

the original 6600 compensated for this by having a whopping five read ports and two write ports on the "A" regfile (i think), which is absolutely enormous for the time. either way there is pressure on either:

* the number of regfile ports required
* the number of Function Units required, to hold in-flight data

also the effectiveness of Operand-Forwarding may be adversely affected, but that's a separate analysis.
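a tiny behavioural illustration of that per-port point (invented names, not the actual MultiCompUnit logic): with a single GO_WRITE the FU waits for *all* ports, with per-port GO_WRITE each result leaves as soon as its own port is free:

def can_write_single_go(ports_free):
    # one GO_WRITE per Function Unit: every requested port must be free
    # before *anything* is written (the 6600-style single-flag behaviour)
    return all(ports_free.values())

def writable_per_port(ports_free):
    # one GO_WRITE per regfile port: each result leaves independently,
    # freeing that part of the FU's in-flight data earlier
    return {port: free for port, free in ports_free.items() if free}

state = {"INT.RT": True, "CR.CR0": False, "XER.SO": True}
print(can_write_single_go(state))   # False: forced to wait for CR
print(writable_per_port(state))     # {'INT.RT': True, 'XER.SO': True}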
(In reply to Jacob Lifshay from comment #7)
> one part that is very useful or required is for the decoder to determine
> exactly which registers each instruction reads/writes, instead of doing what
> we do (which I think is a major design flaw) where nothing knows exactly
> what registers will be written or not until the ALU finishes computing and
> sets the output valid flags.

you've completely misunderstood. the decoder knows perfectly well which registers are read and written: it has to. i've just spent the past three weeks making sure that it does. here is the source code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/power_regspec_map.py;hb=HEAD

what you have completely missed is that the logic that was added right back at the start - two years ago - is an opportunity for optimisation. what you are describing as a "major design flaw" is a perspective that is total nonsense. let's go through it.

no FU can ever be permitted to operate without its write-hazards being monitored. therefore even if a Function Unit only **MIGHT** write, the write hazard **MUST** - repeat **MUST** - STILL BE REQUESTED. this is blindingly obvious up to this point: failure to note any hazards results in catastrophic unrecoverable data corruption.

now, it so happens that only the Function Unit itself can determine whether things such as XER.SO actually need to be written, because writing to XER.SO is determined from the *input*, which is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is actually issued. therefore we are FORCED to make that write-reservation JUST IN CASE.

ONLY once the 64-bit result has been computed can the *PIPELINE* determine - at that point and at that point only - whether XER.SO needs to be written to, or whether it does not. and if it does not, then this is great! the write-reservation can be dropped!

what that in turn means is that:

* a completely redundant (unnecessary) write to a regfile port is dropped
* register file port pressure is consequently reduced
* the Function Unit may complete EARLY, which in turn REDUCES the pressure on Function Unit Reservations
* with the reduced pressure on Function Unit Reservations comes, in turn, the possibility of an Order(N^2) reduction in gate count of the Dependency Matrices, due to a reduction in the number of FUs

overall it is a SIGNIFICANT RESOURCE SAVING. it is complete utter nonsense to categorise such resource savings as "major design flaws". this is just part of how the Power ISA works, and it is slightly alarming to me that after two years you are not familiar with these subtleties.
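in deliberately simplified python form (invented names, not the soc.git core.py logic), the lifecycle described above is: reserve at issue, then either honour or drop the write at completion, depending on what the pipeline reports:

class WriteReservations:
    def __init__(self):
        self.pending = set()          # (fu, reg) write-hazards currently held

    def issue(self, fu, decoded_write_regs):
        # the decoder only knows which regs MIGHT be written, so a
        # write-reservation is made for every one of them, just in case
        for reg in decoded_write_regs:
            self.pending.add((fu, reg))

    def complete(self, fu, pipeline_outputs):
        # only now, with the result computed, do the per-output "ok" flags
        # say which writes are actually needed
        writes = []
        for reg, (value, ok) in pipeline_outputs.items():
            self.pending.discard((fu, reg))      # hazard cleared either way
            if ok:
                writes.append((reg, value))      # real write: uses a port
            # if not ok: the reservation is simply dropped - no port used,
            # and the FU completes earlier than a "must always write" scheme
        return writes

r = WriteReservations()
r.issue("ALU0", ["RT", "CR0", "XER.SO"])
print(r.complete("ALU0", {"RT": (5, True), "CR0": (0b0100, True),
                          "XER.SO": (0, False)}))   # XER.SO write dropped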
(In reply to Luke Kenneth Casson Leighton from comment #8)
> [that's five output registers, not two]

yeah, yeah... i don't want to type 5 registers. also, not all ALUs that can add necessarily must support all possible add-like instructions, they may just support add[i][.] and nothing else, so rt and cr0 would be sufficient there.

> this creates false (unnecessary) dependencies that in turn puts pressure
> on the read/write ports of the regfile and/or on the number of FUs required
> to hold in-flight data due to missed write opportunities.

well... there is always exactly one write port on the register, cuz the register is only ever written by its corresponding ALU. whenever an instruction has finished executing, it always immediately writes its result to the corresponding register; there is never any reason to delay at all once execution has started, so... there are never missed write opportunities.

re: unnecessary dependencies: other than load-update (which is weird and could be split into 2 uops: addr calc and load), basically all instructions can/should write all outputs simultaneously (when they finish the last stage of their execution pipe), so we never really have the case where some instruction's cr0 output is ready 5 cycles earlier than its rt output.
(In reply to Luke Kenneth Casson Leighton from comment #9)
> now, it so happens that only the Function Unit itself can determine,
> itself, whether things such as XER.SO actually need to be written,
> because writing to XER.SO is determined from the *input*, which
> is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is
> actually issued.

well, I'm approaching it from the perspective of: the instruction is fully known at decode time. if the instruction is an addi, then it never writes SO, and any successive instructions that read SO ignore the addi, not waiting for it. if it's addo, then it *always* writes SO, writing 0 if necessary, and any successive instructions that need SO will *always* wait for the addo.

it never *maybe* writes anything, cuz that has questionable benefits and requires additional logic; it's always completely determined at decode time.
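a tiny sketch of what I mean (an invented illustrative table, obviously not the real microwatt-derived decoder CSVs): the write-set is a pure function of the opcode, never of the data:

WRITE_SET = {
    "addi":  {"RT"},
    "addi.": {"RT", "CR0"},
    "addo":  {"RT", "XER.SO", "XER.OV"},   # always writes SO, even if it's 0
    "addo.": {"RT", "CR0", "XER.SO", "XER.OV"},
}

def regs_written(opcode):
    # fully determined at decode time: no "maybe writes", so the renamer /
    # dependency tracking never has to make a just-in-case reservation
    return WRITE_SET[opcode]

print(regs_written("addi"))   # {'RT'} - readers of SO never wait on an addi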
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > now, it so happens that only the Function Unit itself can determine,
> > itself, whether things such as XER.SO actually need to be written,
> > because writing to XER.SO is determined from the *input*, which
> > is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is
> > actually issued.
>
> well, I'm approaching it from the perspective of: the instruction is fully
> known at decode time, if the instruction is an addi, then it never writes
> SO, and any successive instructions that read SO ignore the addi, not
> waiting for it.

that's what the PowerDecoder2 does. it's always done that, because we got that trick from Microwatt.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=edf2893b3dec4749822db7d926efb4eaa0eea9b2;hb=HEAD#l966

 966         # rc and oe out
 967         comb += self.do_copy("rc", dec_rc.rc_out)
 968         if self.svp64_en:
 969             # OE only enabled when SVP64 not active
 970             with m.If(~self.is_svp64_mode):
 971                 comb += self.do_copy("oe", dec_oe.oe_out)
 972         else:
 973             comb += self.do_copy("oe", dec_oe.oe_out)

(this is where you can see that the rule about OE being entirely ignored in SVP64 is implemented).

"listening" to Rc and OE comes from the CSV files, which originally come from microwatt decode1.vhdl. therefore addi *DOES NOT* require XER.SO to be written.

here - source code which i have already referred you to and you clearly haven't read or asked questions about, just made arbitrary fundamental assumptions:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_regspec_map.py;h=7c066d7dbd691c089a13d981b3851a02ee1f1f89;hb=HEAD#l88

  85     if name == 'xer_so':
  86         # SO needs to be read for overflow *and* for creation
  87         # of CR0 and also for MFSPR
  88         rd = RegDecodeInfo(((e.do.oe.oe[0] & e.do.oe.ok) |
  89                             (e.xer_in & SO == SO) |
  90                             (e.do.rc.rc & e.do.rc.ok)), SO, 3)

note there that if the PowerDecoder2 is instructed to ignore XER.SO, it does so, and does not even request the Read Port.

there is a corresponding piece of code - which you clearly haven't looked at - which likewise performs the same check on the XER.SO write port:

 179     if name == 'xer_so':
 180         wr = RegDecodeInfo(e.xer_out | (e.do.oe.oe[0] & e.do.oe.ok),
 181                            SO, 3) # hmmm

thus: addi DOES NOT request XER.SO, read or write. only those instructions which MIGHT require XER.SO actually request it.

and therefore there is no "major design flaw", just an alarming lack of working knowledge on your part as to the internals of the design, even after two years, leading you to create "alternative designs that you think are better" rather than talking about it months ago, and resulting in you spending significant time *not* helping us fulfil our goals and obligations. classic "Not Invented Here" syndrome, i'm sorry to have to point out.

> if it's addo, then it *always* writes SO, writing 0 if
> necessary, and any successive instructions that need SO will *always* wait
> for the addo. it never *maybe* writes anything, cuz that has questionable
> benefits and requires additional logic, it's always completely determined at
> decode time.

no, it's not. you've failed to listen to what i wrote. it is *not possible* to determine entirely at instruction decode time whether XER.SO needs to be written to. yes it is an optimisation, but an important one. you are correct in that actually, with XER.SO not being popularly used, it's not that important (for XER.SO).
however the code itself has this capability (to "drop" write-port-requests, as determined based on information discovered **AFTER** the instruction decode phase), which will become critically important later on when predication is added. at that point it will matter a hell of a lot, hence why i went to the trouble of putting the infrastructure in place even at this early stage, because it will be too damn difficult to add later [the "ok" flags on ALU regspecs, which is part of MultiCompUnit and part of the entire pipeline data specifications].

and it's not, honestly, that difficult to detect [but it is a lot of design work right throughout the entire pipelines]. this is the one line of code needed to identify when the condition occurs:

 741         with m.If(fu.alu_done_o & latch_wrflag & ~fu_wrok):

that's:

* "the ALU is done, i.e. it is requesting that its output (which will include XER.SO) be 'latched'"
* at that exact moment, the dest "ok" flag may (or may not) be set
* latch_wrflag captures those registers that were requested back at ISSUE time

the combination of these tells you which registers were REQUESTED to be written, but where the pipeline is telling you "NO, i do NOT need to write to them". because the pipeline is saying "no write needed", there is never going to be a corresponding write-request to the regfile, and consequently the Write-Hazard may be dropped.

if it isn't dropped, all hell breaks loose: the entire Engine will lock up permanently, because the write-port was requested at ISSUE time but is never cleared.

it's solved with a one-line test [but it required a LOT of careful advance planning to get to that point, and a hell of a lot of work on MultiCompUnit]

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;h=bb7f8ce9e9b7b8454bc583b7fa2363f99c6e62a7;hb=56d6f9114733a20015df85da59c5d2ce694a465b#l731