Bug 352 - virtual (dependency-tracked) regfile (cache) needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Mac OS
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 383 345
Reported: 2020-05-25 15:23 BST by Luke Kenneth Casson Leighton
Modified: 2020-10-31 12:27 GMT

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-05-25 15:23:26 BST
i was hoping this would not be needed at this stage, however it looks like it is.

there are over 100 active SPRs in POWER9.   we cannot possibly have 
a Dependency Matrix 100-wide for that.

therefore, option 1:

* restrict the supported SPRs to a manageable subset (and remap them)

option 2:

* have a "virtual" (dependency-tracked) front-end on the Regfile.

traditional names for this are "PRF-ARF" - physical regfile / architectural
regfile.

normally, the numbering for regfiles would have fixed indexes.  here, instead,
we need *dynamically-allocated* re-mapped indexing, plus a tracking system
which will be based on a pair of "bitvectors", generated and managed by the
FU-REGs Dependency Matrix: global_read_pending and global_write_pending

on first allocation, any bit position that is free in both g_r and g_w
indicates that that "slot" is available.  the association between that slot
and the "real" regfile index (binary? unary?) will be recorded and kept.

the association will only be dropped when both the g_r and g_w pending
bitvectors clear their bits corresponding to that PRF-ARF association.

if there are no free bits, the entire execution engine - further allocation
of instructions to Function Units - is REQUIRED to stall.
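
a minimal python sketch of the allocation rule described above, purely as an illustration and not the Libre-SOC implementation.  the class name VirtualRegAllocator, its methods, and the dict used for the slot-to-regfile association are all assumptions made for this example; only the g_r / g_w bitvectors, the "both bits clear" release rule and the stall rule come from the description.

```python
class VirtualRegAllocator:
    """Toy model of PRF-ARF slot allocation tracked by two pending bitvectors."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.g_r = 0           # global_read_pending, one bit per slot
        self.g_w = 0           # global_write_pending, one bit per slot
        self.slot_to_reg = {}  # slot -> "real" regfile index (hypothetical store)

    def lookup(self, real_reg):
        """Return the slot currently associated with real_reg, or None."""
        for slot, reg in self.slot_to_reg.items():
            if reg == real_reg:
                return slot
        return None

    def allocate(self, real_reg, read=False, write=False):
        """Return the slot for real_reg, or None if issue must stall."""
        if not (read or write):
            raise ValueError("an allocation must set a read or a write pending bit")
        slot = self.lookup(real_reg)
        if slot is None:
            busy = self.g_r | self.g_w
            free = [s for s in range(self.n_slots) if not (busy >> s) & 1]
            if not free:
                return None    # no free bits: the execution engine must stall
            slot = free[0]
            self.slot_to_reg[slot] = real_reg
        if read:
            self.g_r |= 1 << slot
        if write:
            self.g_w |= 1 << slot
        return slot

    def clear(self, slot, read=False, write=False):
        """Clear pending bits; drop the association once both are clear."""
        if read:
            self.g_r &= ~(1 << slot)
        if write:
            self.g_w &= ~(1 << slot)
        if not ((self.g_r | self.g_w) >> slot) & 1:
            self.slot_to_reg.pop(slot, None)
```

in this sketch a caller would treat a None return from allocate() as "stall issue", and retry once clear() has released a slot.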
Comment 1 Luke Kenneth Casson Leighton 2020-10-28 12:57:20 GMT
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-October/001069.html

further discussion leading back here: we may need to introduce register-renaming into the register cache.

the original scheme (comment #0) is the "green" matrix here:
https://libre-soc.org/3d_gpu/reorder_alias_bytemask_scheme.jpg

that green table has "real" registers as columns and "virtual" registers as rows

however if we are to introduce register-renaming the tables become more complex and would need to be thought through.  perhaps an actual L1 cache?  don't know.  needs thought.
Comment 2 Luke Kenneth Casson Leighton 2020-10-28 15:07:03 GMT
just to note jacob's observation that expecting compilers to perform reg-rename "avoidance" isn't normally done or expected.  therefore we do need to solve this in hardware: it is a matter of when (which phase).  the stage-simulation will help us to evaluate the impact, there. bug #523
Comment 3 Luke Kenneth Casson Leighton 2020-10-28 15:29:59 GMT
the scope here, the number and type of regfiles, is going to be quite large.

* 128 64-bit INTs
* 128 64-bit FPs
* 8 (64? 128?) 4-bit CRs
* 8 "FAST" SPRs (TAR, CTR etc)

these are the major ones.  XER (only 3 regs), the SPRs, and potentially even the FAST regs (TAR, CTR, LR, SRR1/0) do not really need to be virtualised.  SPRs (except FAST ones) are uncommon and do not interact with other regfiles in ways that need virtualisation.

a case could be made however to virtualise (and rename) the FAST regs given how jacob showed that CTR is for example critical to loops (bcctr).

the number of INTs and FPs alone (128 + 128 = 256) already needs 8 bits, and adding the FAST regs to the same address space pushes it to an enormous *9* bits!  most architectures only use 5!

CRs being only 4 bit could be considered separate and given their own cache.  however INT and FP and probably FAST seem like they should all be in the same cache/renaming, particularly as they are all 64 bit.


on the other side of the equation is the number of virtual regs, which will define (match exactly) the column width of the FU-REGs DM.  32 may be too small (run out), 64 is getting to the point where the size of the FU-REGs DM is a little large.

note that even the original 6600 had 24 registers (A, B, X) although it was a sparse FU-REGs matrix because not all opcodes used all 3 regtypes.


so this defines the scope:

* 9 bit addressing of physical regs
* 6 bit addressing of virtual regs

this is however *without* register renaming which effectively introduces a minimum of 1 extra bit and potentially as high as 5 extra bits on the physical address space side
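
the address widths above can be checked with a few lines of python.  this is only arithmetic on the register counts listed in this comment; the helper name addr_bits and the dict layout are assumptions for illustration, and the CR count is left out because it is still an open question above (8? 64? 128?).  note that INT + FP alone is 256 entries (8 bits); the 9-bit figure is reached once the FAST regs share the same address space.

```python
def addr_bits(n_regs):
    """Bits needed for a binary index over n_regs registers."""
    return max(1, (n_regs - 1).bit_length())

# register counts taken from the list in this comment
physical = {"INT": 128, "FP": 128, "FAST": 8}

int_fp = physical["INT"] + physical["FP"]
shared = int_fp + physical["FAST"]   # INT + FP + FAST in one cache

phys_bits = addr_bits(shared)        # physical side of the cache
virt_bits = addr_bits(64)            # 64 virtual regs (FU-REGs DM columns)

# renaming adds between 1 and 5 extra bits on the physical side
with_rename = (phys_bits + 1, phys_bits + 5)

print(addr_bits(int_fp), phys_bits, virt_bits, with_rename)
```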
Comment 4 Luke Kenneth Casson Leighton 2020-10-28 16:21:47 GMT
https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM

need help, asked on comp.arch :)
Comment 5 Jacob Lifshay 2020-10-28 16:49:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #4)
> https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM
> 
> need help, asked on comp.arch :)

Google finally fixed Google Groups so it's archivable on archive.org! they used to have the url be something like groups.google.com/#comp.arch... and archive.org wasn't able to figure out all the javascript used to load it.
Comment 6 Luke Kenneth Casson Leighton 2020-10-30 22:24:31 GMT
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM
> > 
> > need help, asked on comp.arch :)
> 
> Google finally fixed Google Groups so it's archivable on archive.org!

about time, it's big enough and significant enough, good for them.

so after going round the houses a bit on comp.arch it looks like a counter and/or bitfield per reg on WaW is the way to go.  as it is r9-r9 or r3-r3 the number of columns i *think* only need be 1 i.e. it is a WaW vector more than it is a matrix.

allocating an entirely new virtual row in the FU-REGs per newly-discovered WaW conflict gives us a safe rollback/shadow-commit even when faced with WaW hazards.

the FU-REGs itself stores all Write Hazards and by ORing all write hazards together down the columns we get a "Write Hazard Vector".

this can be ANDed with our instruction's write reg target to tell us that a WaW exists.

normally (in the 6600) this would be an indication to issue to "stall".  we instead use this to increment that reg's "rename" counter.
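
a rough python sketch of the WaW detection step just described, under simplifying assumptions: the FU-REGs write side is modelled as one integer bitmask per Function Unit, columns are indexed directly by register number, and rename_count is a plain list.  the function names are hypothetical and this is not the actual FU-REGs code.

```python
def write_hazard_vector(fu_regs_write):
    """OR all write hazards together down the columns (one bit per reg)."""
    vec = 0
    for row in fu_regs_write:      # one write-pending bitmask per Function Unit
        vec |= row
    return vec


def issue_write(fu_regs_write, fu, dest_reg, rename_count):
    """Issue a write from `fu` to `dest_reg`; return True if a WaW was seen.

    where the 6600 would stall issue on WaW, this sketch instead
    increments the destination register's rename counter.
    """
    dest_mask = 1 << dest_reg
    waw = bool(write_hazard_vector(fu_regs_write) & dest_mask)
    if waw:
        rename_count[dest_reg] += 1
    fu_regs_write[fu] |= dest_mask  # record the new write hazard
    return waw
```

for example, two writes to r9 from different Function Units would leave rename_count[9] at 1, while a single write to r3 leaves rename_count[3] at 0.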

this just leaves a multi-issue-capable method of transmitting and marshalling read/write regfile ports and addresses.  which is the bit i am still going "hmmm" on.
Comment 7 Luke Kenneth Casson Leighton 2020-10-30 22:25:37 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)

> this just leaves a multi-issue-capable method of transmitting and
> marshalling read/write regfile ports and addresses.  which is the bit i am
> still going "hmmm" on.

(taking into account the virtual redirection including renaming)
Comment 8 Luke Kenneth Casson Leighton 2020-10-31 02:30:33 GMT
so imagine a matrix, ISAregs on the rows, virtual regs on the columns.

on each ISAreg at the left is a bitfield which is our "active renamers".  if it is 0b0000, that also tells us that the ISAreg is not in use.

along the top is another bitfield vector telling us which virtual regs are in use

in every cell is a Latch which says whether redirection from ISAreg A is active to Virtual reg B.

then, for each "issue" (1, 2, 3, 4) in multi-issue, independent column *and* row activation wires are required that will allow the Latch to be set.  if there is to be 4-way multi-issue, 4 sets of independent row-column grid wires are needed.

then (and the above grid wires can probably be shared with this) a way to pass through information about the regfile ports is needed.  this is the whole purpose: that the virtual regs get redirected to real ones.

giving each regfile port a binary ID (read and write) the ID can be passed up from the bottom, along with a request, "please redirect to real".

on reaching the cell with the active latch the entire ID on that column is passed to the *row* wires using simple AND and OR gates.

the ID is received by the regfile to tell it which port to activate.

the data will then be read/written on the appropriate numbered regfile broadcast bus.

deactivation is done by column, matching the FUREGs reset, and because there is only one latch ever set per column (and row) a grid reset is not needed, just column wires.
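
a small behavioural python model of the lookup grid described in this comment, offered only as a sketch.  the class name RedirectMatrix and its method names are assumptions; the "active renamers" bitfield is not modelled, port IDs are assumed to be non-zero so that 0 can mean "no request on this row", and multi-issue (several independent sets of grid wires) is left out.

```python
class RedirectMatrix:
    """ISAregs on the rows, virtual regs on the columns, one latch per cell."""

    def __init__(self, n_isa, n_virt):
        self.n_isa = n_isa
        self.n_virt = n_virt
        self.latch = [[False] * n_virt for _ in range(n_isa)]

    def virt_in_use(self):
        """The vector along the top: which virtual regs are in use."""
        return [any(self.latch[r][v] for r in range(self.n_isa))
                for v in range(self.n_virt)]

    def set_latch(self, isa, virt):
        """Activate redirection between ISAreg `isa` and virtual reg `virt`."""
        if self.virt_in_use()[virt]:
            raise ValueError("only one latch may be set per column")
        self.latch[isa][virt] = True

    def reset_column(self, virt):
        """Column-only reset, matching the FU-REGs reset."""
        for row in self.latch:
            row[virt] = False

    def redirect(self, virt, port_id):
        """Pass a regfile port ID up column `virt` and out onto the row wires."""
        rows = [0] * self.n_isa
        for isa in range(self.n_isa):
            if self.latch[isa][virt]:   # AND: latch gates the ID
                rows[isa] |= port_id    # OR: ID merged onto the row wires
        return rows
```

the list returned by redirect() stands in for the row wires: the regfile would read the non-zero entry to learn which port to activate for that real register.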
Comment 9 Luke Kenneth Casson Leighton 2020-10-31 12:27:28 GMT
https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM/m/KerzlHs0BgAJ
https://libre-soc.org/3d_gpu/isa_to_virtual_regs/

drew it out this morning (ish), this is just the lookup/translation,
have to do the reg-rename logic separately which will "drive" the set/reset
lines.

hilariously whilst we are looking to reduce the number of virtual regs
when it comes to doing Monster CPUs there would actually be way more
virtual regs than real ones (a la POWER10 which can have 1,000 instructions
in-flight and is 8-way multi-issue)