I thought I'd write this down before I forget: I was thinking about a hypothetical cpu design that has register renaming and 6600-derived dependency matrices, and I realized that if, instead of having a separate register file, we simply have one register per FU (the register that FU's corresponding instruction writes to) -- so that the FUs *are* the register file -- it totally eliminates the need for the FU-Regs dependency matrix, leaving only the FU-FU matrix.

register renaming can take care of allocating FUs that are not in use, at the point where physical registers would normally be allocated when renaming newly decoded instructions.
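a minimal python sketch of what i mean (all names here are invented for illustration, this is not code from power-cpu-sim or soc.git): the rename map points ISA-level registers directly at FU output slots, so "allocating a physical register" and "allocating an FU" become the same operation:

# hypothetical sketch: FUs-as-register-file renaming.
# none of these names exist in soc.git; they are illustrative only.

class FU:
    def __init__(self, idx):
        self.idx = idx
        self.busy = False      # FU still computing
        self.result = None     # the FU's single output register

class RenameMap:
    def __init__(self, n_isa_regs, fus):
        self.fus = fus
        # ISA register -> FU whose output register holds its current value
        self.map = [None] * n_isa_regs

    def rename(self, insn):
        # read operands come straight from other FUs' output registers,
        # which is exactly what the FU-FU matrix would track
        srcs = [self.map[r] for r in insn.reads]
        # allocating the destination = picking a free FU: there is no
        # separate physical register file, hence no FU-Regs matrix needed
        fu = next((f for f in self.fus if not f.busy), None)
        if fu is None:
            return None            # stall: no free FU
        fu.busy = True
        self.map[insn.writes] = fu
        return fu, srcs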
two parts:

part 1
------

the architectural design for register renaming is already done, following a discussion last year with Mitch Alsup on comp.arch. this is the table for the port management: https://libre-soc.org/3d_gpu/isa_to_virtual_regs/

requests for access to read/write a register file port come in on the columns (bottom), and go out on the rows (left).

also required for Write-after-Write (which is basically what register renaming is) is a pair of FU-Regs and FU-FU Matrices, adapted to work "in reverse" (see comp.arch discussion at the url), with the names "FU" replaced by "VirtualReg" and "Regs" replaced by "Physical", i.e. the Matrices become named "VirtualReg-PhysicalReg" and "VirtualReg-VirtualReg".

the matrices are not quite exactly the same as FU-Regs and FU-FU: the driving force (decision logic) goes in the *opposite* direction. i don't recall the exact details of the comp.arch discussion, it was 14 months ago. looks like this was it: https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM/m/w8jAF56fBgAJ

ah, it was the Shadowing. the Shadowing operates in the opposite direction.

part 2
------

you *may* be referring to reducing the size of the FU-Regs Matrices. a technique there is, instead of one FU-Regs column per register, to map multiple registers down to one column. the most ridiculous but perfectly legitimate version of that is to map *all* registers down to a single hazard bit per read and per write port.

for the SPR regfile this may be directly necessary *right now*, because there are over 100 entries and it is flat-out completely impractical to have a 100-wide SPR regfile hazard bitvector. the upshot of such a drastic decision is that *only one instruction at a time may be in-flight* which wishes to write to the entire SPR regfile, and in a simple design that may be exactly what is needed when writing to the SPR regfile.

less draconian decisions can also be made by allowing certain SPRs to have their own dedicated Hazard column, such as for the Galois Field instruction that sets the modulo. this is already partly done in the form of splitting up the XER SPR, which has a very special 3-entry 2-bit regfile (SO, CA, OV, where the top - 2nd - bit of SO is ignored), and consequently it has its own 3-wide Hazard Protection.

bottom line is that you do not have to think of the mapping of hazard protection as being *exactly* one-to-one and onto registers: FU-Regs protects *contended resources*. defining what those "contended resources" are is entirely up to you. (a small illustrative sketch of such a mapping is at the end of this comment.)

part 3
------

combining those two: i believe what you are saying is that you are thinking in terms of a dedicated FU (or, more accurately, dedicated Reservation Station) per "Virtually-Allocated-Register". in other words, in OoO Tomasulo terminology, it would be the "Reorder Buffer" (ROB) row - the ROB in-flight row number.

this is definitely WaW and definitely requires a FU-Regs + FU-FU pair (renamed to VirtRegs-RealRegs / Virt-Virt, or better ROBInFlightRow-RealRegs / ROBInFlightRow-ROBInFlightRow) or, if either single-issue, or absolute hell-on-earth multi-ported lookup, or severe limitations on the routing are tolerable, a CAM.
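to make the part-2 column-sharing idea concrete, here is a minimal python sketch (illustrative only, nothing to do with the actual soc.git scoreboard code): a mapping function decides which hazard column a given (regfile, register) lands on, so the whole SPR regfile can collapse to a single column while a few special resources keep their own.

# hypothetical sketch of "contended resource" mapping for FU-Regs columns.
# names and column numbers are invented, not taken from soc.git.

# dedicated hazard columns for special-cased resources
DEDICATED = {
    ('XER', 'so'): 0,
    ('XER', 'ca'): 1,   # covers CA/CA32
    ('XER', 'ov'): 2,   # covers OV/OV32
}
SPR_SHARED_COLUMN = 3   # every other SPR shares this single column
INT_BASE = 4            # INT r0..r31 get one column each: 4..35

def hazard_column(regfile, reg):
    """map a (regfile, register) pair onto a FU-Regs hazard column."""
    if (regfile, reg) in DEDICATED:
        return DEDICATED[(regfile, reg)]
    if regfile == 'SPR':
        # all ~110 SPRs collapse to one bit: only one in-flight SPR write
        return SPR_SHARED_COLUMN
    if regfile == 'INT':
        return INT_BASE + reg   # reg is the GPR number here
    raise ValueError(regfile)

# e.g. hazard_column('SPR', 'DEC') == hazard_column('SPR', 'SRR0') == 3,
# so two in-flight SPR writers conflict (conservatively, but correctly).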
ok, i thought a bit more, and i don't believe this idea will be useful in exactly the way that it is originally envisaged (if i understand it correctly), and it's quite simple: it would only be useful for instructions that have one read (and no writes) or one write (and no reads). direct-and-only association of FU-with-a-reg is a severe restriction.

at present there are some FUs with *six* read registers. actually, seven:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/pipe_data.py;hb=HEAD

 10     regspec = [('INT', 'ra', '0:63'),      # 64 bit range
 11                ('INT', 'rb', '0:63'),      # 64 bit range
 12                ('CR', 'full_cr', '0:31'),  # 32 bit range
 13                ('CR', 'cr_a', '0:3'),      # 4 bit range
 14                ('CR', 'cr_b', '0:3'),      # 4 bit range
 15                ('CR', 'cr_c', '0:3')]      # 4 bit: for CR_OP partial update

(there are also 3 output regfile ports on CR: this allows transfer in from RA to CR and from CR to RT)

the SPR regspecs are six in and six out:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/spr/pipe_data.py;hb=HEAD

 19     regspec = [('INT', 'ra', '0:63'),       # RA
 20                ('SPR', 'spr1', '0:63'),     # SPR (slow)
 21                ('FAST', 'fast1', '0:63'),   # SPR (fast: LR, CTR etc)
 22                ('XER', 'xer_so', '32'),     # XER bit 32: SO
 23                ('XER', 'xer_ov', '33,44'),  # XER bit 34/45: CA/CA32
 24                ('XER', 'xer_ca', '34,45')]  # bit0: ov, bit1: ov32

the reason why there are six is because mtspr needs to construct (or change) the three XER registers.

there are *six* separate regfiles, a seventh is necessary for the FPR, and an eighth is under consideration (MSR bits, similar to XER bits) to speed up context-switching (writing MSR.pr as its own bit instead of to the entire MSR) and other operations.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD

  6     * INT regfile   - 32x 64-bit
  7     * SPR regfile   - 110x 64-bit
  8     * CR regfile    - CR0-7
  9     * XER regfile   - XER.so, XER.ca/ca32, XER.ov/ov32
 10     * FAST regfile  - CTR, LR, TAR, SRR1, SRR2
 11     * STATE regfile - PC, MSR, (SimpleV VL later)

given the heavy level of inter-connectedness across each pipeline between different regfile types, it is simply not practical to consider allocating one entire regfile for the entire lot, in order to have a single FU deal with a single register.

all the ALUs use (update) XER and CR. even LD/ST requires the addition of XER.SO and CR (which will be... complicated):

 23     regspec = [('INT', 'o', '0:63'),   # RT
 24                ('INT', 'o1', '0:63'),  # RA (effective address, update)
 25                # TODO, later ('CR', 'cr_a', '0:3'),
 26                # TODO, later ('XER', 'xer_so', '32')

this is to be able to support the arithmetic ld/st (and cix) instructions properly, which set CR0 and therefore also set XER.SO, which i can tell you will be a royal pain given that LDSTCompUnit is a special (non-standard) FU.

bottom line is that due to the supercomputing nature of the Power ISA, the idea of dedicating one single virtual in-flight regfile entry to one single Function Unit is a non-starter.
This idea is intended for a cpu where all micro-ops only write to one register each... it can be extended by having multiple output registers per FU (assuming all outputs are written simultaneously) -- equivalent to one wide output register with fields for several ISA-level registers' values -- e.g. add. would have an output field for RT and for CR0.

The idea uses a mostly traditional register renaming scheme where ISA-level registers are renamed into "physical registers" or "hardware registers" (traditional names -- these are associated 1:1 with FUs' output registers in this idea) as a pipeline stage in the fetch-decode pipe, immediately after decode and immediately before dispatching instructions to FUs.

For an example of what I mean by traditional register renaming, see: https://ftp.libre-soc.org/power-cpu-sim/ (requires javascript and a modern browser) but mentally add a dispatch stage after renaming. Code: https://salsa.debian.org/Kazan-team/power-cpu-sim/-/tree/10b113faab52890dd77809096d5a664ece6b069e

For cases such as:

add r0, ...
add r1, ...
add r2, ...
...
add r31, ...

where you likely need more physical registers than the number of FUs for any one ALU: we could add a separate ALU that just copies the input value into cold-storage (aka a move-only ALU) -- this'd be the only ALU that breaks the 1 FU / 1 physical register rule -- it would have enough registers for all ISA-level registers to fit, but only a few FUs.

the register rename stage would insert copy ops before reallocating a physical register/FU (belonging to a normal ALU pipe) if it was still used by any ISA-level registers -- the rename stage would stall until nothing was reading the to-be-reallocated FU's output and until that FU had computed its output (a stall happens anyway when all FUs for an ALU are still computing). a rough sketch of that rename/spill loop is below.
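rough python sketch of that rename-with-spill behaviour (purely illustrative: the names here are invented, this is not the power-cpu-sim code):

# hypothetical sketch: rename stage that spills to a move-only ALU before
# reusing an FU's output register. all names invented for illustration.

class FU:
    def __init__(self):
        self.busy = False     # still computing
        self.readers = 0      # in-flight ops reading this output register

STALL, OK = "stall", "ok"

def rename_and_dispatch(insn, rename_map, alu_fus, cold_storage, copy_queue):
    """insn: has .dest (ISA reg number).
    rename_map: ISA reg -> FU (or cold-storage slot) holding its value.
    cold_storage: the move-only ALU, one slot per ISA-level register."""
    fu = next((f for f in alu_fus if not f.busy), None)
    if fu is None:
        return STALL                    # all FUs for this ALU still computing

    # if any ISA-level register still lives in this FU's output register,
    # a copy op must be inserted first, moving it into cold storage
    for isa_reg, holder in list(rename_map.items()):
        if holder is fu:
            if fu.readers:
                return STALL            # someone still reads it: wait
            copy_queue.append((fu, cold_storage[isa_reg]))   # "mv" micro-op
            rename_map[isa_reg] = cold_storage[isa_reg]

    # the FU's output register is now genuinely free: it *is* the renamed dest
    rename_map[insn.dest] = fu
    fu.busy = True
    return OK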
(In reply to Luke Kenneth Casson Leighton from comment #2)
> ok, i thought a bit more, and i don't believe this idea will be useful
> in exactly the way that it is originally envisaged (if i understand it
> correctly), and it's quite simple: it would only be useful for instructions
> that have one read (and no writes) or one write (and no reads).

umm, can't a FU simultaneously depend on the outputs of several other FUs? you seem to have forgotten this...
(In reply to Jacob Lifshay from comment #3)
> This idea is intended for a cpu where all micro-ops only write to one
> register each...

that's six separate Function Units in some cases for the Power ISA. Load/Store would become five Function Units. ShiftRot would be three. Condition Register CR-ops would become three.

remember that if you *don't* allocate enough FUs, the only option is to stall. so although the CR0 FU could be shared between different FUs, there have to be enough to hold the entire set of in-flight Reservations expected.

a large high-end (3+ ghz) multi-issue (8-issue) system normally has a THOUSAND instructions in-flight at any one time. you're talking about splitting each one up into between three to *six* operations, which would be six **THOUSAND** Function Units with in-flight Reservations.

> e.g. add. would have an output field for RT and for CR0.

and another for XER.SO and another for XER.CA and another for XER.OV. that's five, not two.

yes, some instructions will not set XER.CA, or not set XER.OV, or not set XER.SO: this is determined by the output itself, by the pipeline itself. the Reservation unfortunately still has to be made, because the Function Unit *might* write.

soc/fu/alu/output_stage.py:

 30         comb += oe.eq(op.oe.oe & op.oe.ok)
 31         with m.If(oe):
 32             # XXX see https://bugs.libre-soc.org/show_bug.cgi?id=319#c5
 33             comb += xer_so_o.data.eq(xer_so_i[0] | xer_ov_i[0]) # SO
 34             comb += xer_so_o.ok.eq(1)

this logic - tiny as it is - would need to move to an entirely separate Function Unit. the subsequent lines to another separate Function Unit:

 35             comb += xer_ov_o.data.eq(xer_ov_i)
 36             comb += xer_ov_o.ok.eq(1) # OV/32 is to be set

that's for every ALU that has XER.SO/OV/CA, and there are several. i think you'll find that this results in an alarmingly-high number of Reservation Stations and consequently absolutely massive Dependency Matrices.

(In reply to Jacob Lifshay from comment #4)
> umm, can't a FU simultaneously depend on the outputs of several other FUs?
> you seem to have forgotten this...

through the FU-FU Matrix, yes. and it's multi-read as well as multi-write capable:

https://libre-soc.org/3d_gpu/fu_dep_cell_multi_6600.jpg

here's the source code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD

it outputs readable_o and writeable_o, both of which are true (on a per-FU basis) if there are no remaining write hazards (for readable_o) or no remaining read hazards (for writeable_o) respectively.

* read-hazards are tracked per src operand in the FU. for ALU that is:
  - RA
  - RB
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV

* write-hazards are tracked per dest operand in the FU. for ALU that is:
  - RT
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV

the latches go HI at Issue time and remain HI until the Great-Big-OR-Gate for Read-Reg-Deps and Write-Reg-Deps of the corresponding FU-Regs Row says that all Read-deps are cleared or all Write-deps are cleared... *on a per-port* basis.

you can see from the diagram that on the READ side there is an OR gate per FU-FU cell. only when every source register latch in the FU-FU cell is cleared will the WRITE-WAIT signal go HI, indicating that this FU no longer blocks any other FUs from WRITING.

likewise for the WRITE side there is a corresponding OR gate to create a READ-WAIT signal, which only goes HI when all dest SRC latches (RT, CR0, XER.SO, XER.CA, XER.OV) go LOW, indicating that this FU no longer blocks any other FUs from READING.
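a rough behavioural (software) python model of that per-FU readable_o / writeable_o OR-reduction - purely illustrative, this is *not* the nmigen code in fu_fu_matrix.py, and the names are invented:

# per-src latch goes HI while a prior FU still has to WRITE that operand;
# per-dest latch goes HI while another FU still has to READ that operand.

class FUHazards:
    def __init__(self, srcs, dests):
        self.write_hazard = {s: False for s in srcs}    # e.g. RA, RB, XER.SO...
        self.read_hazard = {d: False for d in dests}    # e.g. RT, CR0, XER.SO...

    def issue(self, pending_writers, pending_readers):
        # latches go HI at Issue time, per operand
        for s in pending_writers: self.write_hazard[s] = True
        for d in pending_readers: self.read_hazard[d] = True

    def writer_done(self, src):  self.write_hazard[src] = False
    def reader_done(self, dest): self.read_hazard[dest] = False

    @property
    def readable_o(self):
        # true when *no remaining write hazards* on any src operand:
        # the per-cell OR gate over the src latches, inverted
        return not any(self.write_hazard.values())

    @property
    def writeable_o(self):
        # true when *no remaining read hazards* on any dest operand
        return not any(self.read_hazard.values())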
reducing that down to one write per FU *does not* make the need to actually track that write go away. all it does is move that need somewhere else. so where at the moment the FU-FU DM can track N write-dependencies per FU, you are talking about having N times more FUs, each with only single write-dep tracking (read-src tracking is still required).

also it does not take away the need for the READ-side tracking, which must still be duplicated across all those FUs. and given that FU-FU is an O(N^2) resource, the effect on gate count could be catastrophically high (several million gates - rough illustrative numbers at the end of this comment).

there's something else that i can't quite put my finger on that's making me... nervous / twitchy. it could just be the numbers involved (the number of RSes). given that it took four *months* for me to implement Mitch Alsup's 2nd book chapter idea, and that when we went over it we found the idea of replacing the FU-FU Matrix with a bitvector was flawed, the idea of altering such a critical low-level algorithm - when even Mitch could have got it wrong, and given how long it took to find that out - makes me quite nervous, mainly because of the amount of time it takes to properly evaluate these things.
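to put rough numbers on that O(N^2) point (back-of-envelope only: the per-cell gate cost and FU counts below are invented placeholders, not measured figures from soc.git):

GATES_PER_CELL = 20          # assumed rough cost of one FU-FU cell

def fu_fu_gates(n_fus):
    # one cell per FU pair, so the matrix grows quadratically with FU count
    return n_fus * n_fus * GATES_PER_CELL

base = fu_fu_gates(100)      # e.g. 100 in-flight Reservations
split = fu_fu_gates(600)     # same workload split into ~6 single-write FUs

print(base, split, split / base)
# 200000 7200000 36.0  -> 6x more FUs is ~36x the FU-FU Matrix gates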
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #3)
> > This idea is intended for a cpu where all micro-ops only write to one
> > register each...

I think you're misunderstanding still: the output can have multiple fields in *one* FU's output register:

+--------------+
| FU           |
|              |
|  output reg  |
| +----+-----+ |
| | RT | CR0 | |
| +----+-----+ |
|              |
+--------------+

the register renaming unit would tell the FUs that read from that output register which field to read, if it isn't already obvious from the type of input for the instruction. e.g.:

addi. r1, r3, 45
sub r4, r5, r1

the sub is told by the register renamer that its RB input comes from the addi.'s RT output field in the addi.'s FU's output register (which holds both r1 and cr0).
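in (hypothetical, invented-names) python terms, the rename map just stores a (FU, field) pair per ISA-level register instead of a plain physical register number - this is not power-cpu-sim or soc.git code, just a sketch:

rename_map = {}   # ISA-level reg name -> (fu_id, field)

def rename_writes(insn_outputs, fu_id):
    # e.g. addi. writes RT and CR0: both live in the same FU output
    # register, just in different fields
    for isa_reg, field in insn_outputs:
        rename_map[isa_reg] = (fu_id, field)

def rename_reads(insn_inputs):
    # each source resolves to "read field F of FU N's output register";
    # regs not in the map come from the architectural/cold-storage file
    return [rename_map.get(r, ("archfile", r)) for r in insn_inputs]

# addi. r1, r3, 45   -> allocated to (say) FU 7
rename_writes([("r1", "RT"), ("cr0", "CR0")], fu_id=7)
# sub r4, r5, r1     -> its RB operand reads FU 7's RT field
print(rename_reads(["r5", "r1"]))   # [('archfile', 'r5'), (7, 'RT')]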
one part that is very useful (or required) is for the decoder to determine exactly which registers each instruction reads/writes, instead of what we do now (which I think is a major design flaw), where nothing knows exactly which registers will be written until the ALU finishes computing and sets the output valid flags.
(In reply to Jacob Lifshay from comment #6)
> I think you're misunderstanding still: the output can have multiple
> fields in *one* FU's output register:
>
> +--------------+
> | FU           |
> |              |
> |  output reg  |
> | +----+-----+ |
> | | RT | CR0 | |
> | +----+-----+ |
> |              |
> +--------------+

[that's five output registers, not two]

this creates false (unnecessary) dependencies that in turn put pressure on the read/write ports of the regfile and/or on the number of FUs required to hold in-flight data, due to missed write opportunities.

it is one of the (not many) flaws of the original 6600 design that there is only one GO_READ per Computation Unit. in the original 6600 design, because there is only one GO_READ flag, the Function Unit may *only* read from *all* regfile ports when *all* regfile ports are cleared.

in the design that i have implemented, there is one GO_READ per regfile port and one GO_WRITE per regfile port. NOT one GO_READ per *Function Unit* and one GO_WRITE per Function Unit.

without this, even if RT is free to write, the Function Unit is FORCED to wait until the Condition Register file is also free to write (and the other way round, as well). this again puts pressure on the number of Function Units required, because the in-flight data is now waiting around for much longer than is necessary.

[any time that ONLY INT is available for write is a missed opportunity. any time that ONLY CR is available for write is a missed opportunity. BOTH INT *AND* CR must be free and clear]

the original 6600 compensated for this by having a whopping five read ports and two write ports on the "A" regfile (i think), which is absolutely enormous for the time. either way there is pressure on either:

* the number of regfile ports required
* the number of Function Units required, to hold in-flight data

also the effectiveness of Operand-Forwarding may be adversely affected, but that's a separate analysis.
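a tiny behavioural illustration of that per-port point (invented names, not the actual MultiCompUnit logic): with a single GO_WRITE the FU waits for *all* ports, with per-port GO_WRITE each result leaves as soon as its own port is free:

def can_write_single_go(ports_free):
    # one GO_WRITE per Function Unit: every requested port must be free
    # before *anything* is written (the 6600-style single-flag behaviour)
    return all(ports_free.values())

def writable_per_port(ports_free):
    # one GO_WRITE per regfile port: each result leaves independently,
    # freeing that part of the FU's in-flight data earlier
    return {port: free for port, free in ports_free.items() if free}

state = {"INT.RT": True, "CR.CR0": False, "XER.SO": True}
print(can_write_single_go(state))   # False: forced to wait for CR
print(writable_per_port(state))     # {'INT.RT': True, 'XER.SO': True}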
(In reply to Jacob Lifshay from comment #7)
> one part that is very useful or required is for the decoder to determine
> exactly which registers each instruction reads/writes, instead of doing what
> we do (which I think is a major design flaw) where nothing knows exactly
> what registers will be written or not until the ALU finishes computing and
> sets the output valid flags.

you've completely misunderstood. the decoder knows perfectly well which registers are read and written: it has to. i've just spent the past three weeks making sure that it does. here is the source code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/power_regspec_map.py;hb=HEAD

what you have completely missed is that the logic that was added right back at the start - two years ago - is an opportunity for optimisation. what you are describing as a "major design flaw" is a perspective that is total nonsense. let's go through it.

no FU can ever be permitted to operate without its write-hazards being monitored. therefore even if a Function Unit only **MIGHT** write, the write hazard **MUST** - repeat **MUST** - STILL BE REQUESTED. this is blindingly obvious up to this point: failure to note any hazards results in catastrophic unrecoverable data corruption.

now, it so happens that only the Function Unit itself can determine whether things such as XER.SO actually need to be written, because writing to XER.SO is determined from the *input*, which is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is actually issued. therefore we are FORCED to make that write-reservation JUST IN CASE.

ONLY once the 64-bit result has been computed can the *PIPELINE* determine - at that point and at that point only - whether XER.SO needs to be written to, or whether it does not. and if it does not, then this is great! the write-reservation can be dropped!

what that in turn means is that:

* a completely redundant (unnecessary) write to a regfile port is dropped
* register file port pressure is consequently reduced
* the Function Unit may complete EARLY, which in turn REDUCES the pressure on Function Unit Reservations
* with the reduced pressure on Function Unit Reservations comes, in turn, the possibility of an Order(N^2) reduction in gate count of the Dependency Matrices, due to a reduction in the number of FUs

overall it is a SIGNIFICANT RESOURCE SAVING. it is complete utter nonsense to categorise such resource savings as "major design flaws". this is just part of how the Power ISA works, and it is slightly alarming to me that after two years you are not familiar with these subtleties.
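in deliberately simplified python form (invented names, not the soc.git core.py logic), the lifecycle described above is: reserve at issue, then either honour or drop the write at completion, depending on what the pipeline reports:

class WriteReservations:
    def __init__(self):
        self.pending = set()          # (fu, reg) write-hazards currently held

    def issue(self, fu, decoded_write_regs):
        # the decoder only knows which regs MIGHT be written, so a
        # write-reservation is made for every one of them, just in case
        for reg in decoded_write_regs:
            self.pending.add((fu, reg))

    def complete(self, fu, pipeline_outputs):
        # only now, with the result computed, do the per-output "ok" flags
        # say which writes are actually needed
        writes = []
        for reg, (value, ok) in pipeline_outputs.items():
            self.pending.discard((fu, reg))      # hazard cleared either way
            if ok:
                writes.append((reg, value))      # real write: uses a port
            # if not ok: the reservation is simply dropped - no port used,
            # and the FU completes earlier than a "must always write" scheme
        return writes

r = WriteReservations()
r.issue("ALU0", ["RT", "CR0", "XER.SO"])
print(r.complete("ALU0", {"RT": (5, True), "CR0": (0b0100, True),
                          "XER.SO": (0, False)}))   # XER.SO write dropped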
(In reply to Luke Kenneth Casson Leighton from comment #8)
> [that's five output registers, not two]

yeah, yeah... i don't want to type 5 registers. also, not all ALUs that can add necessarily must support all possible add-like instructions, they may just support add[i][.] and nothing else, so rt and cr0 would be sufficient there.

> this creates false (unnecessary) dependencies that in turn puts pressure
> on the read/write ports of the regfile and/or on the number of FUs required
> to hold in-flight data due to missed write opportunities.

well... there is always exactly one write port on the register, cuz the register is only ever written by its corresponding ALU. whenever an instruction has finished executing, it always immediately writes its result to the corresponding register; there is never any reason to delay at all once execution has started, so... there are never missed write opportunities.

re: unnecessary dependencies: other than load-update (which is weird and could be split into 2 uops: addr calc and load), basically all instructions can/should write all outputs simultaneously (when they finish the last stage of their execution pipe), so we never really have the case where some instruction's cr0 output is ready 5 cycles earlier than its rt output.
(In reply to Luke Kenneth Casson Leighton from comment #9)
> now, it so happens that only the Function Unit itself can determine,
> itself, whether things such as XER.SO actually need to be written,
> because writing to XER.SO is determined from the *input*, which
> is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is
> actually issued.

well, I'm approaching it from the perspective of: the instruction is fully known at decode time. if the instruction is an addi, then it never writes SO, and any successive instructions that read SO ignore the addi, not waiting for it. if it's addo, then it *always* writes SO, writing 0 if necessary, and any successive instructions that need SO will *always* wait for the addo.

it never *maybe* writes anything, cuz that has questionable benefits and requires additional logic; it's always completely determined at decode time.
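a tiny sketch of what I mean (an invented illustrative table, obviously not the real microwatt-derived decoder CSVs): the write-set is a pure function of the opcode, never of the data:

WRITE_SET = {
    "addi":  {"RT"},
    "addi.": {"RT", "CR0"},
    "addo":  {"RT", "XER.SO", "XER.OV"},   # always writes SO, even if it's 0
    "addo.": {"RT", "CR0", "XER.SO", "XER.OV"},
}

def regs_written(opcode):
    # fully determined at decode time: no "maybe writes", so the renamer /
    # dependency tracking never has to make a just-in-case reservation
    return WRITE_SET[opcode]

print(regs_written("addi"))   # {'RT'} - readers of SO never wait on an addi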
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > now, it so happens that only the Function Unit itself can determine,
> > itself, whether things such as XER.SO actually need to be written,
> > because writing to XER.SO is determined from the *input*, which
> > is, clearly, NOT YET EVEN AVAILABLE at the time that the instruction is
> > actually issued.
>
> well, I'm approaching it from the perspective of: the instruction is fully
> known at decode time, if the instruction is an addi, then it never writes
> SO, and any successive instructions that read SO ignore the addi, not
> waiting for it.

that's what the PowerDecoder2 does. it's always done that, because we got that trick from Microwatt.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=edf2893b3dec4749822db7d926efb4eaa0eea9b2;hb=HEAD#l966

 966         # rc and oe out
 967         comb += self.do_copy("rc", dec_rc.rc_out)
 968         if self.svp64_en:
 969             # OE only enabled when SVP64 not active
 970             with m.If(~self.is_svp64_mode):
 971                 comb += self.do_copy("oe", dec_oe.oe_out)
 972         else:
 973             comb += self.do_copy("oe", dec_oe.oe_out)

(this is where you can see that the rule about OE being entirely ignored in SVP64 is implemented).

"listening" to Rc and OE comes from the CSV files, which originally come from microwatt decode1.vhdl. therefore addi *DOES NOT* require XER.SO to be written.

here - source code which i have already referred you to and you clearly haven't read or asked questions about, just made arbitrary fundamental assumptions:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_regspec_map.py;h=7c066d7dbd691c089a13d981b3851a02ee1f1f89;hb=HEAD#l88

  85     if name == 'xer_so':
  86         # SO needs to be read for overflow *and* for creation
  87         # of CR0 and also for MFSPR
  88         rd = RegDecodeInfo(((e.do.oe.oe[0] & e.do.oe.ok) |
  89                             (e.xer_in & SO == SO) |
  90                             (e.do.rc.rc & e.do.rc.ok)), SO, 3)

note there that if the PowerDecoder2 is instructed to ignore XER.SO, it does so, and does not even request the Read Port.

there is a corresponding piece of code - which you clearly haven't looked at - which likewise performs the same check on the XER.SO write port:

 179     if name == 'xer_so':
 180         wr = RegDecodeInfo(e.xer_out | (e.do.oe.oe[0] & e.do.oe.ok),
 181                            SO, 3) # hmmm

thus: addi DOES NOT request XER.SO, read or write. only those instructions which MIGHT require XER.SO actually request it.

and therefore there is no "major design flaw", just an alarming lack of working knowledge on your part as to the internals of the design, even after two years, leading you to create "alternative designs that you think are better" rather than talking about it months ago, and resulting in you spending significant time *not* helping us fulfil our goals and obligations. classic "Not Invented Here" syndrome, i'm sorry to have to point out.

> if it's addo, then it *always* writes SO, writing 0 if
> necessary, and any successive instructions that need SO will *always* wait
> for the addo. it never *maybe* writes anything, cuz that has questionable
> benefits and requires additional logic, it's always completely determined at
> decode time.

no, it's not. you've failed to listen to what i wrote. it is *not possible* to determine entirely at instruction decode time whether XER.SO needs to be written to. yes it is an optimisation, but an important one. you are correct in that actually, with XER.SO not being popularly used, it's not that important (for XER.SO).
however the code itself has this capability (to "drop" write-port-requests, as determined based on information discovered **AFTER** the instruction decode phase), which will become critically important later on when predication is added. at that point it will matter a hell of a lot, hence why i went to the trouble of putting the infrastructure in place even at this early stage, because it will be too damn difficult to add later [the "ok" flags on ALU regspecs, which is part of MultiCompUnit and part of the entire pipeline data specifications].

and it's not, honestly, that difficult to detect [but it is a lot of design work right throughout the entire pipelines]. this is the one line of code needed to identify when the condition occurs:

 741         with m.If(fu.alu_done_o & latch_wrflag & ~fu_wrok):

that's:

* "the ALU is done, i.e. it is requesting that its output (which will include XER.SO) be 'latched'"
* at that exact moment, the dest "ok" flag may (or may not) be set
* latch_wrflag captures those registers that were requested back at ISSUE time

the combination of these tells you which registers were REQUESTED to be written, but where the pipeline is telling you "NO, i do NOT need to write to them". because the pipeline is saying "no write needed", there is never going to be a corresponding write-request to the regfile, and consequently the Write-Hazard may be dropped.

if it isn't dropped, all hell breaks loose: the entire Engine will lock up permanently, because the write-port was requested at ISSUE time but is never cleared.

it's solved with a one-line test [but it required a LOT of careful advance planning to get to that point, and a hell of a lot of work on MultiCompUnit]

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;h=bb7f8ce9e9b7b8454bc583b7fa2363f99c6e62a7;hb=56d6f9114733a20015df85da59c5d2ce694a465b#l731