737 – in-order single-issue Power ISA 3.0 core

Bug 737 - in-order single-issue Power ISA 3.0 core

Summary: in-order single-issue Power ISA 3.0 core

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's second ASIC
Classification:	Unclassified
Component:	source code (show other bugs)
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Cesar Strauss

URL:

Depends on:	726 747 749 1094
Blocks:	961 1149 690
	Show dependency tree / graph

Reported:	2021-10-30 20:23 BST by Luke Kenneth Casson Leighton
Modified:	2023-10-11 15:22 BST (History)
CC List:	3 users (show)

See Also:	589 741 742 743 751 491 262 51 450 465 92
NLnet milestone:	Future
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2021-10-30 20:23:02 BST

a straight in-order core is needed which has standard Power ISA 3.0
support and single-issue.  very simple, no reg renaming, no optimisations,
run-ahead speculation, unwinding or cancellation.  if an instruction cannot
complete 100% then stalling is deployed.

three? four? stages:

* fetch

  Stage API input:
     start (DMI)
     stop  (DMI)
     step  (DMI)
     pc    (DMI)

* decode

  Stage API input:
     CoreState
     raw_insn_i
     bigendian_i

* issue

  Stage API input, for core:
     self.maindecoder: 
        102 Decode2ToExecute1Type("core", opkls=IssuerDecode2ToOperand)
     self.state = CoreState("core")
     self.raw_insn_i = Signal(32) # raw instruction
     self.bigendian_i = Signal() # bigendian - TODO, set by MSR.BE

testing:

* python3 simple/test/test_issuer.py nosvp64 --allow-overlap hazard

Comment 1 Luke Kenneth Casson Leighton 2021-10-30 20:50:14 BST

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD

core.py can likely be used as-is with no modifications, as long as the
issue is stalled at issue phase itself, if the register "write outstanding" bitvector has all the bits clear.

process:

* decode detects register numbers for write and read
  and creates a unary bitvector mask for each.
  - add r5, r7, r3 would create:
  - read vector of 0b0000000010001000
  - wr   vector of 0b0000p00000100000
* global WRITE vector is ANDed with instruction READ vector
  - this detects any RaW (Read after Write) hazards
  - if the AND has any bits set, the entire processor MUST stall
  - this instruction and all others must NOT be issued
  - if it is clear (on this or a future cycle) the instruction
    may proceed
* when the instruction proceeds the WRITE vector must be ORed
  into the GLOBAL write vector
* when an instruction completes, the bits that were set in
  the GLOBAL write vector must be cleared.
  - this can be done by hooking into the write port (snooping)
    of the regfile (PC update is done this way)

two things to note:

* be careful when both reading and writing to the same reg.
  the regfiles all have "operand forwarding" so it is technically
  possible to achieve, but watch out for the bitvector
* cancellation (exceptions) has to be thought through: anything
  still in the pipeline that was issued BEFORE the exception
  point has to be allowed to finish.
  strictly speaking if an exception could occur then no
  further instruction should be issued until the possibility
  of an exception has passed.  (i.e. stall)

Comment 2 Luke Kenneth Casson Leighton 2021-10-30 22:17:46 BST

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD

regfiles are created either binary-addressed or unary-addressed,
core.py shows how:

 267         # select the required read port.  these are pre-defined sizes
 268         rfile = regs.rf[regfile.lower()]
 ...
 329                 if rfile.unary:
 330                     rens.append(addr_en)
 331                 else:
 332                     addrs.append(addr_en)
 333                     rens.append(rp)

therefore, for creating a bitvector class, the Regfiles.rf dict can be
enumerated, and a series of setter/getter ports added.  unary regfile
numbers (CRs) can be used as-is, but binary ones (RA, RB) must be
converted to unary, 1<<RA

funnily enough, a good class to use for that would be... an unary-addressable
regfile!

probably: RegFileArray(1,32) - 32 regs but 1 bit per reg.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfile.py;h=c3f33393bde72951b27aa72664795c572913a7d0;hb=HEAD#l103

there will only be one bit per "reg", but enough ports need to be added
so that issue can read all it needs and write all bits from all operands
simultaneously.

by that i mean:

* for LD with update there are 2 reads and 2 write regs (GPRs)
* ST with update, 3 reads 1 write
* mul-accumulate 3 read 1 write

therefore the INT reg bitvector "regfile" will need to be 2W
but only 1R because although you want to be able to write 2
bits simultaneously (LD-with-update) only the Issuer will be
reading the global vector.

similar analysis for CR file, etc etc

there may be "better" ways to do this, but this is reasonably quick and easy.

Comment 3 Luke Kenneth Casson Leighton 2021-10-31 14:52:08 GMT

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_outmux_pipe.py;h=d94b6394b55519d3a1f8383f24d3abde4c6e9539;hb=HEAD#l125

here is an example of how to use MultiOutPipe, although it may not
becessarily be appropriate because of the different data sets
(different subsets of decoding for different pipelines). it *might*
still work though.

the muxid would be set to the fan-out pipeline number, where a map
was needed: muxid 0 => ALU, muxid 1 => Logical etc.

although... looking now at Core connect_instruction i think it takes
care of the fan-out perfectly well without needing MultiOutPipe
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD#l193

so i think mostly instead it is fine to have a simple linear pipeline:

* fetch
* decode
* issue to Core which handles fanout

stall condition would be that the issue to core would not send "ready_o"
if it was detected that the Global Hazard Vector had a bit set.

each pipeline stage can be set as a StageAPI instance, but due to
the stalling you have to have functions which say if the data
is ready to be accepted
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_buf_pipe.py;h=e738657a08c20fce0b97f8d87be3c21b22fa88be;hb=HEAD#l721

 720     @property
 721     def d_ready(self):
 722         """ data is ready to be accepted when this is true
 723         """
 724         return (self.count == 1)  # | (self.count == 3)
 725         return Const(1)
 726 
 727     def d_valid(self, i_ready):
 728         """ data is valid at output when this is true
 729         """
 730         return self.count == self.valid_trigger
 731         return Const(1)

Comment 4 Luke Kenneth Casson Leighton 2021-11-01 00:07:58 GMT

continuing the investigation: the MultiCompUnits in soc/fu/compunits.py
need to be concurrent, using ReservationStations.

* ReservationStations needs to be enhanced so that "fake" ALU instances
  are created which connect self.n[i] and self.p[i] into
  self.fakealus[i]

* the "fake" ALUs need to be passed each to MultiCompUnits
  in order to present multiple access points to the same
  underlying pipeline

* also their index muxid has to go into the fake alu

* each "fake" ALU back in compunits.py is given a number
  alu0 alu1 mul0 mul1 mul2 etc to give the impression
  of having multiple concurrent units

* in core.py the selector (enable signal) is masked out
  by FU busy signals to ensure that already-allocated
  Compunits do not try to allocate twice.

therefore there is:

* one actual (real) ALU
* one ReservationStation per ALU
* multiple ReservationStations which are named
* the appearance of more FunctionUnits than actually exists
* but all results are "managed".

Comment 5 Luke Kenneth Casson Leighton 2021-11-01 09:23:08 GMT

  96 class ReservationStations(Elaboratable):
 110     def __init__(self, num_rows, maskwid=0, feedback_width=None):
 111         self.num_rows = nr = num_rows
 112         self.feedback_width = feedback_width
 113         self.inpipe = InMuxPipe(nr, self.i_specfn, maskwid)   # fan-in
 114         self.outpipe = MuxOutPipe(nr, self.o_specfn, maskwid) # fan-out
 115 
 116         self.p = self.inpipe.p  # kinda annoying,
 117         self.n = self.outpipe.n # use pipe in/out as this class in/out

 118     def set_alu(self, alu):
 119         srlf.alu, self.pseudoalus = alu, []
 120         for i in range(num_rows):
 121             self.pseudoslus.append(PseudoALU(alu, self.p[i], self.n[i])

class PseudoALU:
    def __init__(self, alu, p, n):
        self.alu = alu
        self.p = p
        self.n = n

here, now, when a ReservationStation class is used, the actual
pipelined ALU will be presented as if it has simultaneous
(concurrent) access ports but in reality the fan-in / fan-out
of the RS Class only allows one request to be picked at a time.
importantly, then, all requests are "managed" (monitored) from
start to finish.

if num_rows=1 this is the "Single" situation right now with TestIssuer.

the length of each pipeline must be known because there must be
equal or more RS rows per pipeline stage.  if this is too many
then FSMs must be used instead.

Comment 6 Cesar Strauss 2021-11-01 21:15:50 GMT

I think I'd prefer to start from a clean (empty) InOrderIssuer.py, and copy things over from TestIssuer.py as needed. Another way would be to make a full copy, and try to morph it incrementally, one FSM at a time, while making sure the tests still pass. What do you think?

Comment 7 Luke Kenneth Casson Leighton 2021-11-01 23:19:30 GMT

(In reply to Cesar Strauss from comment #6)
> I think I'd prefer to start from a clean (empty) InOrderIssuer.py, and copy
> things over from TestIssuer.py as needed. Another way would be to make a
> full copy, and try to morph it incrementally, one FSM at a time, while
> making sure the tests still pass. What do you think?

well, the FSMs were not designed for single-clock, they were in most cases
designed for multi-clock operation.  much of the "setup" though (interfaces,
in/out, PowerDecoder2, core, DMI interface) will remain exactly the same
(it has to, really)

honestly, though, because of using the StageAPI, it may be simpler just
to write it from scratch, even starting from not actually having a core
at all but just fetch, issue (ignore it), increment the PC.

really there should not be more than 450 lines of code (even when including
DMI, DEC and TB), so cutting out the core (entirely, pretending the instruction
has been executed with a "fake" ALU, or not even looking at the MultiCompUnit
signals), it should be quite straightforward.

btw i have started on compunits adding ReservationStations, only to find
that the Mux-In and Mux-Out classes are hopeless :)  they do not meet the
ready/valid API properly and i think i know why: the Mux-Out needs a
"busy" flag, per fan-out output.

Comment 8 Jacob Lifshay 2021-11-01 23:37:51 GMT

I wrote a superscalar branch predictor that handles next-pc logic, it could probably be adapted for the in-order cpu by setting the issue width to 4-bytes (or 8-bytes if we want to handle any openpower v3.1 instructions). It has ready/valid inputs/outputs so should be pretty easy to integrate.

BranchPredictor class:
https://salsa.debian.org/Kazan-team/reg_rename_demo_cpu/-/blob/9c8690b116b0ad775aaeef066760a29ca6e60979/reg_rename_demo_cpu/fetch.py#L349

tests:
https://salsa.debian.org/Kazan-team/reg_rename_demo_cpu/-/blob/9c8690b116b0ad775aaeef066760a29ca6e60979/reg_rename_demo_cpu/test_fetch.py#L76

I wrote it as part of an attempt to show that a simple superscalar OoO register-renaming cpu wouldn't take a month to write.

Comment 9 Luke Kenneth Casson Leighton 2021-11-02 00:10:23 GMT

(In reply to Jacob Lifshay from comment #8)
> I wrote a superscalar branch predictor that handles next-pc logic, 

a branch predictor is a non-essential task.  anything that requires any
kind of cancellation (branch prediction being one such) is completely
off the table: stall is the sole exclusive option until such time as
code execution is successful.

there is insufficient time to do otherwise.

> it could
> probably be adapted for the in-order cpu by setting the issue width to
> 4-bytes (or 8-bytes if we want to handle any openpower v3.1 instructions).

no, those are off the table, too.

there is insufficient time.

> It has ready/valid inputs/outputs so should be pretty easy to integrate.

doesn't help if there is no InOrder Issuer to integrate into.

please keep this bugreport focussed on getting the required bare
minimum features completed as quickly as possible.

> I wrote it as part of an attempt to show that a simple superscalar OoO
> register-renaming cpu wouldn't take a month to write.

the focus of this bugreport has nothing to do with OoO or
register renaming, please keep it focussed exclusively on
the set task so as not to cause distraction.

Comment 10 Jacob Lifshay 2021-11-02 00:20:35 GMT

(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #8)
> > I wrote a superscalar branch predictor that handles next-pc logic, 
> 
> a branch predictor is a non-essential task.  anything that requires any
> kind of cancellation (branch prediction being one such) is completely
> off the table: stall is the sole exclusive option until such time as
> code execution is successful.

we could have the fetch pipe fetch ahead...execution could still always stall rather than speculating -- all that would happen is it will have fetched the correct target instead of always obliviously fetching the instructions immediately following the branch so a branch doesn't always require flushing the entire fetch pipe... that will speed the whole cpu up by probably 50%.
> 
> there is insufficient time to do otherwise.

it isn't that complex imho... just an idea that we can use (or not).

> > I wrote it as part of an attempt to show that a simple superscalar OoO
> > register-renaming cpu wouldn't take a month to write.
> 
> the focus of this bugreport has nothing to do with OoO or
> register renaming, please keep it focussed exclusively on
> the set task so as not to cause distraction.

yup...just explaining where the code comes from.

Comment 11 Luke Kenneth Casson Leighton 2021-11-02 00:52:57 GMT

(In reply to Jacob Lifshay from comment #10)

> we could have the fetch pipe fetch ahead...execution could still always
> stall rather than speculating -- all that would happen is it will have
> fetched the correct target instead of always obliviously fetching the
> instructions immediately following the branch so a branch doesn't always
> require flushing the entire fetch pipe... that will speed the whole cpu up
> by probably 50%.

this is an optimistion, and it is scope-creep.

it is the absolute worst possible thing to do on a project that is
under time pressure to suggest optimisations, particularly when
no actual code even implementing the specified non-optimised design
does not even exist yet.

i have asked you multiple times to focus and not introduce distractions.

please stop recommending, raising, or discussing optimisations.

if there was no time pressure under contract it would be perfectly fine.

as we are under contract and under time pressure it is not fine
and never will be fine.

please listen and keep this bugreport focussed.

no more ideas that could jeapordise the contract: only action
and absolute focus on completion of the contract.

are we absolutely clear?

Comment 12 Luke Kenneth Casson Leighton 2021-11-07 15:42:13 GMT

commit 45d50392c8fd4a70867ef82db0bc2f366b408cee (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun Nov 7 15:40:17 2021 +0000

    add hazard vectors to Regfiles
    the reason for adding it to Regfiles is because both In-Order and OoO
    need global hazard vectors.
    
    in the case of In-Order the hazard vector bits are set directly by the
    Issue Engine.
    
    in the case of Out-of-Order the vector bits are set by way of an
    amalgamation (Great Big Or Gate) of the columns from the DMs
    
    in either case the vectors are needed, so might as well be added to Regfiles

Comment 13 Luke Kenneth Casson Leighton 2021-11-07 15:47:53 GMT

https://bugs.libre-soc.org/show_bug.cgi?id=742#c5

ReservationStations2 is now working, and has been tested with num_rows=1
(and is in active usage already).

something to watch out for, in core.py:

        # enable the required Function Unit based on the opcode decode
        # note: this *only* works correctly for simple core when one and
        # *only* one FU is allocated per instruction
        for funame, fu in fus.items():
            fnunit = fu.fnunit.value
            enable = Signal(name="en_%s" % funame, reset_less=True)
            comb += enable.eq((self.e.do.fn_unit & fnunit).bool())
            comb += fu_bitdict[funame].eq(enable)

therefore it is not yet safe to enable num_rows > 1

it is perfectly fine to set num_rows=1 and have InOrder overlapping
instructions, that can go ahead (right now), there should be no
problem.

an add instruction will quite safely overlap with a mul instruction
will quite safely overlap with a DIV instruction etc. etc.

the problem will come if num_rows is set to greater than one: this will
result in more than one "en_{funame}" being set, and the instruction
will be sent to *more than one* Function Unit.

to solve that, a PriorityPicker is needed (just like exists in the
CDC6600 and therefore also in the scoreboard6600 experimental code)

but, that is for later: right now there is nothing stopping the InOrder
core from being written (just with all num_rows=1)

Comment 14 Luke Kenneth Casson Leighton 2021-11-07 22:17:30 GMT

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/fsm.py;h=1b22ca6f3f145f58e547451f496106e07bcc188d;hb=809cf2faa4450901779045cfaa89e69f70ed9f42#l133

crsar, here is an example of an FSM-based pipeline which is setting the
ready/valid signalling directly, itself.

you need:

* derive from ControlBase
* add p.i_data manually
* add n.o_data manually
* set p.o_ready only when ready to accept input (no stall)
* always accept one data per clock when p.i_valid & p.o_ready is true
* only send on data when n.i_ready is HI
* always set n.o_ready at the exact same time as n.o_data is valid

that's about it. quite easy. generally though it is better to properly
use the API by setting up a completely separate "stage" module which
is passed in to ControlBase.  or, the class to pass in stage=self.

hmm i just realised that the DIV FSM is actually completely ignoring the
Stage API, i will update it.

Comment 15 Luke Kenneth Casson Leighton 2021-11-07 22:51:15 GMT

done.

hilariously, there's a severe (catastrophic) bug in python 3.7 which
required a workaround (addition of a setup() function) to prevent
a (catastrophic) coredump by /usr/bin/python3.7 due to a null pointer
de-reference.

ignoring that...

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=b50d83f14e83aae30f52aedd030e385889a50867

 class FSMDivCoreStage(ControlBase):
     def __init__(self, pspec):
-        super().__init__()
-        self.pspec = pspec
-        self.p.i_data = CoreInputData(pspec)
-        self.n.o_data = CoreOutputData(pspec)
-        self.saved_input_data = CoreInputData(pspec)
+        self.pspec = pspec # store now: used in ispec and ospec
+        super().__init__(stage=self)
+        self.saved_input_data = self.ispec()
         self.empty = Signal(reset=1)
         self.saved_state = DivState(64, name="saved_state")
         self.div_state_next = DivStateNext(64)
         self.div_state_init = DivStateInit(64)
         self.divisor = Signal(unsigned(64))
 
+    def ispec(self):
+        return CoreInputData(self.pspec)
+
+    def ospec(self):
+        return CoreOutputData(self.pspec)

... you can see how rather than self.p.i_data and self.n.o_data
being explicitly set, the Stage API is supposed to be used: setting
i_data and o_data is the Stage API's job (inherited by ControlBase).

Comment 16 Luke Kenneth Casson Leighton 2021-11-09 17:57:20 GMT

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=7036fbf292ed8a5bc8393c3c95e15a28870ee325

i now have a suite of PriorityPickers at the front of Core issue.

this means that only one FU will ever get picked (even if there
are multiple ReservationStations).

this can actually be tested even with TestIssuer by issuing a DIV
instruction followed by (a few) add instructions.  the DIV will
still be running (64+ cycles) whilst the ADD gets round to being
loaded and issued.

of course if the DIV instruction tries to use the same registers
as the ADD it will get the wrong answer, but it will be a way
to test whether overlapping instructions can work at all.

Comment 17 Luke Kenneth Casson Leighton 2021-11-11 14:37:25 GMT

Cesar,

i am slowly morphing core so it can have regfile hazard vectors
(which you won't need to start with as long as running test
instructions that avoid hazards)

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=103d9306983e8782c930590fe58af2fc960ee216

this puts the requested port names into a pair of dictionaries.
i can then, for the "Hazards" regfiles, request the exact same
names, which will make it much easier in core.py to create a
matching "bit-setter" / "bit-clearer" system.

Comment 18 Luke Kenneth Casson Leighton 2021-11-16 20:20:13 GMT

the write protection hazard vector is now in place but is not in
use.  there is further work needed to cover the situation where
Data.ok is not set by the ALU but wrflags was set.  this is
reasonably straightforward to do.

after that, the bitvector read is ready to try out, and that can
be done with TestIssuer by using a DIV and an ADD instruction.
DIV will take many more cycles than the FSM so is perfect to try.

Comment 19 Luke Kenneth Casson Leighton 2021-11-17 13:42:20 GMT

cesar i added a FetchOutput data structure, for fetch pipeline ospec():

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=f1c63229cdf94d9fbca54086ff13d6a149245814

class FetchOutput:
    def __init__(self): #, svp64_en):
        self.state = CoreState("core_fetched")
        self.raw_insn_i = Signal(32) # one raw instruction
        self.bigendian_i = Signal() # bigendian - TODO, set by MSR.BE

Comment 20 Luke Kenneth Casson Leighton 2021-11-17 18:13:44 GMT

commit 7492a3533c61a6999d36df687c8d5e6e3603b0d6 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Nov 17 18:01:27 2021 +0000

    reading of regfile bitvector added, which activates on a per-FU basis
    at the regfile read port
    
    this is somewhat complete overkill because strictly speaking the
    read should be done at issue time.  fortunately, merging of lots of ORs
    results in the exact same thing, just distributed
    
    horribly inefficient though

to make use of pre-existing for-loops and data structures, decode_regfile_read()
at each (distributed) point can raise its flag, "is the main decoder requesting
that this register be read"

actually what is needed is: back in the connect_instruction() function
all available decode_regfile_read() functions are called there and then,
only the once, for each register.

looking at decode_regfile_read() there should, strictly speaking, be 17
separate bitvector requests/merges. at present there are *30* because
MUL requests RA, ALU requests RA, etc. etc.

Comment 21 Cesar Strauss 2021-11-19 10:33:34 GMT

I was thinking about the "core stopped" signal for DMI.

I think it could be generated by the AND of all the "o_ready" signals of the reservation stations. When all reservation stations are ready, it means they are not processing any instructions, and the core is stopped.

Another way would be to maintain a (small) counter of issued instructions, and another for retired instructions. When they match, there is no instruction in flight, and the core is stopped.

Comment 22 Luke Kenneth Casson Leighton 2021-11-19 13:05:35 GMT

(In reply to Cesar Strauss from comment #21)
> I was thinking about the "core stopped" signal for DMI.
> 
> I think it could be generated by the AND of all the "o_ready" signals of the
> reservation stations. When all reservation stations are ready, it means they
> are not processing any instructions, and the core is stopped.

ah no, it has to also include the DMI state of the user request.
as in: yes, you are right: stopped can be generated that way, but
ONLY when requested to do so.

> 
> Another way would be to maintain a (small) counter of issued instructions,
> and another for retired instructions. When they match, there is no
> instruction in flight, and the core is stopped.

there is already an fu busy signal per FU, a counter is not needed.
an OR of all FU Busy signals is already how the FSM "busy" is generated
right now.

Comment 23 Luke Kenneth Casson Leighton 2021-11-20 23:45:21 GMT

the integration of bitvectors went perfectly right up to the moment when
a combinatorial loop was detected.  the loop is as follows:

* bitvector regfiles were set to a combinatorial variant on operand
  forwarding.
* this so that set/clear of writes could be detected immediately by
  the read hazard, in the *current* cycle
* this so that issue could be held until the instruction was ready

unfortunately the actual setting/clearing is gated by FU issue/busy
which is also combinatorial which results in the read of the bitvector
being dependent on setting/clearing which then gates issue and hence
there's a loop.

the solution - which occurred to me just now - is this:

* make the bitvector regfiles a sync (one-clock delay)
* request the read of the bitvector at issue time (bear in mind
  it will arrive 1 clock late but that's ok because...)
* once issue is set, only on the *next* clock will the read-requests
  be set, at which point...
* the bitvector hazard read will also arrive and...
* if any bit is set, the read-requests can be gated out (prevented)

the irony is that this is *exactly* and i do mean exactly the job of
a FU-Regs Dependency Matrix

Comment 24 Luke Kenneth Casson Leighton 2021-11-21 21:55:59 GMT

(In reply to Luke Kenneth Casson Leighton from comment #23)

> * if any bit is set, the read-requests can be gated out (prevented)

success, by gating the rd.go.rel_o if there is even one read hazard for
the port.  this is quite draconian (excessive) but does the job.

now a process of checking for errors can begin: this will get complicated
quite quickly.

Comment 25 Cesar Strauss 2021-11-22 10:14:38 GMT

Current plan for the in-order pipelined issuer:

* Create pipelined_issuer.py in src/soc/simple (alongside issuer.py).

* Copy interface and initialization code from TestIssuerInternal into PipelinedIssuer class, so that one can be substituted for the other transparently.

* Add an argument to test_issuer.py, to instantiate a PipelinedIssuer module instead of TestIssuerInternal. Likely, need to adjust HDLRunner in https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/test/test_runner.py;h=03d4fe96197706bd8c52b9fc28d3cd8952dc05cf;hb=HEAD#l132

Comment 26 Luke Kenneth Casson Leighton 2021-11-22 13:54:22 GMT

(In reply to Cesar Strauss from comment #25)
> Current plan for the in-order pipelined issuer:
> 
> * Create pipelined_issuer.py in src/soc/simple (alongside issuer.py).

to help with clarity on that, what i think i will do this afternoon is split
out fetch_fsm into its own submodule.  it is one of the few with clear
in/out ready/valid signalling, whereas some of the others have dual
ready/valid.  planning ahead back when TestIssuerInternal was being developed,
they shouldn't have had, but hey.

> * Copy interface and initialization code from TestIssuerInternal into
> PipelinedIssuer class, so that one can be substituted for the other
> transparently.

good idea

> * Add an argument to test_issuer.py, to instantiate a PipelinedIssuer module
> instead of TestIssuerInternal. Likely, need to adjust HDLRunner in
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/test/
> test_runner.py;h=03d4fe96197706bd8c52b9fc28d3cd8952dc05cf;hb=HEAD#l132

yes, that one bypasses TestIssuer itself due to the clock-domain
crossing and the (optional) addition of a PLL and other things,
none of which are relevant for actual instruction running in
a unit test environment.

Comment 27 Luke Kenneth Casson Leighton 2021-11-22 21:23:09 GMT

IRC log notes

lkcl	yehyeh. this makes it... awkward to turn into separate FSMs, which then in turn can be morphed into pipelines	19:49
lkcl	perhaps by cutting out SVP64 entirely first it would become much easier	19:50
lkcl	everything should be a forward-chain (only)	19:53
lkcl	with the sole exception being:	19:53
lkcl	* reading of PC (if it is detected to have been changed by TRAP or BRANCH)	19:54
lkcl	* reading of MSR (same, by TRAP or MTMSR)	19:54
lkcl	* a global stall condition	19:54
lkcl	* a global "core reset" condition	19:55
lkcl	that's pretty much it: that's the only "backwards" feedback, from later stages to earlier ones, and even PC and MSR are via the regfile (already), not by special datapaths	19:56
lkcl	oh, of course, the exception flags, from LDST.	19:58
lkcl	those are also backwards-propagated	19:58
lkcl	under... guess what: stall conditions of course :)

Comment 28 Cesar Strauss 2021-11-23 09:19:08 GMT

(In reply to Luke Kenneth Casson Leighton from comment #27)
> yehyeh. this makes it... awkward to turn into separate FSMs, which then
> in turn can be morphed into pipelines
> perhaps by cutting out SVP64 entirely first it would become much easier
> everything should be a forward-chain (only)

Sure.

I can start fixing TestIssuerInternal FSMs to be forward-chain only, if it helps. In that case, I would leave the Fetch/Decode pipelines for later.

Comment 29 Luke Kenneth Casson Leighton 2021-11-23 21:35:24 GMT

(In reply to Cesar Strauss from comment #28)

> Sure.
> 
> I can start fixing TestIssuerInternal FSMs to be forward-chain only, if it
> helps. 

i think it will: then, when copying them to make the pipelines, actually
that becomes trivial.

> In that case, I would leave the Fetch/Decode pipelines for later.

i think it is an important intermediary step, that makes the creation
of pipelines almost trivial, especially when you can see how i split
out FetchFSM: actually the changes needed to make that "true" pipeline
are negligeable.

Comment 30 Luke Kenneth Casson Leighton 2021-11-27 15:46:28 GMT

the following cases are now successfully detected:

* read-after-write: instruction is issued but stalls,
   other instruction issue is NOT stalled (making
   the core "superscalar")
* write-after-write: instruction is NOT issued, but is
   captured and acknowledged back to the Issuer.
   attempts to issue continue and will succeed when
   the WaW hazard clears
* read-and-write-by-same-instruction: this case is
   covered by DELAYING write-hazard setting by one
   clock cycle, such that the RaW hazard check does
   NOT include the instruction currently being issued

potential problems to monitor:

* if the write-hazard bitvector setting is delayed, and
   the write to its SR Latch is also delayed, then two
   instructions issued one clock cycle apart might not
   properly detect read-after-write hazards.

   right now this is undetectable because the FSM
   of TestIssuer cannot issue instructions that fast
   (1 every 4-5 cycles)

* LD/ST instructions must not be acknowledged
   unless the opportunity for an exception has passed

* likewise branch and trap (anything that changes PC)

Comment 31 Luke Kenneth Casson Leighton 2021-12-15 01:13:35 GMT

must sort MSR use get_state exactly as with PC and SVSTATE.

also TODO create a TestIssuerBase class

Comment 32 Luke Kenneth Casson Leighton 2021-12-16 00:07:27 GMT

i have this split out now, separate file
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/inorder.py;hb=HEAD

the goal here is to remove every FSM, or, more to the point, make every FSM only one entry (one state).

an intermediary phase is to make every FSM two-state only:
one input ready/valid reader, one output ready/valid writer.

a Decode FSM (Decode stage) is currently missing, this should
involve pdecode2 and nothing else.  (oh except for state,
containing pc msr svstate i.e. CoreInput.

(In reply to Luke Kenneth Casson Leighton from comment #31)
> must sort MSR use get_state exactly as with PC and SVSTATE.
> 
> also TODO create a TestIssuerBase class

both done

Comment 33 Luke Kenneth Casson Leighton 2022-01-12 18:35:12 GMT

Cesar, i thought through some ideas, and if you pass the signal core.busy_o
globally back to fetch, that will be sufficient.

Fetch FSM:

      if not core.busy_o:
          n.data.insn = fetched_insn
          n.o_valid = 1

Decode Phase:

      if not core.busy_o
         same thing

in other words core.busy_o is the global stall condition.

in this way it will be really simple and easy to do.