Bug 313 - Create Branch Pipeline for POWER9
Summary: Create Branch Pipeline for POWER9
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Mac OS
: --- enhancement
Assignee: Michael Nolan
URL:
Depends on: 356 335
Blocks: 383
  Show dependency treegraph
 
Reported: 2020-05-15 16:30 BST by Luke Kenneth Casson Leighton
Modified: 2021-11-29 21:41 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLNet.2019.10.043.Wishbone
total budget (EUR) for completion of task and all subtasks: 250
budget (EUR) for this task, excluding subtasks' budget: 250
parent task for budget allocation: 383
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl = { amount = 100, paid = 2020-08-21 } donated = { amount = 150, submitted = 2021-04-24, paid = 2021-05-01 }


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-05-15 16:30:18 BST
see * https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl
* https://libre-soc.org/openpower/isa/branch/
* https://libre-soc.org/openpower/isa/fixedtrap/
* https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/fu/branch;hb=HEAD

we need:

* OP_B* - done
* OP_TRAP (moved to trap pipe)
* OP_RFID (also)

anything that sets the PC by setting NIA.  (decision made to have trap separate)
Comment 1 Michael Nolan 2020-05-15 18:26:23 BST
I'm working on defining the pipeline data type for this, and I'm wondering how I should handle the CR input. Conditional branches can branch on any of the 32 bits in the cr, so I think it should take in the entire CR. However, I suspect we'll be implementing cr0..cr7 as separate 4 bit registers to avoid false dependencies, so making the branch take in the whole register would complicate matters
Comment 2 Luke Kenneth Casson Leighton 2020-05-15 18:41:09 BST
(In reply to Michael Nolan from comment #1)
> I'm working on defining the pipeline data type for this, and I'm wondering
> how I should handle the CR input. Conditional branches can branch on any of
> the 32 bits in the cr, so I think it should take in the entire CR.

there is a selector in the insn, right? that is the "CR" equivalent of the RA, RB, RS and RT fields.

> However,
> I suspect we'll be implementing cr0..cr7 as separate 4 bit registers to
> avoid false dependencies, so making the branch take in the whole register
> would complicate matters

yes those are the two ways to do it:

A treat CR as separate 4 bit regs.

B treat it as one 64 bit reg

scheme A requires 8 regs as input if you need the full CR.  8 inputs from the Dependency Matrices is a bit much.

sometimes however you need the full 64 bit. i do have a Dep Matrix strategy for both, it involves a vector grouping system:

* both A and B are available, however q request for CR0 read will *also* set a request for the *whole* 64 bits of CR.

something like that.

however, it is complicated. 

so for now, i think we just go with full 64 bit.  annoying as that is.
Comment 3 Luke Kenneth Casson Leighton 2020-05-15 19:58:23 BST
hmm hmm i thought perhaps instead of passing nia into all ALUs (where it would
then not actually be used - the wires would be routed, taking up space but
then not actually go anywhere), to create a separate CompBROpSubset?

then, likewise in CompBROpSubset, anything _not_ used in any branch operation
can be removed from it.

diff --git a/src/soc/alu/alu_input_record.py b/src/soc/alu/alu_input_record.py
index 2c3dcc4..41a40eb 100644
--- a/src/soc/alu/alu_input_record.py
+++ b/src/soc/alu/alu_input_record.py
@@ -13,7 +13,6 @@ class CompALUOpSubset(Record):
     def __init__(self, name=None):
         layout = (('insn_type', InternalOp),
                   ('fn_unit', Function),
-                  ('nia', 64),
                   ('imm_data', Layout((("imm", 64), ("imm_ok", 1)))),
                     #'cr = Signal(32, reset_less=True) # NO: this is from the C
R SPR
                     #'xerc = XerBits() # NO: this is from the XER SPR
Comment 4 Luke Kenneth Casson Leighton 2020-05-15 20:01:50 BST
commit d1a2cb9f4852bd8118cb4db043e6622cae077830 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Fri May 15 19:58:47 2020 +0100

    add branch input record

will swap over to that in a mo
Comment 5 Luke Kenneth Casson Leighton 2020-05-15 20:14:17 BST
oink.  you *removed* NIA :)  well... we will probably need CompBROpSubset anyway.
things like data_len can be removed (i don't see branch needing that), and there
may be other fields as well.

i would have kiinda expected NIA to be passed in though (to CompBROpSubset), because
it comes from the Program Counter, which would be special-cased rather than stored
in an SPR.

hum hum, don't know.  oh.  right.  NIA is *next* instruction address, not an input.
ok so i think we need something called "cia" - current instruction address - in
CompBROpSubset.

btw do a git pull i just did a minor cleanup
Comment 6 Michael Nolan 2020-05-15 20:17:50 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> oink.  you *removed* NIA :)  well... we will probably need CompBROpSubset
> anyway.

Yep, since nothing but the branch unit used it

> hum hum, don't know.  oh.  right.  NIA is *next* instruction address, not an
> input.
> ok so i think we need something called "cia" - current instruction address -
> in
> CompBROpSubset.

I added CIA to the branch unit's input data because it wasn't one of the things output by the decoder (well NIA is, but it's defaulted to 0), and I needed to set it from the testbench.
Comment 7 Luke Kenneth Casson Leighton 2020-05-15 20:28:59 BST
ok am done.  btw comb Signals default to zero if not set... actually they
default to the value in their initialisation - Signal(len, reset=0x5000) -
so do not need to be initialised to zero.
Comment 8 Luke Kenneth Casson Leighton 2020-05-15 20:33:31 BST
(In reply to Michael Nolan from comment #6)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > oink.  you *removed* NIA :)  well... we will probably need CompBROpSubset
> > anyway.
> 
> Yep, since nothing but the branch unit used it

good call.

> > hum hum, don't know.  oh.  right.  NIA is *next* instruction address, not an
> > input.
> > ok so i think we need something called "cia" - current instruction address -
> > in
> > CompBROpSubset.
> 
> I added CIA to the branch unit's input data because it wasn't one of the
> things output by the decoder (well NIA is, but it's defaulted to 0),

as it's an output - and the CompBROpSubset is intended as as *input* to
the pipeline - yes NIA definitely goes.

> and I needed to set it from the testbench.

yehyeh.

we need cia however.  hmm hmm... it's not a "register" - there's no delay
in fetching it: the branch unit will receive the current PC right there,
because that (like the insn) is "known information".

yep, it needs to be added to CompBROpSubset.  will do that now...

commit 4e88a5d7027d5c60093925d38da9b1e63b77f743 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Fri May 15 20:33:15 2020 +0100

    add cia to CompBROpSubset, remove data_len
Comment 9 Luke Kenneth Casson Leighton 2020-05-15 20:36:10 BST
argh.

comb += branch.p.data_i.ctx.op.eq_from_execute1(pdecode2.e)

nope.  cia can't be part of BranchCompALUSubset - that's not its role,
the ctx.op only stores information decoded from the actual instruction.
will revert shortly.
Comment 10 Michael Nolan 2020-05-15 21:11:05 BST
diff --git a/src/soc/branch/pipe_data.py b/src/soc/branch/pipe_data.py
index cab43cf..6d97102 100644
--- a/src/soc/branch/pipe_data.py
+++ b/src/soc/branch/pipe_data.py
@@ -22,10 +22,11 @@ class BranchInputData(IntegerData):
         # We need both lr and spr for bclr and bcctrl. Bclr can read
         # from both ctr and lr, and bcctrl can write to both ctr and
         # lr.
<snip>
+        self.tar = Signal(64, reset_less=True) # Target Address Register
 

I removed the TAR input, because the branch instruction would need to decide which of the 3 inputs to branch to. Instead, it should be fed that operand in the spr input (though that does mean it will be fed ctr twice for bcctr...). Does that seem reasonable?
Comment 11 Luke Kenneth Casson Leighton 2020-05-15 21:31:18 BST
(In reply to Michael Nolan from comment #10)

> I removed the TAR input, because the branch instruction would need to decide
> which of the 3 inputs to branch to. Instead, it should be fed that operand
> in the spr input (though that does mean it will be fed ctr twice for
> bcctr...). Does that seem reasonable?

nggghhh.  *thinks*, *thinks*....

ok, the way that the Function Units work is: well, let me find a diagram..
https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg

although that's an ALU FU, in the case of a *branch* FU, those inputs would
be:

* CTR
* CIA
* CR
* TAR

basically, those latches at the top "accumulate" all of the inputs needed
(some of which as you can see actually come from the operand - this is
CompBROpSubset)

anything that needs to be read from a regfile - whether it be the SPR one,
the Condition Regfile or the INT Regfile - all needs to have those pairs
of REQUEST/GO_READ signals.

* the outgoing one *requests* the register (actually the regfile port)
* the incoming one (GO_READ) demands that the FU latch the incoming register
  data, right now.  it will not get another chance.

so the idea of doing time-synchronous feeding of the input data (sharing one
of the register inputs for two registers, one after the other) is massively
more complicated than simply storing the data in (multiple) Function Unit latches.

now, *at the regfile* there may only be one available Reg Port Bus.

therefore *over the (one) Reg Port Bus*, it may take two clock cycles:
* one to read CTR
* one to read TAR

and this can be done in any order: CTR first, TAR first, or if 2 Reg Port
Buses happen to be available, *both* simultaneously.

however that task is handled at the CompUnit (like in the diagram above),
precisely so that the pipelines *do not* have to have such complexity.

therefore, really, if the branch op needs 2 regs (one CTR, one TAR), then the
BranchCompUnit will need 2 latches to cover them, and consequently BranchInputData
needs to have both ctr and tar.

let's have a look at what was generated by the parser.  all the others are
subsets (less regs) than this one:

    @inject()
    def op_bctarl(self, CTR, CR, TAR, LR):
        global NIA
        if mode_is_64bit:
            M = 0
        else:
            M = 32
        if ~BO[2]:
            CTR = CTR - 1
        ctr_ok = BO[2] | ne(CTR[M:64], 0) ^ BO[3]
        cond_ok = BO[0] | ~(CR[BI + 32] ^ BO[1])
        if ctr_ok & cond_ok:
            NIA = concat(TAR[0:62], SelectableInt(value=0x0, bits=2))
        if LK:
            LR = CIA + 4
        return (CTR, CR, TAR, LR,)

that's interesting.  TAR is returned but i do not see it being modified.
that's a bug.

NIA however is calculated but is not returned *from* the... oh wait, yes
i see, it's a global.  ok.  got it.

so, yes, apologies, we can't use spr for dual-purpose, we need 2 regs,
and they might as well be named ctr and tar, rather than spr.
Comment 12 Luke Kenneth Casson Leighton 2020-05-15 21:39:27 BST
likewise we need LR at the same time.  yeah, i know.  it's a hell of a lot of
incoming wires and registers.

* LR 64-bit
* CIA 64-bit (48 probably more like?)
* CTR 64-bit
* TAR 64-bit
* CR 32-bit
* CompBROpSubset a *lot*.  around 64 bits due to the expanded decoding.

we just have to tolerate that, and have a *five* input Branch Computation Unit.

which has me slightly concerned for adding TRAP to the same pipeline, because
that brings in *another* 128 bits (!!) - RA 64-bit, RB 64-bit, plus it modifies
the CR so that's another 32 bits on the output side as well.  blegh.  can't do that.
new bugreport, there.
Comment 13 Luke Kenneth Casson Leighton 2020-05-15 21:46:10 BST
ahhh hang on: just looking at decoder/isa/system.py as well

that will take in... either:

* CTR for rfsvc
* SRR1 for one of the other ops
* HSRR1 for another

so yes!  CTR should remain named "spr" (or better, be named "spr1").

however the 2nd one... ahh op_rfid refers to *both* MSR *and* SRR1
so it should be called "spr2".

then, the execution issue will have to

* hard-decode the operations to detect which SPR is to go to spr1 and which to spr2
* put those SPR numbers into the SPR-Dependency-Matrix

and in this way, the right SPRs will arrive at spr1 and spr2

we should record those somewhere.... hmmm... in the BranchInputData file for now?
Comment 14 Luke Kenneth Casson Leighton 2020-05-15 22:07:40 BST
waaaa...

    insn       CR  SPR1  SPR2    SPR3
    ----       --  ----  ----    ----
    op_b       xx  LR     xx     xx
    op_ba      xx  LR     xx     xx
    op_bl      xx  LR     xx     xx
    op_bla     xx  LR     xx     xx
    op_bc      CR, LR,    CTR    xx
    op_bca     CR, LR,    CTR    xx
    op_bcl     CR, LR,    CTR    xx
    op_bcla    CR, LR,    CTR    xx
    op_bclr    CR, LR,    CTR    xx
    op_bclrl   CR, LR,    CTR    xx
    op_bcctr   CR, LR,    CTR    xx
    op_bcctrl  CR, LR,    CTR    xx
    op_bctar   CR, LR,    CTR,   TAR
    op_bctarl  CR, LR,    CTR,   TAR

    op_sc      xx  xx     xx     MSR
    op_scv     xx  LR,    SRR1,  MSR
    op_rfscv   xx  LR,    CTR,   MSR
    op_rfid    xx  SRR0,  SRR1,  MSR
    op_hrfid   xx  HSRR0, HSRR1, MSR

and if we did TRAP in the Branch pipeline as well, that would be
*six* incoming 64-bit register latch paths.  nooo, i don't think so :)

so the above table should guide the numbering.  LR - that's not an SPR,
is it?  if not, then ho hum we should call the field lr_or_spr1 or something
horrible.  not a huge fan of long data structure names, but hey :)

* the BranchCompUnit should expect to read the SPR regfile and allocate
  lr_or_spr1, spr2, and spr3 accordingly.
* BranchInputData should be the target recipient of that data
* branch/main_stage should decode the three fields lr_or_spr1, spr2 and spr3
  likewise according to that table.

for op_bc* that's dead simple: lr = lr_or_spr1, ctr = spr2, tar = spr3

for op_sc and other system calls it's more complex, but doable.
Comment 15 Michael Nolan 2020-05-15 22:26:53 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> waaaa...
> 
>     insn       CR  SPR1  SPR2    SPR3
>     ----       --  ----  ----    ----
>     op_b       xx  LR     xx     xx
>     op_ba      xx  LR     xx     xx
>     op_bl      xx  LR     xx     xx
>     op_bla     xx  LR     xx     xx
>     op_bc      CR, LR,    CTR    xx
>     op_bca     CR, LR,    CTR    xx
>     op_bcl     CR, LR,    CTR    xx
>     op_bcla    CR, LR,    CTR    xx
>     op_bclr    CR, LR,    CTR    xx
>     op_bclrl   CR, LR,    CTR    xx
>     op_bcctr   CR, LR,    CTR    xx
>     op_bcctrl  CR, LR,    CTR    xx
>     op_bctar   CR, LR,    CTR,   TAR
>     op_bctarl  CR, LR,    CTR,   TAR
> 
>     op_sc      xx  xx     xx     MSR
>     op_scv     xx  LR,    SRR1,  MSR
>     op_rfscv   xx  LR,    CTR,   MSR
>     op_rfid    xx  SRR0,  SRR1,  MSR
>     op_hrfid   xx  HSRR0, HSRR1, MSR
> 
> and if we did TRAP in the Branch pipeline as well, that would be
> *six* incoming 64-bit register latch paths.  nooo, i don't think so :)


That table is a bit large, at least for the branch instructions. None of the branches except for bclr(l) need to read LR, only write to it. And bclr doesn't need to read TAR, so multiplexing TAR/LR on the same port would be possible. 

> so the above table should guide the numbering.  LR - that's not an SPR,
> is it? 
It is, it's SPR #8


what are the opcodes starting with op_sc? I don't see any of them in the power_enums list, nor do I see corresponding opcodes in the power ISA specification
Comment 16 Michael Nolan 2020-05-15 22:27:33 BST
(In reply to Michael Nolan from comment #15)

> 
> what are the opcodes starting with op_sc? I don't see any of them in the
> power_enums list, nor do I see corresponding opcodes in the power ISA
> specification

NVM, I see where they're from
Comment 17 Luke Kenneth Casson Leighton 2020-05-15 23:23:53 BST
(In reply to Michael Nolan from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #14)
> > waaaa...
> > 
> >     insn       CR  SPR1  SPR2    SPR3
> >     ----       --  ----  ----    ----
> >     op_b       xx  LR     xx     xx
> >     op_ba      xx  LR     xx     xx
> >     op_bl      xx  LR     xx     xx
> >     op_bla     xx  LR     xx     xx
> >     op_bc      CR, LR,    CTR    xx
> >     op_bca     CR, LR,    CTR    xx
> >     op_bcl     CR, LR,    CTR    xx
> >     op_bcla    CR, LR,    CTR    xx
> >     op_bclr    CR, LR,    CTR    xx
> >     op_bclrl   CR, LR,    CTR    xx
> >     op_bcctr   CR, LR,    CTR    xx
> >     op_bcctrl  CR, LR,    CTR    xx
> >     op_bctar   CR, LR,    CTR,   TAR
> >     op_bctarl  CR, LR,    CTR,   TAR
> > 
> >     op_sc      xx  xx     xx     MSR
> >     op_scv     xx  LR,    SRR1,  MSR
> >     op_rfscv   xx  LR,    CTR,   MSR
> >     op_rfid    xx  SRR0,  SRR1,  MSR
> >     op_hrfid   xx  HSRR0, HSRR1, MSR
> > 
> > and if we did TRAP in the Branch pipeline as well, that would be
> > *six* incoming 64-bit register latch paths.  nooo, i don't think so :)
> 
> 
> That table is a bit large, at least for the branch instructions. None of the
> branches except for bclr(l) need to read LR, only write to it. 

err... nuts.  the parser is bringing lr in as an input, when it should not.
that's another bug.  let me update the table...

> And bclr
> doesn't need to read TAR, so multiplexing TAR/LR on the same port would be
> possible. 

yep.  good catch.
 
> > so the above table should guide the numbering.  LR - that's not an SPR,
> > is it? 
> It is, it's SPR #8

okaaay good.

 
> what are the opcodes starting with op_sc? I don't see any of them in the
> power_enums list, nor do I see corresponding opcodes in the power ISA
> specification

(i saw you found them, later comment - decoder/isa/system.py)

sooo, iiinteresting: OP_TRAP needs up to 3 SPRs, but OP_B* does not.
that is almost worthwhile considering having OP_TRAP and OP_B* in the
same pipeline... mm... 5 64-bit datapaths...  yyeah no :)
Comment 18 Luke Kenneth Casson Leighton 2020-05-16 00:49:17 BST
while you're doing the cr pipe i'll sort out spr1/2/3 in branch.
Comment 19 Luke Kenneth Casson Leighton 2020-05-16 01:56:36 BST
(In reply to Luke Kenneth Casson Leighton from comment #18)
> while you're doing the cr pipe i'll sort out spr1/2/3 in branch.

done - please do let me know if the submodule is messed up (it shouldn't be).

in the unit test it will be necessary to have some detection logic (decoding
the op somewhat) in order to correctly set spr1,2,3.  this will not be
"wasted" code because once written it is exactly what the instruction-issue
phase will have to do.
Comment 20 Luke Kenneth Casson Leighton 2020-05-17 18:37:49 BST
michael is this correct?

            # Yes, the CTR only counts 32 bits
            ctr = Signal(64, reset_less=True)
            comb += ctr.eq(self.i.ctr - 1)

that's 64-bit for ctr, and the input self.i.ctr is also 64-bit
Comment 21 Luke Kenneth Casson Leighton 2020-05-17 19:11:06 BST
    if (mode_is_64bit) then M <- 0
    else M <- 32

answer's "no".  32-bit in 32-bit mode, 64-bit in 64-bit mode
Comment 22 Michael Nolan 2020-05-17 20:17:38 BST
(In reply to Luke Kenneth Casson Leighton from comment #21)
>     if (mode_is_64bit) then M <- 0
>     else M <- 32
> 
> answer's "no".  32-bit in 32-bit mode, 64-bit in 64-bit mode

Oops. I thought I had seen somewhere that that wasn't the case when I wrote the comment, but decided to double check and forgot to delete the comment.
Comment 23 Luke Kenneth Casson Leighton 2020-05-19 19:30:14 BST
i looked more closely at the difference between branch, sys and trap, and rheir combination would be a massive EIGHT incoming register ports.

* RA, RB, CR, MSR for trap
* 2x SPRs, MSR for syscalls
* 2x SPRs, no MSR for branches

this is too much.

i had assumed that MSR was a SPR, it isn't.

so the number of SPRs for branch can be reduced to 2.  spr3 can be deleted.
Comment 24 Luke Kenneth Casson Leighton 2020-05-19 22:02:23 BST
ehn?  any clues, michael?

```
Traceback (most recent call last):
  File "fu/branch/test/test_pipe_caller.py", line 134, in run_all
    sim = Simulator(m)
  File "/home/lkcl/src/libresoc/nmigen/nmigen/back/pysim.py", line 935, in __init__
    self._fragment = Fragment.get(fragment, platform=None).prepare()
  File "/home/lkcl/src/libresoc/nmigen/nmigen/hdl/ir.py", line 39, in get
    obj = obj.elaborate(platform)
  File "/home/lkcl/src/libresoc/nmigen/nmigen/hdl/dsl.py", line 537, in elaborate
    fragment.add_subfragment(Fragment.get(self._named_submodules[name], platform), name)
  File "/home/lkcl/src/libresoc/nmigen/nmigen/hdl/ir.py", line 39, in get
    obj = obj.elaborate(platform)
  File "/home/lkcl/src/libresoc/nmigen/nmigen/hdl/dsl.py", line 537, in elaborate
    fragment.add_subfragment(Fragment.get(self._named_submodules[name], platform), name)
  File "/home/lkcl/src/libresoc/nmigen/nmigen/hdl/ir.py", line 39, in get
    obj = obj.elaborate(platform)
  File "/home/lkcl/src/libresoc/soc/src/soc/decoder/power_decoder.py", line 315, in elaborate
    m = PowerDecoder.elaborate(self, platform)
  File "/home/lkcl/src/libresoc/soc/src/soc/decoder/power_decoder.py", line 262, in elaborate
    comb += self.op._eq(row)
  File "/home/lkcl/src/libresoc/soc/src/soc/decoder/power_decoder.py", line 139, in _eq
    self.internal_op.eq(InternalOp[row['internal op']]),
  File "/usr/lib/python3.7/enum.py", line 351, in __getitem__
    return cls._member_map_[name]
KeyError: 'OP_TDI'
```
Comment 25 Luke Kenneth Casson Leighton 2020-05-19 22:06:03 BST
nggggh submodules is beginning to piss me off.  git silently failed and
refused to note that the submodule had been changed.  nothing reported
by "git diff", nothing reported by "git submodule" - nothing.
Comment 26 Luke Kenneth Casson Leighton 2020-05-19 22:14:25 BST
added OP_RFID it seemed to be missing?
Comment 27 Michael Nolan 2020-05-19 22:19:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> nggggh submodules is beginning to piss me off.  git silently failed and
> refused to note that the submodule had been changed.  nothing reported
> by "git diff", nothing reported by "git submodule" - nothing.

Yep... I haven't had that happen to me yet, but it's pissing me off too.

I removed the opcodes OP_TDI, OP_TD, OP_TWI, OP_TW and replaced them with OP_TRAP
Comment 28 Jacob Lifshay 2020-05-19 22:22:39 BST
(In reply to Michael Nolan from comment #27)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > nggggh submodules is beginning to piss me off.  git silently failed and
> > refused to note that the submodule had been changed.  nothing reported
> > by "git diff", nothing reported by "git submodule" - nothing.
> 
> Yep... I haven't had that happen to me yet, but it's pissing me off too.

We definitely should move to a monorepo
Comment 29 Luke Kenneth Casson Leighton 2020-05-19 22:41:37 BST
(In reply to Jacob Lifshay from comment #28)
> (In reply to Michael Nolan from comment #27)

> > Yep... I haven't had that happen to me yet, but it's pissing me off too.
> 
> We definitely should move to a monorepo

no.  that sends the wrong message, unfortunately [it says, very loudly, "other specific focussed alternative uses for our work not welcome, in the slightest.  download our monorepo containing everything you don't want or need, and like it.  or Go Home.  we don't care"]

when you have time, look up the nightmare of freeswitch ignoring standard development practices when it comes to dependency management.

the intention of having the csv and isatables in the wiki is to make it clear that they are independently useable *outside* of our project.

without requiring gigabyte additional downloads and dependencies which we need but they most certainly do not.

really they should be split out of the wiki entirely, to make it *really* clear that they are nothing to do with soc *or* libresoc.

it is just submodules pissing me off as i am not used to active development with them.

i'll get over it.
Comment 30 Luke Kenneth Casson Leighton 2020-05-19 22:44:37 BST
(In reply to Michael Nolan from comment #27)

> Yep... I haven't had that happen to me yet, but it's pissing me off too.

i keep forgetting to do "git submodule update " after every git pull.

i think i will write a shell script which does both pull and update.

hmm i wonder... maybe it should be added as a git hook

> I removed the opcodes OP_TDI, OP_TD, OP_TWI, OP_TW and replaced them with
> OP_TRAP

i saw.  i added OP_RFID to enums.  rhat was missing.
Comment 31 Luke Kenneth Casson Leighton 2020-05-22 03:39:44 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/power_decoder2.py;h=dbd87f967255f97bbec0b47a2b295780d6381033;hb=e46a1253d05ea47b9e62714ed50bcec6a7530d0f#l299

is that supposed to be a full 32 bit ranged lookup?

for branch, the full 32 bit CR is needed and BI looks up ...

oh i see.

the idea is to use 3 bits of BI to look up CR00 to 7

then in OP_BC look up the remaining 2 bits of BI to select which bit of CRn is to be used, right?

that's neat because different branches ysing different CRs would not block on the CR regfile.
Comment 32 Luke Kenneth Casson Leighton 2020-05-22 09:44:57 BST
(In reply to Luke Kenneth Casson Leighton from comment #31)

> the idea is to use 3 bits of BI to look up CR00 to 7

(3 MSBs)

> then in OP_BC look up the remaining 2 bits of BI to select which bit of CRn
> is to be used, right?

(2 LSBs)

except, small issue: endian-ness looks like it will be an issue.

in main_stage.py the bit-order is reversed using [31-BI]

        # The bit of CR selected by BI
        cr_bit = Signal(reset_less=True)
        comb += cr_bit.eq((self.i.cr & (1<<(31-BI))) != 0)

if we are to reduce self.i.cr (BranchInputData.cr) to 4 bits,
should that not be
        comb += cr_bit.eq((self.i.cr & (1<<(3-BI))) != 0)

but also, back in power_decoder2.py, for DecodeCRIn, what about here?

            with m.Case(CRInSel.BI):
                comb += self.cr_bitfield.data.eq(self.dec.BI[2:5])
                comb += self.cr_bitfield.ok.eq(1)
Comment 33 Michael Nolan 2020-05-22 14:10:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #32)
> (In reply to Luke Kenneth Casson Leighton from comment #31)
> 
> > the idea is to use 3 bits of BI to look up CR00 to 7
> 
> (3 MSBs)
> 
> > then in OP_BC look up the remaining 2 bits of BI to select which bit of CRn
> > is to be used, right?
> 
> (2 LSBs)

Yep, that's the idea!

> 
> except, small issue: endian-ness looks like it will be an issue.
> 
> in main_stage.py the bit-order is reversed using [31-BI]
> 
>         # The bit of CR selected by BI
>         cr_bit = Signal(reset_less=True)
>         comb += cr_bit.eq((self.i.cr & (1<<(31-BI))) != 0)
> 
> if we are to reduce self.i.cr (BranchInputData.cr) to 4 bits,
> should that not be
>         comb += cr_bit.eq((self.i.cr & (1<<(3-BI))) != 0)
> 
> but also, back in power_decoder2.py, for DecodeCRIn, what about here?
> 
>             with m.Case(CRInSel.BI):
>                 comb += self.cr_bitfield.data.eq(self.dec.BI[2:5])
>                 comb += self.cr_bitfield.ok.eq(1)

I think the register file should be designed to address the 8 crs in the correct bit order. If I ask for cr0, I should get cr0, whether cr0 comes from the 4 lsbs of the full register, or the 4 msbs.
Comment 34 Luke Kenneth Casson Leighton 2020-05-22 14:46:22 BST
(In reply to Michael Nolan from comment #33)
> (In reply to Luke Kenneth Casson Leighton from comment #32)
> > > then in OP_BC look up the remaining 2 bits of BI to select which bit of CRn
> > > is to be used, right?
> > 
> > (2 LSBs)
> 
> Yep, that's the idea!

excccellent, muhahaha oopssorry.

> I think the register file should be designed to address the 8 crs in the
> correct bit order. If I ask for cr0, I should get cr0, whether cr0 comes
> from the 4 lsbs of the full register, or the 4 msbs.

sigh.  i will need help when it comes to that.  i will have no clue which
way round.

basically i believe it may be quite simple... but this is POWER.

* the CR regfile will be subdivided into 8 separate latch-banks (not a
  standard SRAM, at all)

* there will be 8 read-enable (and write-enable) lines.

* it will be possible to read/write *all* of those simultaneously to obtain
  the full 32-bit CR.  this on a 32-bit data path

* it will also be possible to read/write individual 4-bit CRs by a single
  read/write-enable line.  this on a 4-bit data path

the DMs take care of ensuring that 32 and 4 bit read/writes are entirely
mutually exclusively time-sliced.

however for the purposes of this discussion, the LSBs/MSBs i *believe*
is as simple as hard-wiring the ordering correctly on the 32-bit path.

* all 8 4-bit read/writes *at the regfile*, to produce the full 32-bit CR,
  will be LSB/MSB *hard-wired* correctly, right there.


so with that in mind, would i be correct in thinking that the CR Regfile
actually needs to store the data as:

CR0: bit 3 2 1 0 CR1: 7 6 5 4 CR2: 11 10 9 8

?
Comment 35 Michael Nolan 2020-05-22 15:11:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #34)

> however for the purposes of this discussion, the LSBs/MSBs i *believe*
> is as simple as hard-wiring the ordering correctly on the 32-bit path.
> 
> * all 8 4-bit read/writes *at the regfile*, to produce the full 32-bit CR,
>   will be LSB/MSB *hard-wired* correctly, right there.
> 
> 
> so with that in mind, would i be correct in thinking that the CR Regfile
> actually needs to store the data as:
> 
> CR0: bit 3 2 1 0 CR1: 7 6 5 4 CR2: 11 10 9 8
> 
> ?

I think it'll actually be

CR0: bit 0 1 2 3 CR1: 4 5 6 7 CR2: 8 9 10 11

but it'll need some testing to be sure.
Comment 36 Luke Kenneth Casson Leighton 2020-05-22 15:33:48 BST
(In reply to Michael Nolan from comment #35)

> I think it'll actually be
> 
> CR0: bit 0 1 2 3 CR1: 4 5 6 7 CR2: 8 9 10 11

and.. um.. um... branch main_stage does crbit = cr[3-BI[0:2]] something like that?

> but it'll need some testing to be sure.

sigh :)
Comment 37 Michael Nolan 2020-05-22 15:51:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #36)
> (In reply to Michael Nolan from comment #35)
> 
> > I think it'll actually be
> > 
> > CR0: bit 0 1 2 3 CR1: 4 5 6 7 CR2: 8 9 10 11
> 
> and.. um.. um... branch main_stage does crbit = cr[3-BI[0:2]] something like
> that?

Just updated it to use the new interface, and yep that's what it does.
Comment 38 Michael Nolan 2020-05-22 19:21:34 BST
I have a formal proof for branches now, I handled the CRs in a similar way to the CR unit so I know the CR handling is good.

(reply at https://bugs.libre-soc.org/show_bug.cgi?id=335#c1)
Comment 39 Luke Kenneth Casson Leighton 2020-05-23 02:29:34 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/branch/pipe_data.py;h=28099eaa84f065065132208abaa9d8fa4a6f003f;hb=HEAD

whoops the regspec for cr needs to be 0:3 range.  4 means "select bit 4 only".  ie only a 1 bit wide CR.
Comment 40 Luke Kenneth Casson Leighton 2020-05-23 04:06:19 BST
looking at putting CR_ISEL into CR pipe, it also uses BC so i checked DecodeCRin and it puts all 5 bits of BC into the reg bitfield.

that seems to be an error, i think it should be BC[2:5] like the others

i grepped BC in the csv files and at present only BR uses BC in the CRin field.
Comment 41 Luke Kenneth Casson Leighton 2020-06-01 21:39:26 BST
bctar bcctr and bclr need to be added as unit tests.  bclr fails at the moment:

  File "fu/branch/test/test_pipe_caller.py", line 208, in assert_outputs
    self.assertEqual(branch_taken, sim_branch_taken, code)
AssertionError: 1 != False : bclr 0, 5, 0
Comment 42 Luke Kenneth Casson Leighton 2020-06-02 19:27:47 BST
michael i saw the change in fu/branch/test_pipe_caller.py to use spr2.
it now needs to be read_fast1.data and read_fast2.data,
not read_spr1.data and read_spr2.data.
Comment 43 Michael Nolan 2020-06-02 19:43:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #42)
> michael i saw the change in fu/branch/test_pipe_caller.py to use spr2.
> it now needs to be read_fast1.data and read_fast2.data,
> not read_spr1.data and read_spr2.data.

How do you think TAR should be handled? Should it be a fast SPR as well?
Comment 44 Luke Kenneth Casson Leighton 2020-06-02 20:09:36 BST
(In reply to Michael Nolan from comment #43)
> (In reply to Luke Kenneth Casson Leighton from comment #42)
> > michael i saw the change in fu/branch/test_pipe_caller.py to use spr2.
> > it now needs to be read_fast1.data and read_fast2.data,
> > not read_spr1.data and read_spr2.data.
> 
> How do you think TAR should be handled? Should it be a fast SPR as well?

yes.  it's accessed often enough.

if you can hack that into DecodeA and DecodeB, such that this is in power_decoder2.py
rather than in the unit test, it would be great:

       insn_type = yield dec2.e.insn_type
        if insn_type == InternalOp.OP_BCREG.value:
            xo9 = yield dec2.dec.FormXL.XO[9]
            xo5 = yield dec2.dec.FormXL.XO[5]
            if xo9 == 0:
                yield branch.p.data_i.spr2.eq(sim.spr['LR'].value)
            elif xo5 == 1:
                yield branch.p.data_i.spr2.eq(sim.spr['TAR'].value)
            else:
                yield branch.p.data_i.spr2.eq(sim.spr['CTR'].value)

then, read_fast1 and read_fast2 can be used in both unit tests:
fu/compunit/test/test_branch_compunit.py
and
fu/branch/test/test_pipe_caller.py

if you can handle the modifications to power_decoder2.py that would be
great.

bear in mind that DecodeA reads CTR, (which goes into read_fast1)
and DecodeB reads LR or TAR (which goes into read_fast2).

so the above test needs to be split.
Comment 45 Luke Kenneth Casson Leighton 2020-06-02 21:11:58 BST
it's ok i got it, i appreciate you're dropping in on occasion at the moment.
Comment 46 Michael Nolan 2020-06-02 21:19:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #44)

> bear in mind that DecodeA reads CTR, (which goes into read_fast1)
> and DecodeB reads LR or TAR (which goes into read_fast2).
> 
> so the above test needs to be split.

So main_stage.py will need to check whether the instruction is a bcctr and use spr1, otherwise use spr2?
Comment 47 Luke Kenneth Casson Leighton 2020-06-02 21:30:58 BST
(In reply to Michael Nolan from comment #46)
> (In reply to Luke Kenneth Casson Leighton from comment #44)
> 
> > bear in mind that DecodeA reads CTR, (which goes into read_fast1)
> > and DecodeB reads LR or TAR (which goes into read_fast2).
> > 
> > so the above test needs to be split.
> 
> So main_stage.py will need to check whether the instruction is a bcctr and
> use spr1, otherwise use spr2?

sigh yes.  although what i was going to do was to just make BCREG use spr2
for CTR/LR/TAR and BC uses spr1=CTR.

i believe that may end up being simpler, because main_stage.py would not
need to decode XO[5/9]?
Comment 48 Michael Nolan 2020-06-02 22:03:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #44)

> yes.  it's accessed often enough.

Lol, back when I was doing some PowerPC reverse engineering I never saw a bctar. Then again, it was a fairly old chip so TAR may not have been introduced yet. 


> 
> then, read_fast1 and read_fast2 can be used in both unit tests:
> fu/compunit/test/test_branch_compunit.py
> and
> fu/branch/test/test_pipe_caller.py

This has been done to test_pipe_caller, still need to do the compunit one.
Comment 49 Luke Kenneth Casson Leighton 2020-06-02 22:33:55 BST
(In reply to Michael Nolan from comment #48)
> (In reply to Luke Kenneth Casson Leighton from comment #44)
> 
> > yes.  it's accessed often enough.
> 
> Lol, back when I was doing some PowerPC reverse engineering I never saw a
> bctar.

okok :)  put another way: i don't want the hassle of adding an extra regfile
port (a slow SPR one) to the branch regspecs :)

> > then, read_fast1 and read_fast2 can be used in both unit tests:
> > fu/compunit/test/test_branch_compunit.py
> > and
> > fu/branch/test/test_pipe_caller.py
> 
> This has been done to test_pipe_caller, still need to do the compunit one.

the compunit one _should_ just pick it up automatically, because of the use
of regspecs.

howeverrrr.... there's something fishy going on in test_bc_cr, which
is only noticed on the *next* instruction, test_bc_ctr

here's test_bc_cr output phase:

check extra output 'bc 12, 3, -12720' {}  <-- uh-oh... no output regs.


test_bc_ctr


following that, this - in the output log - is extremely bad:

before inputs, rd_rel, wr_rel:  0b1100 0b101 <--- wr_rel *must* be zero here.
rd_rel 0 wait HI 0 spr1 0x57b4136e
rd_rel 0 wait HI 1 spr1 0x57b4136e
rd_rel 2 wait HI 1 cr_a 0x0
rd_rel 3 wait HI 1 cia 0x0

if that occurs, it means that the pipeline has set Data.ok flags but that
PowerDecode2 has *NOT* set the corresponding write regs information.

or the other way round.
Comment 50 Luke Kenneth Casson Leighton 2020-06-02 22:38:28 BST
problem is caused by test_bc_reg() - any one of those will cause
subsequent tests to fail
Comment 51 Luke Kenneth Casson Leighton 2020-06-02 23:41:28 BST
found it.  when no registers were to be written, MultiCompUnit couldn't cope.
this actually happens with some of the branch tests.  no change to LR,
no change to CTR, no change to NIA (because the branch was not taken):
wark-wark.

i added yet another hacked-in condition to MCU to get it to test if the
ALU was finished.