Bug 336 - ALU CompUnit needs to recognise that RA (src1) can be zero
Summary: ALU CompUnit needs to recognise that RA (src1) can be zero
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Cesar Strauss
URL:
Depends on: 334
Blocks: 383
  Show dependency treegraph
 
Reported: 2020-05-21 17:24 BST by Luke Kenneth Casson Leighton
Modified: 2021-01-02 15:30 GMT (History)
1 user (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
Suggested pipeline design for the ALU Comp Unit (63.63 KB, application/pdf)
2020-06-15 00:21 BST, Cesar Strauss
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-05-21 17:24:42 BST
whilst DecodeA picks up that RA may be a zero immediate, this is not actually passed through Decode2ExecuteType, nor does the ALU CompUnit recognise it.
Comment 1 Luke Kenneth Casson Leighton 2020-05-21 17:35:52 BST
forgot that i had raised 334 last night.  this one can therefore be about the CompUnit.

the change needed is:

* first that Decode2ExecuteType needs a flag ra_is_zero

* second that CompAluOpSubset and CompLogicalALUSubset need likewise

* third - and this is the kicker - that the FSM for src1_i needs a Mux to select RA or zero *and* it needs to *no longer* request a READ for src1 (rd.req[0]) if the ra_is_zero flag is set.

* additionally that if the z flag is set, the rd_done part of the FSM does NOT wait for GO_READ on the src1 port (rd.go[0])
i will draw out a diagram.

again.

:)
Comment 2 Cesar Strauss 2020-05-21 22:12:48 BST
Dear Luke,

Please do not bother drawing a gate level diagram for the FSM. Your description is enough.

For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do represent the whole FSM as an empty box.

I find that I can better design an FSM by starting from a behavioral description, preferably in the form of a timing diagram.

If you like, you can draw one in paper, of what you want to see in the GTKWave window. With arrows linking what edge causes which transition in another signal.

What do you think?

Regards,

Cesar
Comment 3 Luke Kenneth Casson Leighton 2020-05-21 22:41:39 BST
(In reply to Cesar Strauss from comment #2)
> Dear Luke,
> 
> Please do not bother drawing a gate level diagram for the FSM. Your
> description is enough.

we need one for the documentation, which is an extremely important aspect of this project.

plus i find it very difficult to think of timing related designs without a gate level drawing.

as in i literally can't write these FSMs without one (!)

this one is wrong (imm needs muxing)
https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

the imm mux path in this one is correct
https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

immed mode hooks RD_REQ2, switching it off.
 
> For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do
> represent the whole FSM as an empty box.
> 
> I find that I can better design an FSM by starting from a behavioral
> description,

then comment 1 should do the trick?

RD_REQ1 (rd.req[0]) needs to be disabled if oper_i.zero_a is set, and src_i[0] set to zero in all 64 bits.


> preferably in the form of a timing diagram.

i am used to examining those, drawing them not so much.

 
> If you like, you can draw one in paper, of what you want to see in the
> GTKWave window. With arrows linking what edge causes which transition in
> another signal.
> 
> What do you think?

i can give it a shot once i am confident with the gate level.

are you ok to go via the description for now?
Comment 4 Cesar Strauss 2020-05-22 02:11:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Cesar Strauss from comment #2)
> > Dear Luke,
> > 
> > Please do not bother drawing a gate level diagram for the FSM. Your
> > description is enough.
> 
> we need one for the documentation, which is an extremely important aspect of
> this project.

I agree, documentation is essencial.

But I argue that a diagram that shows too much detail, and goes all the way to gate level, is more like source code than documentation.

> plus i find it very difficult to think of timing related designs without a
> gate level drawing.
> 
> as in i literally can't write these FSMs without one (!)

Sure, I understand. It's just practice, and realizing the convenience.

At the start, I used to input all my FPGA designs with a schematics editor. The FPGA library actually had blocks for many of the 74XX logic family parts.

Then, as I learned about VHDL, I started translating some of my previous designs to it.

Eventually, I got so used to it, that I started writing VHDL directly. It's so much more compact, so less tangled wires, so less time spent arranging the symbols around the page.

Lastly, I learned about Verilog, and switched to it for its simplicity.

> > For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do
> > represent the whole FSM as an empty box.
> > 
> > I find that I can better design an FSM by starting from a behavioral
> > description,
> 
> then comment 1 should do the trick?

Yes, it's all I need.

Regards,

Cesar
Comment 5 Luke Kenneth Casson Leighton 2020-05-23 13:51:01 BST
Cesar, hi,

just to let you know, this should *not* interfere with adding is_zero support.  it however will be necessary to use this:

    if hasattr(self.oper_i, "zero_a"):
        # ok now we know that this operation requires zero_a support, we can        
        # mux in zero_a.  otherwise we must put register a straight through

i am doing a minor code-morph on MultiCompUnit which makes it suitable for general use across multiple pipelines, as specified by "regspecs".  an example is below.

note that this example - which uses CompCROpSubset **NOT** CompALUOpSubset -
is why we will need the "detection" above.

please do not let this stop you from adding zero_a detection support, let us
keep in touch and coordinate here.

i will have to add "detect if there is an operand B immediate" in the same way,
so you can see how it is done.


class CompCROpSubset(Record):
    def __init__(self, name=None):
        layout = (('insn_type', InternalOp),
                  ('fn_unit', Function),
                  ('insn', 32),
                  ('read_cr_whole', 1),
                  ('write_cr_whole', 1),
                  )

class CRInputData(IntegerData):
    regspec = [('INT', 'a', '0:63'),      # 64 bit range
               ('INT', 'b', '0:63'),      # 6B bit range
               ('CR', 'full_cr', '0:31'), # 32 bit range
               ('CR', 'cr_a', '0:3'),     # 4 bit range
               ('CR', 'cr_b', '0:3'),     # 4 bit range
               ('CR', 'cr_c', '0:3')]     # 4 bit range
class CROutputData(IntegerData):
    regspec = [('INT', 'o', '0:63'),      # 64 bit range
               ('CR', 'full_cr', '0:31'), # 32 bit range
               ('CR', 'cr_o', '0:3')]     # 4 bit range
    ...

class CRPipeSpec(CommonPipeSpec):
    regspec = (CRInputData.regspec, CROutputData.regspec)
    opsubsetkls = CompCROpSubset
Comment 6 Luke Kenneth Casson Leighton 2020-05-23 14:07:12 BST
ok done for imm_data, here is the location, it is virtually a cookie-cut of what
is done for imm_data, bear in mind it will need to be src_l.q[0] and because
src_l.q[0] covers RA, where src_l.q[1] covers RB.  likewise the src1_or_imm
needs to go into sl[0] not sl[1]


```
        # if the operand subset has "zero_a" we implicitly assume that means
        # src_i[0] is an INT register type where zero can be multiplexed in, instead.
        # see https://bugs.libre-soc.org/show_bug.cgi?id=336
        #if hasattr(oper_r, "zero_a"):
            # select zero immediate if opcode says so.  however also change the latch
            # to trigger *from* the opcode latch instead.
            # ...
            # ...

        # if the operand subset has "imm_data" we implicitly assume that means
        # "this is an INT ALU/Logical FU jobbie, RB is multiplexed with the immediate"
        if hasattr(oper_r, "imm_data"):
            # select immediate if opcode says so.  however also change the latch
            # to trigger *from* the opcode latch instead.
            op_is_imm = oper_r.imm_data.imm_ok
```
Comment 7 Cesar Strauss 2020-05-23 20:09:27 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> ok done for imm_data, here is the location, it is virtually a cookie-cut of
> what
> is done for imm_data, bear in mind it will need to be src_l.q[0] and because
> src_l.q[0] covers RA, where src_l.q[1] covers RB.  likewise the src1_or_imm
> needs to go into sl[0] not sl[1]

I have done the above. I held my commit because you seemed to be working on the file. It's OK to commit now?
Comment 8 Luke Kenneth Casson Leighton 2020-05-23 20:16:28 BST
(In reply to Cesar Strauss from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > ok done for imm_data, here is the location, it is virtually a cookie-cut of
> > what
> > is done for imm_data, bear in mind it will need to be src_l.q[0] and because
> > src_l.q[0] covers RA, where src_l.q[1] covers RB.  likewise the src1_or_imm
> > needs to go into sl[0] not sl[1]
> 
> I have done the above.

excellent.

> I held my commit because you seemed to be working on the file.

yeh don't do that: just shove it in :)

> It's OK to commit now?

git pull to merge, first, then go for it.  i happen to just be going out shopping
anyway.
Comment 9 Luke Kenneth Casson Leighton 2020-05-23 20:39:57 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Cesar Strauss from comment #7)
> > (In reply to Luke Kenneth Casson Leighton from comment #6)
> > > ok done for imm_data, here is the location, it is virtually a cookie-cut of
> > > what
> > > is done for imm_data, bear in mind it will need to be src_l.q[0] and because
> > > src_l.q[0] covers RA, where src_l.q[1] covers RB.  likewise the src1_or_imm
> > > needs to go into sl[0] not sl[1]
> > 
> > I have done the above.
> 
> excellent.
> 
> > I held my commit because you seemed to be working on the file.
> 
> yeh don't do that: just shove it in :)
> 
> > It's OK to commit now?
> 
> git pull to merge, first, then go for it.  i happen to just be going out
> shopping
> anyway.

looks great, cesar - unit test time.  it should be as simple as setting
dut.oper_i.zero_a.eq(1) - or adding an extra argument to op_sim() to
allow that to be set dynamically - then setting a to a non-zero value
and checking that, yes, actually, a is now irrelevant (set to zero).

a more advanced version of the op_sim() test should be not to set rd.go.eq(0b11)
in an ad-hoc fashion: instead to check rd_rel_o bits.

this starts to be very much like the planned LD/ST unit test modifications,
and you could probably use the exact same code for both.

however please do everything in an incremental fashion: no more than 5-10/15
lines at a time when modifying existing code (if that is practical).

so, start with adding the extra argument to op_sim() to allow zero_a to be set

then add a test call in scoreboard_sim op_sim(a=12312313213, zero_a=True, ...)

and so on.
Comment 10 Luke Kenneth Casson Leighton 2020-05-23 21:52:31 BST
cesar i created a common function for both zero a and imm b muxing.  should be functionally the same.

note that the Mux will automatically have its inputs extended with zero padding to the max operand width.

so i replaced Const 0 with just 0
Comment 11 Luke Kenneth Casson Leighton 2020-05-24 00:10:34 BST
brilliant, just saw this, Cesar.  confirmed that it is functional.

commit 260625df9309f8f35541207cab431dd8dba90c5a
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Sat May 23 19:52:08 2020 -0300

    Add a few test cases with zero_a set, in combination with imm_ok


so the next one is: we need something that's a bit more sophisticated,
designed more along the lines of test_inout_mux_pipe.py:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;h=03fddde0f7ab12ec9023fc57b913442ba42c120a;hb=HEAD#l52

not quite that sophisticated, but close.

basically, we need:

* one function (similar to TestInput.send) which takes care of *one*
  REQ_READ and GO_RD signalling bit.  it should monitor:
  self.rd.req[N], then set self.rd.go[N] for *one* cycle (one blank yield)
  then set it self.rd.go[N] back to zero and confirm (assert) that
  self.rd.req[N] has gone to zero

* another function (similar again to TestInput.rcv) which does the same
  thing for REQ_WRITE and GO_WRITE.

* another function - a very simple one - which sets issue_i
  for one cycle (yield in between), then waits for "self.busy" to drop to 0

we can make it more sophisticated as we go along.

it should be... only... about... 80 lines of code, at a guess.
Comment 12 Luke Kenneth Casson Leighton 2020-05-24 02:46:13 BST
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_buf_pipe.py;hb=HEAD

just occurred to me, see class Test3, this nay be simpler to start from.

remove the for loops for now, add a parameter read_idx to the send function, and write_idx to rcv function.

read_idx will refer to self.compunit.rd.rel[read_idx] you see how that would go?
Comment 13 Cesar Strauss 2020-05-24 14:46:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #11)
> brilliant, just saw this, Cesar.  confirmed that it is functional.
> 
> commit 260625df9309f8f35541207cab431dd8dba90c5a
> Author: Cesar Strauss <cestrauss@gmail.com>
> Date:   Sat May 23 19:52:08 2020 -0300
> 
>     Add a few test cases with zero_a set, in combination with imm_ok
> 

Still need a way to test the case of zero_a and/or imm_data being entirely absent, in which case the muxes are not generated.
Comment 14 Luke Kenneth Casson Leighton 2020-05-24 15:00:28 BST
(In reply to Cesar Strauss from comment #13)

> Still need a way to test the case of zero_a and/or imm_data being entirely
> absent, in which case the muxes are not generated.

ah if you create a test which uses... mmm.... which one is dead-simple...
CompCROpSubset, then that doesn't have imm_data or zero_a.

see soc.experiment.alu_hier.DummyALU

i've just added a DummyALU for you, it ignores operand b, ignores the
actual operand type (up to you if you want to put that back, do something
different), just copies a to o - nothing sophisticated.

so you can copy test_compunit_regspec1(), call it test_compunit_regspec2()
then cookie-cut scoreboard_sim() and op_sim() to remove trying to set imm_data,
invert_a, and zero_a (because those don't exist in CompCROpSubset), and
just assert that the result is equal to the 1st operand, a.

if you really wanted to you could actually remove b from the inspec in
test_compunit_regspec2() and also in DummyALU.
Comment 15 Cesar Strauss 2020-05-24 15:37:24 BST
Found this:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l213

    name = "data_r%d" % i
    data_r = Signal(self.cu._get_srcwid(i), name=name, reset_less=True)
    latchregister(m, self.get_out(i), data_r, req_l.q[i], name)

We are giving the same name for two things, the output signal of the latchregister, and its internal DFF (given by the function parameter). nMigen will append "$1" to one of them, but which?

Can we go and choose a name, given to its internal DFF, different from its output, in this and similar cases?

Also:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l208

    latchregister(m, self.oper_i, oper_r, self.issue_i, "oper_r")

In this case, as it happens, oper_r sub-signals do not receive names that begin with "oper_r". So, what receives the "oper_r" name is actually the internal DFF, which confused me at no end on the GTKWave timing diagram. This is how I found this issue.
Comment 16 Luke Kenneth Casson Leighton 2020-05-24 15:44:32 BST
(In reply to Cesar Strauss from comment #15)
> Found this:
> 
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/
> compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l213
> 
>     name = "data_r%d" % i
>     data_r = Signal(self.cu._get_srcwid(i), name=name, reset_less=True)
>     latchregister(m, self.get_out(i), data_r, req_l.q[i], name)
> 
> We are giving the same name for two things, the output signal of the
> latchregister, and its internal DFF (given by the function parameter).
> nMigen will append "$1" to one of them, but which?

wonderful, isn't it?

> Can we go and choose a name, given to its internal DFF, different from its
> output, in this and similar cases?

sure.  i'd suggest doing this:
     latchregister(m, self.get_out(i), data_r, req_l.q[i], name)
-->
     latchregister(m, self.get_out(i), data_r, req_l.q[i], name+"_l")

> 
> Also:
> 
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/
> compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l208
> 
>     latchregister(m, self.oper_i, oper_r, self.issue_i, "oper_r")

sure - feel free to call it "oper_l" instead
Comment 17 Luke Kenneth Casson Leighton 2020-05-24 16:07:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #16)

> sure.  i'd suggest doing this:
>      latchregister(m, self.get_out(i), data_r, req_l.q[i], name)
> -->
>      latchregister(m, self.get_out(i), data_r, req_l.q[i], name+"_l")

> sure - feel free to call it "oper_l" instead

btw, commit this immediately - do not wait to "accumulate" a series of commits,
and *definitely* do not put multiple separate and distinct "purposes" into
the one commit.

git is intelligent enough to merge commits that are not on the exact same line.
therefore, to speed up collaboration, do take advantage of that: do not wait,
just commit it - and commit "single-purpose".  if you find you want to use the
word "and" in the commit message, *stop* - it means that the commit is not
single-purpose.  more on that in HDL_workflow.
Comment 18 Luke Kenneth Casson Leighton 2020-05-24 21:01:47 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)

> btw, commit this immediately - do not wait to "accumulate" a series of
> commits,

nicely done, cesar.

commit a5968ecf27c0be3b357342889aaa8e48c1a519cb
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Sun May 24 16:44:12 2020 -0300

    Rename the internal DFF of latchregisters to avoid conflict
Comment 19 Cesar Strauss 2020-05-24 21:29:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #17)
> 
> > btw, commit this immediately - do not wait to "accumulate" a series of
> > commits,
> 
> nicely done, cesar.
> 
> commit a5968ecf27c0be3b357342889aaa8e48c1a519cb
> Author: Cesar Strauss <cestrauss@gmail.com>
> Date:   Sun May 24 16:44:12 2020 -0300
> 
>     Rename the internal DFF of latchregisters to avoid conflict

Thanks.

When I did the above, I noticed src/soc/experiment/compalu.py and src/soc/experiment/compldst.py have been suffering from bitrot. I suppose they are previous attempts, no longer updated, and kept for historical reasons?
Comment 20 Luke Kenneth Casson Leighton 2020-05-24 21:41:42 BST
(In reply to Cesar Strauss from comment #19)

> When I did the above, I noticed src/soc/experiment/compalu.py and
> src/soc/experiment/compldst.py have been suffering from bitrot. I suppose
> they are previous attempts, no longer updated, and kept for historical
> reasons?

they're only there from the transition to multi-bit RD/WR signalling.
Comment 21 Cesar Strauss 2020-05-25 01:56:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> * third - and this is the kicker - that the FSM for src1_i needs a Mux to
> select RA or zero *and* it needs to *no longer* request a READ for src1
> (rd.req[0]) if the ra_is_zero flag is set.
> 
> * additionally that if the z flag is set, the rd_done part of the FSM does
> NOT wait for GO_READ on the src1 port (rd.go[0])

Just a reminder that the FSM still needs adjustment for rd.rel/rd.go. Checked in GTKWave that rd.rel is still being set unconditionally after issue_i.
Comment 22 Luke Kenneth Casson Leighton 2020-05-25 02:19:55 BST
(In reply to Cesar Strauss from comment #21)

> Just a reminder that the FSM still needs adjustment for rd.rel/rd.go.
> Checked in GTKWave that rd.rel is still being set unconditionally after
> issue_i.

if rd.rel[0] is being set when zero_a is on, yes that's bad.

rd.go[N] is an external response to rd.req[N].  this is why it is critical to get rd.req and wr.req bits set right because the external user of the CompUnits kniws nothing about *why* they are set, they just respond.
Comment 23 Luke Kenneth Casson Leighton 2020-05-25 02:27:14 BST
 255         # all request signals gated by busy_o.  prevents picker problems
 256         m.d.comb += self.busy_o.eq(opc_l.q) # busy out
 257         bro = Repl(self.busy_o, self.n_src)
 258         m.d.comb += self.rd.rel.eq(src_l.q & bro) # src1/src2 req rel
 259 

ah.

right.

try this, at line 258

sls = []
for l in sl:
   sls.append(l[2]) # get latch conditions
self.rd.rel.eq(Cat(*sls) & bro)

that is bringing in the immediate selection.  i think.  from the muxes.
Comment 24 Luke Kenneth Casson Leighton 2020-05-25 02:31:48 BST
m.d.comb += src_sel.eq(Mux(op_is_imm, self.opc_l.q, self.src_l.q[i]))
sl[i][2] = src_sel

hmm you see how that interacts, from the src_srl?
Comment 25 Luke Kenneth Casson Leighton 2020-05-25 03:21:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> m.d.comb += src_sel.eq(Mux(op_is_imm, self.opc_l.q, self.src_l.q[i]))
> sl[i][2] = src_sel
> 
> hmm you see how that interacts, from the src_srl?

not quite correct, did something different, do git pull see if you can
get the test to work.  will take a look in morning.
Comment 26 Luke Kenneth Casson Leighton 2020-05-25 03:29:11 BST
interesting!

    # XXX - immediate and zero is not a POWER mode (and won't work anyway)
    # reason: no actual operands.
    #result = yield from op_sim(dut, 5, 2, InternalOp.OP_ADD, zero_a=1,
    #                                imm=8, imm_ok=1)
    #assert result == 8

this test is invalid, because it effectively is a "update register with
immediate", disabling both A and B regs.

i believe there are other opcodes to cover this, therefore it is not a
valid test.   which is good because the FSM doesn't cope with it
(it relies on at least one RD.GO coming in, and if both operands are
immediates, that can never happen)
Comment 27 Luke Kenneth Casson Leighton 2020-05-25 05:05:03 BST
argh yes it is a valid test.  there is an add zero immediate instruction.

so this means rd_done will never be set, and the FSM never proceeds to completion on 2-operand CompUnits
Comment 28 Cesar Strauss 2020-05-25 08:28:45 BST
Don't lose sleep about it, I can handle the FSM.(In reply to Luke Kenneth Casson Leighton from comment #25)
> not quite correct, did something different, do git pull see if you can
> get the test to work.  will take a look in morning.

Don't lose sleep about it, I can handle the FSM. It's my bread and butter.
Comment 29 Cesar Strauss 2020-05-25 10:42:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Cesar Strauss from comment #21)
> 
> > Just a reminder that the FSM still needs adjustment for rd.rel/rd.go.
> > Checked in GTKWave that rd.rel is still being set unconditionally after
> > issue_i.
> 
> if rd.rel[0] is being set when zero_a is on, yes that's bad.
> 
> rd.go[N] is an external response to rd.req[N].  this is why it is critical
> to get rd.req and wr.req bits set right because the external user of the
> CompUnits kniws nothing about *why* they are set, they just respond.

In this case, "external user" are the dependency matrices, right?

I've read https://libre-soc.org/3d_gpu/architecture/6600scoreboard/.

As far as I understand it, the Dependency Matrices also need to know themselves that the operand is immediate, and not set the corresponding FU-Reg Latch, correct? Is this already implemented?
Comment 30 Luke Kenneth Casson Leighton 2020-05-25 11:43:52 BST
(In reply to Cesar Strauss from comment #28)
> Don't lose sleep about it, I can handle the FSM.(In reply to Luke Kenneth
> Casson Leighton from comment #25)
> > not quite correct, did something different, do git pull see if you can
> > get the test to work.  will take a look in morning.
> 
> Don't lose sleep about it, I can handle the FSM. It's my bread and butter.

:)

i realised that actually there's probably nothing wrong with the modification
i made: it's the unit test that must not check a flag (rd.req) that is never
going to get set.


(In reply to Cesar Strauss from comment #29)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > if rd.rel[0] is being set when zero_a is on, yes that's bad.
> > 
> > rd.go[N] is an external response to rd.req[N].  this is why it is critical
> > to get rd.req and wr.req bits set right because the external user of the
> > CompUnits kniws nothing about *why* they are set, they just respond.
> 
> In this case, "external user" are the dependency matrices, right?

yes.

> I've read https://libre-soc.org/3d_gpu/architecture/6600scoreboard/.
> 
> As far as I understand it, the Dependency Matrices also need to know
> themselves that the operand is immediate, 

not quite.  immediates are immediate, therefore there is no concept of a
"dependency".  

therefore, technically you are correct, but it is by way of "total omission
of involvement".

> and not set the corresponding FU-Reg Latch, correct?

by way of the DMs not in the slightest remote bit knowing *anything* about
immediates.... yes.

> Is this already implemented?

yes.  the DMs manage "dependencies".  an immediate is not a Dependency
Hazard (not a read or write register).
Comment 31 Cesar Strauss 2020-05-25 16:14:34 BST
Regarding the parallel test:

Pushed a proof of concept.

Defective as it was, I pushed it anyway, hoping for some feedback on the bug. Found it myself literally seconds after pushing.
Comment 32 Luke Kenneth Casson Leighton 2020-05-25 16:25:43 BST
(In reply to Cesar Strauss from comment #31)
> Regarding the parallel test:
> 
> Pushed a proof of concept.
> 
> Defective as it was, I pushed it anyway, hoping for some feedback on the
> bug. Found it myself literally seconds after pushing.

:)

ok i added some stubs here.
Comment 33 Luke Kenneth Casson Leighton 2020-05-25 16:32:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #32)
> (In reply to Cesar Strauss from comment #31)
> > Regarding the parallel test:
> > 
> > Pushed a proof of concept.
> > 
> > Defective as it was, I pushed it anyway, hoping for some feedback on the
> > bug. Found it myself literally seconds after pushing.
> 
> :)
> 
> ok i added some stubs here.

right.  so i've pushed another commit which shows where we would expect to
put the inputs in (constructor).  rather than have them directly as arguments
to op_sim(), all those arguments - a, b, op, inv_a, imm_ok, zero_a - they
become the arguments to the constructor.

also expected_o should be in the constructor as well.
Comment 34 Cesar Strauss 2020-05-26 00:06:24 BST
(In reply to Luke Kenneth Casson Leighton from comment #33)
> right.  so i've pushed another commit which shows where we would expect to
> put the inputs in (constructor).  rather than have them directly as arguments
> to op_sim(), all those arguments - a, b, op, inv_a, imm_ok, zero_a - they
> become the arguments to the constructor.
> 
> also expected_o should be in the constructor as well.

Implemented the issue_i/busy_o protocol check, for the moment.

Will add the above as I need them, incrementally.
Comment 35 Luke Kenneth Casson Leighton 2020-05-26 00:41:36 BST
(In reply to Cesar Strauss from comment #34)


> Implemented the issue_i/busy_o protocol check, for the moment.

i saw.  looks good.

> Will add the above as I need them, incrementally.

ok.

it will be fascinating to compare against the formal proof
Comment 36 Cesar Strauss 2020-05-26 11:01:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> it will be fascinating to compare against the formal proof

I saw src/soc/fu/compunits/formal/proof_fu.py. Very interesting.

How about a formal proof of the ALU interface?

Also, maybe DummyALU could be truly dummy (doesn't drive its signals), and let Assume drive them, in the CompUnit formal proof.
Comment 37 Luke Kenneth Casson Leighton 2020-05-26 11:15:16 BST
(In reply to Cesar Strauss from comment #36)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > it will be fascinating to compare against the formal proof
> 
> I saw src/soc/fu/compunits/formal/proof_fu.py. Very interesting.

it is also very "clean".

> How about a formal proof of the ALU interface?

the interaction with it will be needed, yes, because of the signalling ready/valid.

in particular the ALU may *not* necessarily be ready to proceed at the time operands are available.

> Also, maybe DummyALU could be truly dummy (doesn't drive its signals), and
> let Assume drive them, in the CompUnit formal proof.

i have no idea.  we do need some way to check that data actually gets passed through.
Comment 38 Cesar Strauss 2020-05-26 17:29:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> (In reply to Cesar Strauss from comment #36)
> > How about a formal proof of the ALU interface?
> 
> the interaction with it will be needed, yes, because of the signalling
> ready/valid.
> 
> in particular the ALU may *not* necessarily be ready to proceed at the time
> operands are available.

Feel free to put this on my queue, if you wish (formal proof of the CompUnit-ALU interface).

While I have no practical experience on formal proofs, I've been reading about it for some time, with the goal of applying it to my own designs. I like to think I've reached a good understanding.
Comment 39 Luke Kenneth Casson Leighton 2020-05-26 18:22:35 BST
(In reply to Cesar Strauss from comment #38)

> > in particular the ALU may *not* necessarily be ready to proceed at the time
> > operands are available.
> 
> Feel free to put this on my queue, if you wish (formal proof of the
> CompUnit-ALU interface).

sure, go for it.

> While I have no practical experience on formal proofs, I've been reading
> about it for some time, with the goal of applying it to my own designs. I
> like to think I've reached a good understanding.

good place to learn.

i would suggest simply adding them to proof_fu.py - focus on ready_(i/o),
valid_(i/o) - these are accessible signals.

if you assume that the DummyALU takes exactly one cycle between data_i and
data_o, this may make life easier.  the sequence will go:

* p.valid_i is set HI by CU
* p.ready_o is asserted HI by ALU in response (takes 1 cycle and asserts for
  ONLY one cycle)
* n.valid_o is asserted HI by ALU
* n.ready_i is set HI by CU
* n.valid_o is asserted LO by ALU in response (takes 1 cycle)

regarding data:

* data input should be placed by CU on the input at the same time as it
  raises p.valid_i.  it should REMAIN valid until p.ready_o is seen asserted.

* data output should be ready and valid when n_ready_i is HI.
all other times, it *may* still be left on the bus, but should *NOT* be
taken to be valid.
 
this is basically a standard ready/valid signalling system used by wishbone,
AXI4, etc. etc. etc. etc.
Comment 40 Luke Kenneth Casson Leighton 2020-05-31 17:42:02 BST
cesar i've moved the unit tests for MultiCompUnit into their own file,
they were getting very big and justified their own module.
Comment 41 Cesar Strauss 2020-06-01 12:09:03 BST
I updated the unit test to deal with rdmaskn.

For how long rdmaskn must be held valid? If I assert it for only one cycle, together with issue_i, it doesn't work. rel still rises on the next cycle.

Shouldn't it be latched, as with zero_a and imm_ok?
Comment 42 Luke Kenneth Casson Leighton 2020-06-01 12:46:27 BST
Cesar: this is kiiinda true... indirectly :)  i'm mulling over the words...

rdmask disables (entirely disables) bits of rd.rel, and consequently
rd.go *should* never be raised.  rd.rel being the output (which is
masked out), rd.go being the input, it's a "protocol violation" if
raised.

commit f930c1814f19c41f082c819494cf246edd09082d
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Mon Jun 1 07:56:12 2020 -0300

    Add rdmaskn parameter and assert it along issue_i
    
    Like zero_a and imm_ok, rdmaskn disallows the activation of go.
Comment 43 Cesar Strauss 2020-06-01 12:55:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #42)
> rdmask disables (entirely disables) bits of rd.rel, and consequently
> rd.go *should* never be raised.  rd.rel being the output (which is
> masked out), rd.go being the input, it's a "protocol violation" if
> raised.

So rdmask, as an input, must be driven and held high, for all the time that busy_o is active?
Comment 44 Luke Kenneth Casson Leighton 2020-06-01 13:01:22 BST
(In reply to Cesar Strauss from comment #41)
> I updated the unit test to deal with rdmaskn.
> 
> For how long rdmaskn must be held valid?

the entire time (the entire busy cycle)

> If I assert it for only one cycle,
> together with issue_i, it doesn't work. rel still rises on the next cycle.
> 
> Shouldn't it be latched, as with zero_a and imm_ok?

hmmm hmmm that's a really good point.  actually... it should be generated
from the Decoder2 information (Decode2Execute1 in power_decoder2.py),
and brought in via the CompXXXOpSubsets.

if it came in via those, it would be part of oper_i and would automatically
be latched.

it's *different* information depending on which Function Unit
is used, based on the regspecs which are collected together in
soc/fu/compunits/compunits.py (from soc/fu/*/pipe_data.py)

i'll need to think about it - can you set it manually for now and hold
it set for the duration of busy_o == True?

btw it should not be necessary to set the entire mask to all 1s, because
there does not exist an FU which takes zero operands.
Comment 45 Luke Kenneth Casson Leighton 2020-06-01 13:02:46 BST
(In reply to Cesar Strauss from comment #43)

> So rdmask, as an input, must be driven and held high, for all the time that
> busy_o is active?

yes - for now.  i will look into whether (and how, and if) it should be
moved into CompXXXOpSubsets (into oper_i).
Comment 46 Luke Kenneth Casson Leighton 2020-06-01 22:50:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #45)
> (In reply to Cesar Strauss from comment #43)
> 
> > So rdmask, as an input, must be driven and held high, for all the time that
> > busy_o is active?
> 
> yes - for now.  i will look into whether (and how, and if) it should be
> moved into CompXXXOpSubsets (into oper_i).

just fyi, Cesar:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/alu/pipe_data.py;h=33f54b026b95e06860a9f8b6a9a1b2797fdd59ee;hb=93fb5e930101a3bbb317e6180fc598f56b43cb9c#l67

line 67 (right at the end).  there's a function, "rdflags", which is there for
all soc.fu.*.pipe_data.py and it will be called by the Decoder and passed the
data structure (Decode2Execute1Type) which has all of the information, from
the actual instruction, about which register file ports (rd.req[0-N]) should
be activated.

it has to be done from the Decoder because only the Decoder (PowerDecode2) has
the relevant information.

but... what PowerDecode2 *doesn't* know is the order of the regfile ports
at the Function Unit... so... the function "rdflags" defines it.
Comment 47 Luke Kenneth Casson Leighton 2020-06-01 23:04:40 BST
I've updated the Documentation, here:

https://libre-soc.org/3d_gpu/architecture/decoder/
Comment 48 Luke Kenneth Casson Leighton 2020-06-03 00:06:41 BST
hi cesar i added a hack-test for when wrmak is zero.  this basically says that when the ALU says that is done, if there isno data to write then the MCU write-request phase is also done.

the logic becomes pretty similar to what is in LDSTCompUnit, which for ST does not write a reg, either.
Comment 49 Cesar Strauss 2020-06-06 21:33:53 BST
I'm advancing on the ALU CompUnit parallel unit test. Having gone past the rd rel/go interface, I've moved on to the ALU input interface.

The ALU input interface uses a valid/ready protocol. I assume that this, like rel/go, is similar to the wishbone interface. As such:

1) The port is idle when valid_o and ready_i are both zero.
2) If valid_o is high but ready_i is low, we have a wait state.
3) When both are high, we have a data transfer, and the transaction ends.
4) It is illegal to keep ready_i high when valid_o is low.

As a diagram:

clk      1 2 3 4 5 6
data_i   . D D D D .
valid_i  . 1 1 1 1 .
ready_o  . . . . 1 .
latched  . . . . D .

Back to back transactions are also permitted:

clk      1 2 3 4 5 6
data_i   . A A B C .
valid_i  . 1 1 1 1 .
ready_o  . . 1 1 1 .
latched  . . A B C .

Well, while tracing the ALU input handshake, in the parallel unit test, in src/soc/experiment/test/test_compalu_multi.py, I found this:

          clk     1 2 3 4 5 6
CompUnit: valid_i . . 1 1 . .
ALU:      ready_o . . . 1 1 .
                          ^
                          trouble

On cycle 5, ready_o is still high while valid_i is low. This can generate trouble if ready_o is wired-or with other modules.

Worse yet, on the src/soc/experiment/alu_hier.py unit test:

valid_i . . 1 1 1 1 1 1 . .
ready_o . . . 1 1 1 1 1 1 .

This is supposed to cover a single transaction, but in my undestanding I see 5 acknolowdged back-to-back transactions followed by a illegal state.

Can I go ahead and fix all these cases, to be more like wishbone? Or, is my assumption of wishbone likeness wrong?
Comment 50 Luke Kenneth Casson Leighton 2020-06-06 21:53:05 BST
(In reply to Cesar Strauss from comment #49)
> I'm advancing on the ALU CompUnit parallel unit test. Having gone past the
> rd rel/go interface, I've moved on to the ALU input interface.
> 
> The ALU input interface uses a valid/ready protocol. I assume that this,
> like rel/go, is similar to the wishbone interface.

standard wishbone / AXI4 yes.  the difference being that rel/go will NEVER be back to back (multiple transactions one after the other) however ready/valid can be.

however in the MultiCompUnits it is never used for that, because MCU is single transaction only.

> As such:
> 
> 1) The port is idle when valid_o and ready_i are both zero.
> 2) If valid_o is high but ready_i is low, we have a wait state.
> 3) When both are high, we have a data transfer, and the transaction ends.

or, takes place on that clock cycle, abd ends "as far as the next clock cycle is concerned".

> 4) It is illegal to keep ready_i high when valid_o is low.

> This is supposed to cover a single transaction, but in my undestanding I see
> 5 acknolowdged back-to-back transactions followed by a illegal state.

this does not surprise me.  the alu_hier.py ALUs were written "in a hurry" and are there purely for test purposes only.


> Can I go ahead and fix all these cases, to be more like wishbone? Or, is my
> assumption of wishbone likeness wrong?

yes please do...

or...

last week i considered throwing them out and using nmutil.singlepipe ALUs instead.

if you feel it does not take a long time to "fix" alu_hier.py ALU classes please do go ahead.

otherwise if it will take a long time it would be better to drop them and replace them with ALUs based on e.g. TestBufPipe (see nmutil unit tests)

again: they are only there for test purposes, however as tests it is indeed good for them to properly conform to the required API
Comment 51 Cesar Strauss 2020-06-06 22:20:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)
> if you feel it does not take a long time to "fix" alu_hier.py ALU classes
> please do go ahead.

One more thing. These test ALUs are non-pipelined, right? I mean, they take a finite number of cycles to sequence through their stages, but are not able to accept a new operation, each cycle, meanwhile, right?

In this case, I think setting ready_o combinatorially, based on valid_i being high, and the sequence counter being zero, is enough. This will even keep ready_o from activating if valid_i is high and the ALU is still busy. Will check if it works.
Comment 52 Luke Kenneth Casson Leighton 2020-06-06 23:38:35 BST
(In reply to Cesar Strauss from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > if you feel it does not take a long time to "fix" alu_hier.py ALU classes
> > please do go ahead.
> 
> One more thing. These test ALUs are non-pipelined, right? I mean, they take
> a finite number of cycles to sequence through their stages, but are not able
> to accept a new operation, each cycle, meanwhile, right?

correct.  this is to simulate FSMs (such as a DIV operation) and also to provide a way to check that the MultiCompUnits work properly when there is a delay.

it is a SEVERE violation of the entire MultiCompUnit API for a MCU to expect to process (generate) more than one result at a time.

this because that implies that it is "abandoning" a result, in direct violation of its role and responsibility to monitor the result through to completion.

the way to use pipelines: for an N Stage Pipeline, have a minimum of N MultiCompUnits with a fan-in and fan-out.  this is what the class ReservationStations is for.

the whole purpose of writing this parallel test is to be able to work towards the scenario of doing exactly that: have Nx MultiCompUnits throw 100-200 results at a pipeline (a "proper" N-stage pipeline) and check that the results are consistent.

however that is for later.

> In this case, I think setting ready_o combinatorially, based on valid_i
> being high, and the sequence counter being zero, is enough. This will even
> keep ready_o from activating if valid_i is high and the ALU is still busy.
> Will check if it works.

go for it.
Comment 53 Cesar Strauss 2020-06-07 21:35:32 BST
commit 71a791cf5f05bd5858fdc60f4ccdca936edc1d75
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Sun Jun 7 16:32:11 2020 -0300

    Make the test ALU conform to the valid/ready protocol
    
    Adjust the test case accordingly.

It was definitively not a one-line change. But it was fun.

Select any other operation than MUL, ADD, or SHR, and it will do a combinatorial subtraction, with zero-delay and no registering.

The interesting thing about zero-delay is that, due to the absence of an internal register, it will not release the input data until the result is transferred out.

Basically, the handshake signals of the input and output ports are directly connected between them, only the data is modified between the ports.

Nice change, working with RTL again. Will go back to the parallel unit tests, now.
Comment 54 Luke Kenneth Casson Leighton 2020-06-07 23:55:38 BST
(In reply to Cesar Strauss from comment #53)
> commit 71a791cf5f05bd5858fdc60f4ccdca936edc1d75
> Author: Cesar Strauss <cestrauss@gmail.com>
> Date:   Sun Jun 7 16:32:11 2020 -0300
> 
>     Make the test ALU conform to the valid/ready protocol
>     
>     Adjust the test case accordingly.
> 
> It was definitively not a one-line change. But it was fun.

oh good! :)

> Select any other operation than MUL, ADD, or SHR, and it will do a
> combinatorial subtraction, with zero-delay and no registering.
> 
> The interesting thing about zero-delay is that, due to the absence of an
> internal register, it will not release the input data until the result is
> transferred out.

oh wait... that's a *combinatorial* ALU?  iinteresting.  i wonder
what that could be used for.

hmmm i tried putting OP_NOP in there, and both compunit1 and parallel
locked up.  this tends to indicate a combinatorial loop.  any clues?

 
> Basically, the handshake signals of the input and output ports are directly
> connected between them, only the data is modified between the ports.
> 
> Nice change, working with RTL again. Will go back to the parallel unit
> tests, now.

:)
Comment 55 Cesar Strauss 2020-06-08 02:20:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #54)
> oh wait... that's a *combinatorial* ALU?  iinteresting.  i wonder
> what that could be used for.

Well, the ALU output is already latched in MultiCompUnit. Maybe having another output register, inside the ALU itself, is redundant?

> hmmm i tried putting OP_NOP in there, and both compunit1 and parallel
> locked up.  this tends to indicate a combinatorial loop.  any clues?

I see. I'm looking into it.
Comment 56 Luke Kenneth Casson Leighton 2020-06-08 02:28:51 BST
(In reply to Cesar Strauss from comment #55)
> (In reply to Luke Kenneth Casson Leighton from comment #54)
> > oh wait... that's a *combinatorial* ALU?  iinteresting.  i wonder
> > what that could be used for.
> 
> Well, the ALU output is already latched in MultiCompUnit. Maybe having
> another output register, inside the ALU itself, is redundant?

ah right yes i understand, now.

yes, absolutely.  the only reason for the latch, inside the test ALU, was to simulate the concept of a pipeline stage.

however you are absolutely right: with both the incoming src regs being latched, and the outgoing result being latched also by MultiCompUnit, it is not necessary ro latch inside alu_hier.ALU as well.


> 
> > hmmm i tried putting OP_NOP in there, and both compunit1 and parallel
> > locked up.  this tends to indicate a combinatorial loop.  any clues?
> 
> I see. I'm looking into it.

it may be because "now" is set at the default action.  i have yet to work out how to detect combinatorial loops.  there may be something that can be done using yosys synth, michael showed me something a couple weeks ago.  ltp command?
Comment 57 Cesar Strauss 2020-06-09 13:36:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Cesar Strauss from comment #55)
> > (In reply to Luke Kenneth Casson Leighton from comment #54)
> > > hmmm i tried putting OP_NOP in there, and both compunit1 and parallel
> > > locked up.  this tends to indicate a combinatorial loop.  any clues?
> > 
> > I see. I'm looking into it.
> 
> it may be because "now" is set at the default action.  i have yet to work
> out how to detect combinatorial loops.  there may be something that can be
> done using yosys synth, michael showed me something a couple weeks ago.  ltp
> command?

Found it manually, staring from valid_o, and changing comb to sync along the way. Interesting debug experience.

I don't really recall being much afflicted by this sort of issue. But then, I normally try to avoid using latches as much as possible, this may be the reason.

commit c8c13adf3da9a6ca8b7765edc597ec7fdf19e8d8
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Tue Jun 9 08:57:31 2020 -0300

    Avoid a combinatorial loop on valid_o
    
    The path was:
    
    all_rd (1) -> all_rd_pulse (1) -> alui_l.s (1) ->
    -> alu.p.valid_i (1) -> ALU (zero-delay) -> alu.n.valid_o (1) ->
    -> rok_l.r (1) -> all_rd (0)
    
    Decided to break the loop on the reset of the read-done, write proceed
    latch (rok_l.r), with no ill effects on performance.
    
    Added a test case for this, using the zero-delay ALU (OP_NOP).
Comment 58 Luke Kenneth Casson Leighton 2020-06-09 13:43:36 BST
(In reply to Cesar Strauss from comment #57)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > it may be because "now" is set at the default action.  i have yet to work
> > out how to detect combinatorial loops.  there may be something that can be
> > done using yosys synth, michael showed me something a couple weeks ago.  ltp
> > command?
> 
> Found it manually, staring from valid_o, and changing comb to sync along the
> way. Interesting debug experience.

i've done that before :)

> I don't really recall being much afflicted by this sort of issue. But then,
> I normally try to avoid using latches as much as possible, this may be the
> reason.

yyeah they're not a "normal" way to do FSMs.  however with some aspects being
combinatorial and some being sync, we can't use "standard" techniques.

nicely done.  test_core.py still works, which is important.
Comment 59 Cesar Strauss 2020-06-15 00:21:04 BST
Created attachment 64 [details]
Suggested pipeline design for the ALU Comp Unit

(In reply to Luke Kenneth Casson Leighton from comment #58)
> (In reply to Cesar Strauss from comment #57)
> > I don't really recall being much afflicted by this sort of issue. But then,
> > I normally try to avoid using latches as much as possible, this may be the
> > reason.
> 
> yyeah they're not a "normal" way to do FSMs.  however with some aspects being
> combinatorial and some being sync, we can't use "standard" techniques.

Did you consider implementing the ALU CompUnit as a pipeline?

I have put some thought into it. See the attached diagram.

First, consider that rel / go protocol seems to be compatible with the valid / ready protocol, by connecting n.ready_o->rd.rel_i and rd.go_o->n.valid_i. Likewise on the dest port. A pass-thru stage helps, I think.

In the idle state, the switches will be set as shown, disconnecting the pipeline from the input and output ports, connecting them instead to a full sink and empty source respectively. So, there can be no activity on these ports.

At issue time, you transition the dest switch from straight to crossed. Also, in each parallel arm, choose either one of the src or imm switches to cross.

The barrier / combiner joins the input data, and outputs n.valid_o only when both p.valid_o are active. And then, it copies n.ready_i to both p.ready_i. This combiner takes care to inform the ALU that all its operands input are valid.

The ALU outputs directly to the dest port. holding the data itself, according to the protocol, until the dest port can accept it.

Notice the little need for FSMs and latches in the design. The total complexity is divided in manageable pieces, tucked inside the pipeline stages.

If all stages work in isolation, we can be confident that the design will work when put together.

Does it make sense?

I think it is worth investing the time to implement this. If it turns up that it works, we can be more confident on the design. Even apply the same principle to LDSTCompUnit.

I can give it a try, in parallel with other tasks. It would be a nice way to learn more about the pipeline API.
Comment 60 Luke Kenneth Casson Leighton 2020-06-15 00:42:40 BST
(In reply to Cesar Strauss from comment #59)
> Created attachment 64 [details]
> Suggested pipeline design for the ALU Comp Unit
> 
> (In reply to Luke Kenneth Casson Leighton from comment #58)
> > (In reply to Cesar Strauss from comment #57)
> > > I don't really recall being much afflicted by this sort of issue. But then,
> > > I normally try to avoid using latches as much as possible, this may be the
> > > reason.
> > 
> > yyeah they're not a "normal" way to do FSMs.  however with some aspects being
> > combinatorial and some being sync, we can't use "standard" techniques.
> 
> Did you consider implementing the ALU CompUnit as a pipeline?

you mean, MultiCompUnit itself?  if so: this would violate the mission-critical
protocol, the purpose for which MultiCompUnit exists.

MultiCompUnit *must not* abandon its data.  its absolutely must track the
operands from start to finish.

we call this a "Phase-aware Function Unit".  "phase-aware" because it absolutely
has to keep track of the start time and completion time.
failure to do so would be absolutely disastrous.

a pipeline *abandons* its data.  it says "here, you deal with it, i'm abdicating
all responsibility for data-tracking".  i.e. once the data is in the pipeline,
there is *no hardware* in the actual pipeline itself which tracks or handles
responsibility for tracking the completion of that result.

this was the subject of a rather excruciating discussion on comp.arch: it became
clear that there simply does not exist an Industry-standard or Computer Science
term that covers "Function Units which track data from operand to result".

one had to be invented, and the best that could be thought of was "Phase-aware
Function Unit"

for an out-of-order system, it is absolutely imperative that we have
Phase-aware Function Units, and that "management" task is what MultiCompUnit
is for.

a pipeline *cannot* fulfil that role.  by definition and by design.

however.... MultiCompUnit can *MANAGE* pipelines... by using the ready/valid
protocol.

and if we have multiple MultiCompUnit front-ends onto a single pipeline,
*now* we have Reservation Stations.

and that's what the ReservationStations class is for:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/concurrentunit.py;hb=HEAD#l40


> I can give it a try, in parallel with other tasks. It would be a nice way to
> learn more about the pipeline API.

actually we do need a ReservationStation version of a MultiCompUnit unit test.
a 2 or 3 stage "real" but dummy pipeline (simply passing its input through to
its output, maybe doing +1 on each stage or something, or, heck, doing ADD
would be perfectly on)

then there would be 4 to 5 MultiCompUnits wired up to a ReservationStation version
of that pipeline (see nmutil multipipe.py)

the unit test would then be an adaptation of the test_inout_mux_pipe.py test,
throwing multiple inputs in and seeing if they come out successfully at the
end.

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;hb=HEAD

there's some other code i can refer you to, if you're interested
(and we should raise a new bugreport for it)  bug #385
Comment 61 Cesar Strauss 2020-06-15 11:06:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #60)
> (In reply to Cesar Strauss from comment #59)
> > Did you consider implementing the ALU CompUnit as a pipeline?
> 
> you mean, MultiCompUnit itself?  if so: this would violate the
> mission-critical
> protocol, the purpose for which MultiCompUnit exists.

I understand that MultiCompUnit is sequential by contract, and that allowing multiple operations in flight is a violation of that protocol.

That was not my idea when I conceived using the pipeline API in the design of MultiCompUnit.

The idea was to use the pipeline API to enforce sequential operation (eliminating all complex FSMs), and at the same time enforcing the ready/valid and rel/go protocols for free.

Is started when I suspected that the rel/go protocol of the src and dest ports maybe were compatible with the valid/ready protocol of the ALU.

I was then that I conceived of the src and dest ports being joined to the ALU using the pipeline API.

The only thing, is making sure that only one operation is in flight in the pipeline at a any time. A simple FSM would take care of that, I think.

But then, I would be crippling the power of the pipeline concept, which does seems like a waste.

Oh well, at least I conceived an example of a parallel, dynamically reconfigurable pipeline, with new elements like cross-switches and barriers/combinators, that are maybe useful in other contexts.
Comment 62 Luke Kenneth Casson Leighton 2020-06-15 12:58:13 BST
(In reply to Cesar Strauss from comment #61)
> (In reply to Luke Kenneth Casson Leighton from comment #60)
> > (In reply to Cesar Strauss from comment #59)
> > > Did you consider implementing the ALU CompUnit as a pipeline?
> > 
> > you mean, MultiCompUnit itself?  if so: this would violate the
> > mission-critical
> > protocol, the purpose for which MultiCompUnit exists.
> 
> I understand that MultiCompUnit is sequential by contract, and that allowing
> multiple operations in flight is a violation of that protocol.
> 
> That was not my idea when I conceived using the pipeline API in the design
> of MultiCompUnit.

ah?

> The idea was to use the pipeline API to enforce sequential operation
> (eliminating all complex FSMs), and at the same time enforcing the
> ready/valid and rel/go protocols for free.

ahh okaaay now i get it.  sorry: comparison-and-difference was what it
took for me to understand.
 
> Is started when I suspected that the rel/go protocol of the src and dest
> ports maybe were compatible with the valid/ready protocol of the ALU.

yes this is extremely likely.
 
> I was then that I conceived of the src and dest ports being joined to the
> ALU using the pipeline API.
> 
> The only thing, is making sure that only one operation is in flight in the
> pipeline at a any time. A simple FSM would take care of that, I think.
> 
> But then, I would be crippling the power of the pipeline concept, which does
> seems like a waste.

not quite.  it's that "tracking" which is paramount, everything else is
secondary.
 
> Oh well, at least I conceived an example of a parallel, dynamically
> reconfigurable pipeline, with new elements like cross-switches and
> barriers/combinators, that are maybe useful in other contexts.

:)

what you've come up with is basically a pipeline-compatible "information
collector and distributor", where:

* certain information has to come in from multiple different sources
* processing has to take place which cannot begin until all incoming
  information is received
* processed information has to be distributed out to multiple destinations

this may indeed prove extremely useful, although under some very special
circumstances.

given that the ready/valid protocol is basically Wishbone compatible
it *might* turn out to be the case that it's useful for special Bus
Transactions.

now that i think about it - it was over a year ago - i did sort-of begin
something like this (at least on the receiving side), it was a multi-bit
ready/valid with a funnel.

i didn't exactly abandon it... it just turned out to be complex.

i really like it... however given the time invested into MultiCompUnits,
and the timescales, we really should prioritise, and come back to this
idea later.

can you raise a bugreport with a summary and a link to the attachment?
Comment 63 Cesar Strauss 2020-06-17 23:18:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #62)
> i really like it... however given the time invested into MultiCompUnits,
> and the timescales, we really should prioritise, and come back to this
> idea later.
> 
> can you raise a bugreport with a summary and a link to the attachment?

Done, with deferred status.

https://bugs.libre-soc.org/show_bug.cgi?id=391

Placed there a cleaned up version of the diagram.
Comment 64 Cesar Strauss 2020-06-28 23:04:00 BST
(In reply to Cesar Strauss from comment #49)
> The ALU input interface uses a valid/ready protocol.
> 
> 1) The port is idle when valid_o and ready_i are both zero.
> 2) If valid_o is high but ready_i is low, we have a wait state.
> 3) When both are high, we have a data transfer, and the transaction ends.
> 4) It is illegal to keep ready_i high when valid_o is low.

After going through the Pipeline API, I realized case 4 is not illegal at all.

It should read:

4) If ready_i is high but valid_o is low, we have a wait state, from the sending side.

I changed the test ALU input port to reflect this, by not forcing p.ready_o low when p.valid_i is low.

commit 2b4b3a653806c543a17b5f6e6db2c00d24996210
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Sun Jun 28 18:38:03 2020 -0300

    Let p.ready_o be active while the test ALU is idle
    
    The valid/ready protocol doesn't actually forbid p.ready_o
    being active while p.valid_i is inactive. It just mean that
    the ALU is idle, and is ready to accept new data.
    
    This should help avoiding potential combinatorial loops from
    p.ready_o to p.valid_i.
Comment 65 Cesar Strauss 2020-10-30 22:57:07 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> basically, we need:
> 
> * one function (similar to TestInput.send) which takes care of *one*
>   REQ_READ and GO_RD signalling bit.  it should monitor:
>   self.rd.req[N], then set self.rd.go[N] for *one* cycle (one blank yield)
>   then set it self.rd.go[N] back to zero and confirm (assert) that
>   self.rd.req[N] has gone to zero

commit ff916d3932fc4fe90c9e8d1cc3bfa8c3818dceb9
Author: Cesar Strauss <cestrauss@gmail.com>
Date:   Wed Oct 28 19:57:55 2020 -0300

    Implement an operand producer that talks the rel_o/go_i handshake
    
    It can be instantiated once for each operand port, working in parallel
    with the main test-bench process.

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=ff916d3932fc4fe90c9e8d1cc3bfa8c3818dceb9

> * another function (similar again to TestInput.rcv) which does the same
>   thing for REQ_WRITE and GO_WRITE.
> 
> * another function - a very simple one - which sets issue_i
>   for one cycle (yield in between), then waits for "self.busy" to drop to 0
> 
> we can make it more sophisticated as we go along.
> 
> it should be... only... about... 80 lines of code, at a guess.

This will be next.
Comment 66 Cesar Strauss 2020-11-21 15:06:07 GMT
I've finished the CompUnit parallel test for the Shifter. Next, I'll port the ALU tests, followed by the Load / Store CompUnit. 

It features:

1) Operands can arrive in any order, or simultaneously

Simulated delays are specified per port.

2) Results (if multiple) can also be accepted in any order

3) Results are checked as they are produced

4) Detects duplicated or missing operand requests or results

A running transaction counter is maintained at each port. As busy is negated, all counters must match.

To see:

1) Update nmutil, to get the latest write_gtkw functionality

2) Run:

$ python ~/src/soc/src/soc/experiment/test/test_compalu_multi.py 

$ gtkwave ~/test/soc/test_compunit_fsm1.gtkw 

There are three transactions, and each time the order of arrival of operands varies.
Comment 67 Luke Kenneth Casson Leighton 2020-11-21 17:33:46 GMT
(In reply to Cesar Strauss from comment #66)
> I've finished the CompUnit parallel test for the Shifter. Next, I'll port
> the ALU tests, followed by the Load / Store CompUnit. 

fantastic, Cesar.

out of curiosity, given the subject of this bugreport: is testing of operations that are zero for A included? :)  actually this is more for the PowerDecoder2 (or sub-decoders), now.
Comment 68 Cesar Strauss 2020-11-21 17:50:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #67)
> out of curiosity, given the subject of this bugreport: is testing of
> operations that are zero for A included? :)  actually this is more for the
> PowerDecoder2 (or sub-decoders), now.

Not yet, but it will be.

For simplicity, I began with the Shifter, that doesn't have either "zero A" or immediates. But I really will implement these next, for the ALU test case.
Comment 69 Luke Kenneth Casson Leighton 2020-11-21 19:15:06 GMT
(In reply to Cesar Strauss from comment #68)
> (In reply to Luke Kenneth Casson Leighton from comment #67)
> > out of curiosity, given the subject of this bugreport: is testing of
> > operations that are zero for A included? :)  actually this is more for the
> > PowerDecoder2 (or sub-decoders), now.
> 
> Not yet, but it will be.

fantastic.

> For simplicity, I began with the Shifter, that doesn't have either "zero A"
> or immediates. But I really will implement these next, for the ALU test case.

great to hear, this will confirm the low level

i wonder... although the CompUnit itself needs zero-immediate tests, to make sure it can cope at the low level, really, we also need actual unit tests (test_*_compunit.py) at the level up.

for the low-level that you are doing (which is equally as important), it does not interact with PowerDecoder2.

the Op Subset Record is given to CompUnit *by* PowerDecoder2, and it is PowerDecoder2 which analyses the instruction and identifies whether the register is RA or RA-or-Zero.

that interaction is what also needs testing: a nonzero value placed into the regfile, use RA-or-Zero, have PowerDecoder2 decode that, and confirm that zero was passed as the operand.
Comment 70 Cesar Strauss 2020-11-22 20:15:58 GMT
(In reply to Luke Kenneth Casson Leighton from comment #69)
> that interaction is what also needs testing: a nonzero value placed into the
> regfile, use RA-or-Zero, have PowerDecoder2 decode that, and confirm that
> zero was passed as the operand.

Understood.

Meanwhile, the ALU parallel test case is done, including handling of the zero_a and imm_ok cases.

These operand producers are kind of "dumb": they will happily respond to a request, even if zero_a or imm_ok is asserted. But, they will still increment their transaction counter. The counters are checked at the end, when busy is negated, and the mismatch will be detected.

There is still a DummyALU CompUnit, with three operands, that remains to be ported. Also, I will remove my earlier unfinished attempt.
Comment 71 Cesar Strauss 2020-11-27 22:29:11 GMT
I finished porting the DummyALU test case.

I found a bug in the test case itself. DummyALU is built with CompCROpSubset, but the test case instanced a MultiCompUnit with CompALUOpSubset.

The next step is to add test cases for read and write masks, and shadow/go_die.

To see the current work:
$ python ~/src/soc/src/soc/experiment/test/test_compalu_multi.py 

# MultiCompUnit + FSM Shifter:
$ gtkwave test_compunit_fsm1.gtkw

# MultiCompUnit + ALU:
$ gtkwave test_compunit_regspec1.gtkw

# MultiCompUnit + DummyALU:
$ gtkwave test_compunit_regspec3.gtkw
Comment 72 Luke Kenneth Casson Leighton 2020-11-27 23:00:03 GMT
ah i'm so sorry, that was me trying to reduce the signals and codesize of examples because CR subset is much smaller, sorry! not properly documented.
Comment 73 Cesar Strauss 2020-12-31 17:58:37 GMT
(In reply to Cesar Strauss from comment #71)
> The next step is to add test cases for read and write masks, and
> shadow/go_die.

The ALU CompUnit unit test now checks that masked-out input and output ports do not make requests. Individual test cases can specify whether any port is masked-out.

The next step is to actually invent some new operations for the test ALU, that do not make use of some of the read ports, and maybe add some write ports to it as well.
Comment 74 Luke Kenneth Casson Leighton 2020-12-31 18:09:04 GMT
(In reply to Cesar Strauss from comment #73)
> (In reply to Cesar Strauss from comment #71)
> > The next step is to add test cases for read and write masks, and
> > shadow/go_die.
> 
> The ALU CompUnit unit test now checks that masked-out input and output ports
> do not make requests. Individual test cases can specify whether any port is
> masked-out.

fantastic.

> The next step is to actually invent some new operations for the test ALU,
> that do not make use of some of the read ports,

the simplest one there is just bit-inversion (o = ~a).  or, perhaps, add 1?

> and maybe add some write ports to it as well.

hmm i can't think of anything that would be needed to be very difficult, it only needs to be different and detectable.  how about add 1 for port 1 and subtract 1 for port 2?
Comment 75 Cesar Strauss 2021-01-02 13:20:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #74)
> (In reply to Cesar Strauss from comment #73)
> > (In reply to Cesar Strauss from comment #71)
> > The next step is to actually invent some new operations for the test ALU,
> > that do not make use of some of the read ports,
> 
> the simplest one there is just bit-inversion (o = ~a).  or, perhaps, add 1?

Thanks, but I ended up going with sign extension, there was already a MicrOp Enum allocated to it. I also added a NOP, which does not use any read or write ports.

> > and maybe add some write ports to it as well.
> 
> hmm i can't think of anything that would be needed to be very difficult, it
> only needs to be different and detectable.  how about add 1 for port 1 and
> subtract 1 for port 2?

Thanks, but I ended up implementing a simple condition register output port, for all existing instructions, conditioned on the rc bit which was already available on CompALUOpSubset.

So, there are now test cases that include all combination of masked read and write ports.

Some observations:

1) It seems that MultiCompUnit depends on the ALU *.ok output fields being held valid long enough into the write phase. In the test ALU, I tried holding *.ok valid just for the duration of the ALU valid_o, and it didn't work. I ended up instead directly decoding the input operation, combinatorially, into the *.ok, so they do stay valid during the entire instruction cycle.

2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into a Signal, truncating it. This works, but relies on the "ok" field being the last field.

I'm now moving on to the final task, of testing shadown/go_die.
Comment 76 Luke Kenneth Casson Leighton 2021-01-02 14:41:17 GMT
(In reply to Cesar Strauss from comment #75)

> > subtract 1 for port 2?
> 
> Thanks, but I ended up implementing a simple condition register output port,
> for all existing instructions, conditioned on the rc bit which was already
> available on CompALUOpSubset.

both great, it sounds like this was easier.  fantastic.
 
> So, there are now test cases that include all combination of masked read and
> write ports.
> 
> Some observations:
> 
> 1) It seems that MultiCompUnit depends on the ALU *.ok output fields being
> held valid long enough into the write phase. 

yes, very important.  they are the "data is valid and needs to be written" signals.

they are needed because some ALUs will decide *themselves* if the data is to be written (XER.so for example is done this way.  you don't know if it will be modified)

i.e. if the "ok" flag is False then the write.req must definitely not be raised


> In the test ALU, I tried
> holding *.ok valid just for the duration of the ALU valid_o, and it didn't
> work. I ended up instead directly decoding the input operation,
> combinatorially, into the *.ok, so they do stay valid during the entire
> instruction cycle.
> 
> 2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the
> *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into
> a Signal, truncating it.


ah, right: if you find any of those... yyyees, strictly speaking this should not happen: there is supposed to be detection of the data/ok and to create a suitable Record, matching it.  i may have missed some of those and used Signal instead.

at least however nmigen will cope fine by flattening and correctly transferring the data.


> This works, but relies on the "ok" field being the
> last field.

that definitely needs documenting.


> 
> I'm now moving on to the final task, of testing shadown/go_die.

star.  this one is important for the predication, and when we get to SIMD.
Comment 77 Cesar Strauss 2021-01-02 15:03:47 GMT
(In reply to Luke Kenneth Casson Leighton from comment #76)
> (In reply to Cesar Strauss from comment #75)
> > 2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the
> > *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into
> > a Signal, truncating it.
> 
> 
> ah, right: if you find any of those... yyyees, strictly speaking this should
> not happen: there is supposed to be detection of the data/ok and to create a
> suitable Record, matching it.  i may have missed some of those and used
> Signal instead.

Sure.
Here, you see data_r is a Record:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l268

A little below (line 281) it is appended to "drl".

Then, drl[i] is placed in dest[i]. which is a Signal:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l356

The issue is actually mentioned in a comment, when dest[i] is created:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l83
Comment 78 Luke Kenneth Casson Leighton 2021-01-02 15:30:39 GMT
(In reply to Cesar Strauss from comment #77)

> The issue is actually mentioned in a comment, when dest[i] is created:
> 
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/

yep, coming back to me.  been a while.  yes, back where lro=get_out() is used, strictly speaking the same thing should be done.

however with those being internal Signals that get copied into the Regspec outputs, restoring the .ok at that point, we get away with it :)