305 – Create Pipelined ALU similar to alu_hier.py

Bug 305 - Create Pipelined ALU similar to alu_hier.py

Summary: Create Pipelined ALU similar to alu_hier.py

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code (show other bugs)
Version:	unspecified
Hardware:	PC Windows

Importance:	--- enhancement
Assignee:	Michael Nolan

URL:

Depends on:	319 361 306 334 337 358 360 369
Blocks:	341 383
	Show dependency tree / graph

Reported:	2020-05-08 16:17 BST by Michael Nolan
Modified:	2021-11-29 21:40 GMT (History)
CC List:	3 users (show)

See Also:	310 313 314 315 323 324 325 328 330 332 338 339 461
NLnet milestone:	NLNet.2019.10.043.Wishbone
total budget (EUR) for completion of task and all subtasks:	400
budget (EUR) for this task, excluding subtasks' budget:	400
parent task for budget allocation:	383
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	lkcl = { amount = 200, paid = 2020-08-21 } donated = { amount = 200, submitted = 2021-04-24, paid = 2021-05-01 }

Attachments
scoreboard mechanics image (16.19 KB, image/png) 2020-05-08 16:29 BST, Luke Kenneth Casson Leighton	Details
ALU main stage (134.16 KB, image/png) 2020-05-20 15:01 BST, Luke Kenneth Casson Leighton	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Michael Nolan 2020-05-08 16:17:00 BST

From discussion with Luke:

an actual pipelined ALU, that expands on the alu_hier.py concept and
adds more functions - however it also needs to use PartitionedSignal.
this can default to "all off" initially.

Additionally, make certain to use the ReservationStations base class and test
infrastructure (runfp) exactly as i described in here:

https://bugs.libre-soc.org/show_bug.cgi?id=208#c55

*all* ALUs need to have that common interface (multiple
inputs-outputs).  if you can, use runfp - if there's any assumptions
about the format being FP numbers, add an extra parameter which does
debug printing as hexadecimal.  i *think* it should do that anyway
because i used the same runfp test code for FP-to-INT and INT-to-FP

Comment 1 Luke Kenneth Casson Leighton 2020-05-08 16:28:21 BST

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006371.html

Comment 2 Michael Nolan 2020-05-08 16:28:42 BST

From email:

I noticed alu_hier was nicely broken up into different function units (adder, branch unit, multiplier, etc.). Is there a nice way you know of to do that with the nmutil pipeline stages? Like a way to merge stage 1 of the adder pipe with stage one of the branch pipe to get the stage 1 of the alu?

Comment 3 Michael Nolan 2020-05-08 16:29:05 BST

Luke's Reply:

On Fri, May 8, 2020 at 3:03 PM Michael N <mtnolan2640@gmail.com> wrote:

>> On May 8, 2020, at 9:28 AM, Michael Nolan <mtnolan2640@gmail.com> wrote:
>>
>> Now that I have branches working in the simulator, I was thinking of starting work on the integer ALU. Taking a look at alu_hier.py, I think I'll keep the interface but replace the implementation with a pipelined one.
>>
>> --Michael
>>
> I noticed alu_hier was nicely broken up into different function units (adder, branch unit, multiplier, etc.).

yes, with an "op" - which in our case will basically be "decoder.InternalOp"

> Is there a nice way you know of to do that with the nmutil pipeline stages?

yyyess.... except it's too complicated to explain right just this
minute.  raise the question in the bugreport.

> Like a way to merge stage 1 of the adder pipe with stage one of the branch pipe to get the stage 1 of the alu?

ok if you are talking about *outside* of ReservationStations, you are
mixing "purposes".  the Branch pipeline *does not* interact - at all -
with the ALU pipeline.

can you look again at Mitch's ScoreBoard Mechanics1.pdf

damnit this needs to be on the bugreport, i haven't got time, 10m until meeting

see page 8, diagram at the top.

Those 4 batches of Function Units are *completely separate*.  they do
*NOT* interact.

basically, the first ALU, do it entirely as combinatorial just as in
alu_hier.py and assume only one clock cycle.  leave out MUL and DIV
because those i believe need to be minimum 2 stages for MUL and a
*lot* for DIV.

actually you know what, screw it: make separate ones for MUL and DIV,
pretty much exactly like in that diagram on p8.  we're doing a
radically different design as a proof-of-concept.

* add sub neg compare logical shift all go in their own
Reservation-Station-style FunctionUnit
* MUL in its own separate RS/FU
* DIV likewise

l.

Comment 4 Luke Kenneth Casson Leighton 2020-05-08 16:29:35 BST

Created attachment 52 [details]
scoreboard mechanics image

michael can you drop this onto the wiki scoreboard page

Comment 5 Michael Nolan 2020-05-08 16:32:54 BST

(In reply to Michael Nolan from comment #3)

> 
> yes, with an "op" - which in our case will basically be "decoder.InternalOp"
> 

> 
> damnit this needs to be on the bugreport, i haven't got time, 10m until
> meeting
> 
> see page 8, diagram at the top.
> 
> Those 4 batches of Function Units are *completely separate*.  they do
> *NOT* interact.
Oh ok. So essentially we will have *separate* ports for the addition pipe, the multiplication pipe, the branch pipe, and instructions can be issued to each in parallel. 


> actually you know what, screw it: make separate ones for MUL and DIV,
> pretty much exactly like in that diagram on p8.  we're doing a
> radically different design as a proof-of-concept.
> 
> * add sub neg compare logical shift all go in their own
> Reservation-Station-style FunctionUnit
> * MUL in its own separate RS/FU
> * DIV likewise
yeah looks like it

Comment 6 Luke Kenneth Casson Leighton 2020-05-08 16:35:47 BST

src/soc/alu/pipe_data.py

will need to pass in everything and pass out everything.  carry-in as
well as carry-out.

the pipelines have to take *everything* that they need to produce a
result: they are not permitted to access global resources such as
regfiles.

this is why we have different FunctionUnits with totally different
specs() because for Branch it will take CTR and LR *as input*, where
for ALU it will take CR0 and a and b and so on

and for the MUL FU it will take about *four* possibly 5 inputs:
RA RB RC (for MAC) *and* carry-in *and* CR0 etc. etc.

Comment 7 Luke Kenneth Casson Leighton 2020-05-08 16:40:50 BST

(In reply to Michael Nolan from comment #5)

> > Those 4 batches of Function Units are *completely separate*.  they do
> > *NOT* interact.
> Oh ok. So essentially we will have *separate* ports for the addition pipe,
> the multiplication pipe, the branch pipe, and instructions can be issued to
> each in parallel. 

yeeees :)

now you're getting it.

the DMs preserve the register-usage order and manage the regfile Buses.

the only "stalling" that's done is when the Issue stage cannot find a free
ReservationStation row to reserve for the instruction.

the output from the pipeline goes into the RS row "output" latch and sits
there until the FunctionUnit says "DMs say we have a free and clear path
to write to the regfile: go... like... go NOW".

the more regfile ports you have, the more results get written in parallel.

likewise for regfile reads for getting the operands *into* the RS Row(s).

Comment 8 Luke Kenneth Casson Leighton 2020-05-08 17:48:11 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> Created attachment 52 [details]
> scoreboard mechanics image
> 
> michael can you drop this onto the wiki scoreboard page

thank you.  meeting still not happened yet, waiting for a call, did
a quick section and also added the corresponding write diagram
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/?updated

so just to emphasise, michael: for ALUInitialData it will need:

+class ALUInitialData(IntegerData):
+    def __init__(self, pspec):
+        super().__init__(pspec)
+        self.a = Signal(64, reset_less=True)
+        self.b = Signal(64, reset_less=True)

plus XER carry-in as an input operand and also the opcode itself.
although... that's going to be in ctx, as opkls?  hmm... see separate
comment

and a *separate* output class (not the same as ALUInitialData)
will be needed with:

* output result (RT)
* CA CA32
* CR0
* SO OV OV32

Comment 9 Luke Kenneth Casson Leighton 2020-05-08 17:54:32 BST

there's a bit of a nuisance... no there isn't, you're good.
FPPipeContext contains the muxid (which the ReservationStation
uses as an array index).  you got it right by passing in
CompAlUOpSubset as the opkls.

Comment 10 Michael Nolan 2020-05-08 18:04:07 BST

Does it make sense to have the ALU inputs a and b be PartitionedSignals?

Comment 11 Luke Kenneth Casson Leighton 2020-05-08 18:10:44 BST

(In reply to Michael Nolan from comment #10)
> Does it make sense to have the ALU inputs a and b be PartitionedSignals?

later.  not right now.  if we have time we convert them to PartitionedSignals.

Comment 12 Luke Kenneth Casson Leighton 2020-05-08 19:38:56 BST

(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Michael Nolan from comment #10)
> > Does it make sense to have the ALU inputs a and b be PartitionedSignals?
> 
> later.  not right now.  if we have time we convert them to
> PartitionedSignals.

i know :)  we did all that work and we need to focus first on a
scalar version.  it means not needing to fuss about parallel-masked
decisions and so on, which i am expecting shouuuld make development
time faster.

Comment 13 Luke Kenneth Casson Leighton 2020-05-09 00:08:27 BST

this:

        # If there's an immediate, set the B operand to that
        with m.If(self.i.ctx.op.imm_data.imm_ok):
            comb += self.o.b.eq(self.i.ctx.op.imm_data.imm)
        with m.Else():
            comb += self.o.b.eq(self.i.b)

...is already done by the Computation Unit:

        # select immediate if opcode says so.  however also change the latch
        # to trigger *from* the opcode latch instead.
        op_is_imm = oper_r.imm_data.imm_ok
        src2_or_imm = Signal(self.rwid, reset_less=True)
        src_sel = Signal(reset_less=True)
        m.d.comb += src_sel.eq(Mux(op_is_imm, opc_l.q, src_l.q[1]))
        m.d.comb += src2_or_imm.eq(Mux(op_is_imm, oper_r.imm_data.imm,
                                                  self.src2_i))

there is *not* supposed to be a relationship between "register numbers"
and "Function Unit Operands".  Function Units are not permitted to know
about registers, at all (however by a coincidence we happen to be
preserving the order, RA = operand1, RB = operand2 etc).

however... i quite like that it's possible to test against the actual
decoded instructions, and that makes me think that maybe yes, the
FunctionUnit should *not* be decoding the immediate, just pass it
through.

otherwise to achieve the same unit test coverage it would be necessary
to have the immediate decoding in the unit test.

i will go ahead and modify the compalu_multi.py to remove immediate
decoding/selection from here:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=f9cb83ea13184bb6ff9ceedb7627b7ea67295c55;hb=HEAD#l166


----


here, it receives data in ALUInputData format (which is correct)

class ALUInputStage(PipeModBase):
    def __init__(self, pspec):
        super().__init__(pspec, "input")

    def ispec(self):
        return ALUInputData(self.pspec)

    def ospec(self):
        return ALUInputData(self.pspec)


however the output spec is a *modified* copy of the same data structure.

with "carry_in" being *generated* by ALUInputStage, ALUInputData should
*not* have "carry_in" as a field.

the reason is because it will never be set by the *user* of this stage,
and consequently a dangling wire will end up being created.

i've created a class ALUFirstInputData (bad name i know) which should
be used instead.  


-----

what's "so" intended for?

Comment 14 Michael Nolan 2020-05-09 14:13:41 BST

(In reply to Luke Kenneth Casson Leighton from comment #13)

> with "carry_in" being *generated* by ALUInputStage, ALUInputData should
> *not* have "carry_in" as a field.

We still need a carry_in from XER for things like add with carry. 

> the reason is because it will never be set by the *user* of this stage,
> and consequently a dangling wire will end up being created.
> -----
> 
> what's "so" intended for?

Summary Overflow.

Comment 15 Luke Kenneth Casson Leighton 2020-05-09 14:37:42 BST

(In reply to Michael Nolan from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> 
> > with "carry_in" being *generated* by ALUInputStage, ALUInputData should
> > *not* have "carry_in" as a field.
> 
> We still need a carry_in from XER for things like add with carry. 

ah doh, yes you are absolutely right.

hmm hmmm sooo....

actually.... the default action from the if elif batch is to copy that carry_in from the input spec over to the output spec.


> 
> > the reason is because it will never be set by the *user* of this stage,
> > and consequently a dangling wire will end up being created.
> > -----
> > 
> > what's "so" intended for?
> 
> Summary Overflow.

okaay.  so if it's a generated output, set only as a result as part of the operation (in the 2nd or 3rd stage), it should *not* be in the ispec(), only in the ospec() for later stage(s)

basically:

* if something is needed in the stage it must be added to the ispec()

* if something is produced *by* the stage ot must be added to the ospec()

* if something is needed by later stages, the only way they will get it is by "passing through" the current stage and consequently it has to be added to *both* ispec and ospec.

* if a final output is generated by an early stage, then clearly the only way to get it all the way to the ends is, again, by "passing through" to later stages and consequently putting it into both ispec and ospec.

summary-out sounds like it is output-only (3rd stage?). 
 carry-out likewise (from 2nd stage)

carry-in you are right is needed as input *and* output on *this* (first) stage, where in later stages it will not be.

a and b and op are all "passthru" for this 1st stage. however a and b are input only or the 2nd stage, but you *might* need some of the information from op for the 3rd stage.

welcome to pipelines :)

Comment 16 Michael Nolan 2020-05-09 14:41:48 BST

(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Michael Nolan from comment #14)
> > (In reply to Luke Kenneth Casson Leighton from comment #13)
> > 
> > > with "carry_in" being *generated* by ALUInputStage, ALUInputData should
> > > *not* have "carry_in" as a field.
> > 
> > We still need a carry_in from XER for things like add with carry. 
> 
> ah doh, yes you are absolutely right.
> 
> hmm hmmm sooo....
> 
> actually.... the default action from the if elif batch is to copy that
> carry_in from the input spec over to the output spec.
> 
> 
> > 
> > > the reason is because it will never be set by the *user* of this stage,
> > > and consequently a dangling wire will end up being created.
> > > -----
> > > 
> > > what's "so" intended for?
> > 
> > Summary Overflow.
> 
> okaay.  so if it's a generated output, set only as a result as part of the
> operation (in the 2nd or 3rd stage), it should *not* be in the ispec(), only
> in the ospec() for later stage(s)

Summary Overflow is "sticky". Once one instruction sets the overflow bit, summary overflow gets set until it's cleared by a mtspr instruction. Effectively for each instruction it's (so = so | ov). That's why it's included in the input stage.

Comment 17 Luke Kenneth Casson Leighton 2020-05-09 14:45:26 BST

(In reply to Michael Nolan from comment #14)

> We still need a carry_in from XER for things like add with carry.

just to check, XER is a Condition Register, right?

we may almost certainly need to split out the *individual* fields of XER and other Condition Registers into actual separate "registers", even though they are one or two bits.

the reason is because if we do not it will create artificial dependency barriers between instructions that read or write some part of XER but not another.

therefore all of CR0-CR7 need to be *different* "operands", likewise OV32, OV and so on.

Comment 18 Luke Kenneth Casson Leighton 2020-05-09 14:48:32 BST

(In reply to Michael Nolan from comment #16)

> Summary Overflow is "sticky". Once one instruction sets the overflow bit,
> summary overflow gets set until it's cleared by a mtspr instruction.
> Effectively for each instruction it's (so = so | ov). That's why it's
> included in the input stage.

ohhh. yep.  got it now.  can you drop a comment into the data structure to that effect? the idea being that "why is this here" should be answered (or, a reminder of the answer)

Comment 19 Luke Kenneth Casson Leighton 2020-05-09 15:04:31 BST

ok as it's "pass-through", i added so (and a comment) to main_stage.
also added some similar comments (and pass-through of so) on
input_stage.

diff --git a/src/soc/alu/main_stage.py b/src/soc/alu/main_stage.py
index e801c96..07fb018 100644
--- a/src/soc/alu/main_stage.py
+++ b/src/soc/alu/main_stage.py
@@ -39,7 +39,9 @@ class ALUMainStage(PipeModBase):
             with m.Case(InternalOp.OP_XOR):
                 comb += self.o.o.eq(self.i.a ^ self.i.b)
 
+        ###### sticky overflow and context, both pass-through #####
+        comb += so.eq(self.i.so)
         comb += self.o.ctx.eq(self.i.ctx)
 
         return m

about this:

class ALUOutputStage(PipeModBase):
    def __init__(self, pspec):
        super().__init__(pspec, "output")

    def ispec(self):
        return ALUOutputData(self.pspec)

    def ospec(self):
        return ALUOutputData(self.pspec)

if ALUOutputSpec is designed to generate ov, ov32, cr0 in *this*
stage (final one), then an ALUIntermediateData is needed
which does not have ov, ov32 or cr0 (etc.) in it.

Comment 20 Luke Kenneth Casson Leighton 2020-05-09 15:06:38 BST

(In reply to Luke Kenneth Casson Leighton from comment #19)

> stage (final one), then an ALUIntermediateData is needed

p.s. feel free to pick a better name :)

Comment 21 Luke Kenneth Casson Leighton 2020-05-09 19:23:58 BST

hiya michael i deployed a trick that was used in the PartitionedAdder.
rather than use 2 separate adds, A and B are shifted up by 1 bit, then
you put a 1 into A[0] and the carry into B[0].  that way you do a 66
bit add, discard bit 0, result is bits 1-64, and carry is the MSB.

Comment 22 Luke Kenneth Casson Leighton 2020-05-09 20:08:05 BST

are you studying this btw?
https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl

Comment 23 Luke Kenneth Casson Leighton 2020-05-09 20:12:28 BST

that's interesting.  what's the popcounts all about?
https://github.com/antonblanchard/microwatt/blob/master/logical.vhdl

Comment 24 Luke Kenneth Casson Leighton 2020-05-09 22:18:30 BST

https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl

line 563 is where sraw, srad, srawi and sradi exist.  there's a *lot*
of code-duplication, which is fairly normal in "traditional" HDLs because
parameterisation can be harder to do.

Comment 25 Michael Nolan 2020-05-09 22:35:01 BST

(In reply to Luke Kenneth Casson Leighton from comment #24)
> https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl
> 
> line 563 is where sraw, srad, srawi and sradi exist.  there's a *lot*
> of code-duplication, which is fairly normal in "traditional" HDLs because
> parameterisation can be harder to do.

Those are already done in main_stage.py, though they could do with a bit more rigorous testing.

Comment 26 Michael Nolan 2020-05-09 22:42:03 BST

I just realized I'm going to run into some difficulty with rlwinm, as the values controlling the mask amounts (MB and ME) aren't exposed by Decode2ToExecute1. Should they be added, or is there another path you think we should take for things like this (there are almost certainly other instructions with fields like that).

Comment 27 Luke Kenneth Casson Leighton 2020-05-10 05:25:04 BST

(In reply to Michael Nolan from comment #26)
> I just realized I'm going to run into some difficulty with rlwinm, as the
> values controlling the mask amounts (MB and ME) aren't exposed by
> Decode2ToExecute1. Should they be added, or is there another path you think
> we should take for things like this (there are almost certainly other
> instructions with fields like that).

hmm hmm don't know.  microwatt throws the actual instruction opcode around, as a global, which is fine for single issue

for a large scale multi issue processor with pipelines, not so much.

option 1

* pass the undecoded insn via Execute1Tyoe and add the POWER decoder "field selector" to pspec.  this will get the bitfields and forms etc

option 2

* add mb and me and slowly more and more copies of fields over time.

honestly we will probably have to do (1) sooner rather than later although  (2) is the easier option.

any other ideas?

Comment 28 Luke Kenneth Casson Leighton 2020-05-10 05:30:53 BST

(In reply to Michael Nolan from comment #25)
> (In reply to Luke Kenneth Casson Leighton from comment #24)
> > https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl
> > 
> > line 563 is where sraw, srad, srawi and sradi exist.  there's a *lot*
> > of code-duplication, which is fairly normal in "traditional" HDLs because
> > parameterisation can be harder to do.
> 
> Those are already done in main_stage.py,

nice.

> though they could do with a bit
> more rigorous testing.

i noticed an assert on bits in SelectableInt not being happy.

cmp by the way apologies for moving the add into the switch statement.

don't do trap in this Function Unit, it changes the PC so i think really should be in a special FunctionUnit which outputs a new PC, alongside Branch etc.

Comment 29 Luke Kenneth Casson Leighton 2020-05-11 00:03:16 BST

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/alu/main_stage.py;hb=HEAD#l137

i'm fairly certain that this:

 136             with m.Case(InternalOp.OP_RLC):
 137                 with m.If(self.i.ctx.op.imm_data.imm_ok):
 138                     comb += rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5])
 139                 with m.Else():
 140                     comb += rotate_amt.eq(self.i.b[0:5])

can just be this:

 136             with m.Case(InternalOp.OP_RLC):
 137                 comb += rotate_amt.eq(self.i.b[0:5])

because in input_stage.py there is this:

  40         # If there's an immediate, set the B operand to that
  41         with m.If(self.i.ctx.op.imm_data.imm_ok):
  42             comb += self.o.b.eq(self.i.ctx.op.imm_data.imm)
  43         with m.Else():
  44             comb += self.o.b.eq(self.i.b)

Comment 30 Michael Nolan 2020-05-11 00:06:51 BST

(In reply to Luke Kenneth Casson Leighton from comment #29)
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/alu/main_stage.py;
> hb=HEAD#l137
> 
> i'm fairly certain that this:
> 
>  136             with m.Case(InternalOp.OP_RLC):
>  137                 with m.If(self.i.ctx.op.imm_data.imm_ok):
>  138                     comb +=
> rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5])
>  139                 with m.Else():
>  140                     comb += rotate_amt.eq(self.i.b[0:5])
> 
> can just be this:
> 
>  136             with m.Case(InternalOp.OP_RLC):
>  137                 comb += rotate_amt.eq(self.i.b[0:5])
> 
> because in input_stage.py there is this:
> 
>   40         # If there's an immediate, set the B operand to that
>   41         with m.If(self.i.ctx.op.imm_data.imm_ok):
>   42             comb += self.o.b.eq(self.i.ctx.op.imm_data.imm)
>   43         with m.Else():
>   44             comb += self.o.b.eq(self.i.b)

The issue is the `rlwimi` instruction, which in addition to RS and an immediate also reads in RA. Because of this, I disabled putting the immediate into operand b in the input stage of the ALU for the RLC opcode

Comment 31 Luke Kenneth Casson Leighton 2020-05-11 00:11:51 BST

these would be better (again) as (sigh) add(x,y) etc.  sub(63, n) rather than 63-n

also if the use of EXTZ64 *really* is necessary, that's actually a bug in the spec
rather than something that we should arbitrarily fix, and it should be reported.

@@ -168,9 +168,9 @@
     @inject()
     def op_slw(self, RB, RS):
         n = RB[59:64]
-        r = ROTL32(RS[32:64], n)
+        r = ROTL32(EXTZ64(RS[32:64]), n.value)
         if eq(RB[58], 0):
-            m = MASK(32, 63 - n)
+            m = MASK(32, 63 - n.value)
         else:
             m = concat(0, repeat=64)
         RA = r & m

Comment 32 Luke Kenneth Casson Leighton 2020-05-11 00:15:34 BST

(In reply to Michael Nolan from comment #30)

> The issue is the `rlwimi` instruction, which in addition to RS and an
> immediate also reads in RA. Because of this, I disabled putting the
> immediate into operand b in the input stage of the ALU for the RLC opcode

interesting.  am taking a closer look.  it doesn't look that different.

Comment 33 Luke Kenneth Casson Leighton 2020-05-11 00:42:01 BST

i haven't committed it (i leave it to you to evaluate)
this passes the tests nosetests3 decoder/isa/test_caller.py 

register RA is read in - however it's not involved in the datapath for RB.

here's the pseudocode 

* rlwinm RA,RS,SH,MB,ME (Rc=0)
* rlwinm.  RA,RS,SH,MB,ME (Rc=1)

    n <- SH
    r <- ROTL32((RS)[32:63], n)
    m <- MASK(MB+32, ME+32)
    RA <- r & m

* rlwnm RA,RS,RB,MB,ME (Rc=0)
* rlwnm.  RA,RS,RB,MB,ME (Rc=1)

    n <- (RB)[59:63]
    r <- ROTL32((RS)[32:63], n)
    m <- MASK(MB+32, ME+32)
    RA <- r & m

the only difference is in n

both are 5 bit, SH should be coming from the immediate-decoding, (RB)[59:63]
means get the 5 LSBs of RB.

i believe we're ok making the change below.  it doesn't make sense that IBM
would design something that needs a special exception path for the immediate,
just for one instruction.


lkcl@fizzy:~/src/libreriscv/soc/src/soc$ git diff
diff --git a/src/soc/alu/input_stage.py b/src/soc/alu/input_stage.py
index 389954b..2dda52a 100644
--- a/src/soc/alu/input_stage.py
+++ b/src/soc/alu/input_stage.py
@@ -39,8 +39,7 @@ class ALUInputStage(PipeModBase):
         ##### operand B #####
 
         # If there's an immediate, set the B operand to that
-        with m.If(self.i.ctx.op.imm_data.imm_ok &
-                  ~(self.i.ctx.op.insn_type == InternalOp.OP_RLC)):
+        with m.If(self.i.ctx.op.imm_data.imm_ok):
             comb += self.o.b.eq(self.i.ctx.op.imm_data.imm)
         with m.Else():
             comb += self.o.b.eq(self.i.b)
diff --git a/src/soc/alu/main_stage.py b/src/soc/alu/main_stage.py
index 59c896a..173ca78 100644
--- a/src/soc/alu/main_stage.py
+++ b/src/soc/alu/main_stage.py
@@ -134,10 +134,7 @@ class ALUMainStage(PipeModBase):
                     comb += self.o.o.eq(rotl_out & mask)
 
             with m.Case(InternalOp.OP_RLC):
-                with m.If(self.i.ctx.op.imm_data.imm_ok):
-                    comb += rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5])
-                with m.Else():
-                    comb += rotate_amt.eq(self.i.b[0:5])
+                comb += rotate_amt.eq(self.i.b[0:5])
                 comb += maskgen.mb.eq(mb+32)
                 comb += maskgen.me.eq(me+32)
                 comb += mask.eq(maskgen.o)

Comment 34 Luke Kenneth Casson Leighton 2020-05-11 01:04:15 BST

https://libre-soc.org/openpower/isatables/

hmmm, anton's team has synthesised the commonality between the different
rotates and shifts

https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl

back in execute1.vhdl they do this:

  right_shift <= '1' when e_in.insn_type = OP_SHR else '0';
  rot_clear_left <= '1' when e_in.insn_type = OP_RLC or e_in.insn_type = OP_RLCL else '0';
  rot_clear_right <= '1' when e_in.insn_type = OP_RLC or e_in.insn_type = OP_RLCR else '0';

Comment 35 Luke Kenneth Casson Leighton 2020-05-11 02:11:55 BST

i can do a brain-dead translation of this code to nmigen, tomorrow, if you're
good with that?  (i quite like doing no-brain work...)
https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl

Comment 36 Luke Kenneth Casson Leighton 2020-05-11 11:30:59 BST

(In reply to Luke Kenneth Casson Leighton from comment #35)
> i can do a brain-dead translation of this code to nmigen, tomorrow, if you're
> good with that?  (i quite like doing no-brain work...)
> https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl

okok i just did it anyway :)
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=0604fe1e0b040def2e3f599fc523340eeeefe519

i cut out the hand-coded ROTL and replaced it with... ROTL.

Comment 37 Michael Nolan 2020-05-11 14:23:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #33)
> i haven't committed it (i leave it to you to evaluate)
> this passes the tests nosetests3 decoder/isa/test_caller.py 
> 
> register RA is read in - however it's not involved in the datapath for RB.
> 
> here's the pseudocode 
> 
> * rlwinm RA,RS,SH,MB,ME (Rc=0)
> * rlwinm.  RA,RS,SH,MB,ME (Rc=1)
> 
>     n <- SH
>     r <- ROTL32((RS)[32:63], n)
>     m <- MASK(MB+32, ME+32)
>     RA <- r & m
> 
> * rlwnm RA,RS,RB,MB,ME (Rc=0)
> * rlwnm.  RA,RS,RB,MB,ME (Rc=1)
> 
>     n <- (RB)[59:63]
>     r <- ROTL32((RS)[32:63], n)
>     m <- MASK(MB+32, ME+32)
>     RA <- r & m
> 
> the only difference is in n
> 
> both are 5 bit, SH should be coming from the immediate-decoding, (RB)[59:63]
> means get the 5 LSBs of RB.
> 
> i believe we're ok making the change below.  it doesn't make sense that IBM
> would design something that needs a special exception path for the immediate,
> just for one instruction.


Look at rlwimi:

n <- SH
r <- ROTL32(RS[32:63], n)
m <- MASK(MB+32, ME+32)
RA <- r & m | (RA) & ~m

It definitely does require reading in RS, RA, and SH

Comment 38 Michael Nolan 2020-05-11 15:31:39 BST

Ugh. I added support to the alu testbench for reading the registers specified in the instruction, instead of hardcoding them to 1, 2, and 3. I have it read the register select ports in the decoder in the order [3, 1, 2], and the first 2 with valid data end up in the a and b inputs of the ALU. I'm a bit annoyed about the weird order, but everything seems to work

Comment 39 Luke Kenneth Casson Leighton 2020-05-11 16:51:36 BST

(In reply to Michael Nolan from comment #37)

> Look at rlwimi:
> 
> n <- SH
> r <- ROTL32(RS[32:63], n)
> m <- MASK(MB+32, ME+32)
> RA <- r & m | (RA) & ~m
> 
> It definitely does require reading in RS, RA, and SH

that's ok - yes it reads RS, RA and SH. what it doesn't do is need to
switch off RB imm decoding.  at least, let me put it this way:
the unit test passes fine with that patch.

Comment 40 Luke Kenneth Casson Leighton 2020-05-11 17:03:25 BST

(In reply to Michael Nolan from comment #38)
> Ugh. I added support to the alu testbench for reading the registers
> specified in the instruction, instead of hardcoding them to 1, 2, and 3. I
> have it read the register select ports in the decoder in the order [3, 1,
> 2], and the first 2 with valid data end up in the a and b inputs of the ALU.
> I'm a bit annoyed about the weird order, but everything seems to work

hmmm hmmm hang on hang on - this is removing the order.  if there's 3 regs,
then 3 regs are needed.


what's needed is like this:

    inputs = []

    reg3_ok = yield dec2.e.read_reg3.ok
    reg3_sel = 0
    if reg3_ok:
        reg3_sel = yield dec2.e.read_reg3.data
        reg3_sel = (sim.gpr(reg3_sel).value
    inputs.append(reg3_sel, reg3_ok)

    ...
    ...

followed by:

    yield alu.p.data_i.a.eq(input[0][0]) # RA
    yield alu.p.data_i.b.eq(input[1][0]) # RB
    yield alu.p.data_i.c.eq(input[2][0]) # RC



def set_alu_inputs(alu, dec2, sim):
    inputs = []
    reg3_ok = yield dec2.e.read_reg3.ok
    if reg3_ok:
        reg3_sel = yield dec2.e.read_reg3.data
        inputs.append(sim.gpr(reg3_sel).value)
    reg1_ok = yield dec2.e.read_reg1.ok
    if reg1_ok:
        reg1_sel = yield dec2.e.read_reg1.data
        inputs.append(sim.gpr(reg1_sel).value)
    reg2_ok = yield dec2.e.read_reg2.ok
    if reg2_ok:
        reg2_sel = yield dec2.e.read_reg2.data
        inputs.append(sim.gpr(reg2_sel).value)

    print(inputs)

    if len(inputs) == 0:
        yield alu.p.data_i.a.eq(0)
        yield alu.p.data_i.b.eq(0)
    if len(inputs) == 1:
        yield alu.p.data_i.a.eq(inputs[0])
        yield alu.p.data_i.b.eq(0)
    if len(inputs) == 2:
        yield alu.p.data_i.a.eq(inputs[0])
        yield alu.p.data_i.b.eq(inputs[1])

Comment 41 Luke Kenneth Casson Leighton 2020-05-11 17:13:04 BST

ok - *sigh* - we _could_ define it this way however if you look in
the decode1/2.vhdl you'll see that anton's team defined RS == RC.
(reg3_data).

https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L124

output reg is defined as RA and RT (yes we need a 2nd one)
https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L135


in this way i believe RB *always* ends up as "the register that
can be replaced by an immediate at the input side"

https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L79

you see what i mean?

so it's because reg positions are lost that RA suddenly receives reg data
that was supposed to be in RB

Comment 42 Michael Nolan 2020-05-11 17:13:53 BST

(In reply to Luke Kenneth Casson Leighton from comment #40)
> hmmm hmmm hang on hang on - this is removing the order.  if there's 3 regs,
> then 3 regs are needed.
> 
> 
> what's needed is like this:
> <snip>...</snip> 
>     yield alu.p.data_i.a.eq(input[0][0]) # RA
>     yield alu.p.data_i.b.eq(input[1][0]) # RB
>     yield alu.p.data_i.c.eq(input[2][0]) # RC
> 
I went through the tables and double checked, the *only* ALU instructions that have 3 inputs are 2 registers and an immediate. It does not *need* 3 register inputs. 

However, some piece of hardware needs to order the inputs, I guess the trade off is whether to do it in the alu or somewhere else.

Comment 43 Luke Kenneth Casson Leighton 2020-05-11 19:15:26 BST

(In reply to Michael Nolan from comment #42)

> I went through the tables and double checked, the *only* ALU instructions
> that have 3 inputs are 2 registers and an immediate. It does not *need* 3
> register inputs. 

yehyeh.  ok.

really... normally, what would happen is that anything that's significantly
different like this would get put into its own Function Unit (into its own
ALU Pipeline).

then, there's no "switching about at the ALU side"


 
> However, some piece of hardware needs to order the inputs, I guess the trade
> off is whether to do it in the alu or somewhere else.

i think one of the reasons why processors go with a convention "B is
reg or immediate" is to save wires.  you've got 64 bits worth of wires for
B, B is not used all the time, so when it's not, use it to farm in
the "immediate" data.

so i think we'll find... 1 sec let's look at the original decode2.vhdl...

ah yes:
https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L89

ok so that's definitely putting the *actual* immediate data into reg b.
decode_input_reg_b() is where the decision is made to either put the
regfile data into the return result or put the immediate in it.

so given that we're using microwatt's "decoder" we really need to
follow the same conventions.  and here's where the execution has
*three* inputs:

https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl#L61

a_in, b_in, c_in.

so my feeling is we:

* have the ALU with the same 3 inputs (even though we know only 2 of them
  are used), and keep the same order

* where the immediate is availabe assume it is always dropped into the b_in
  64-bit wires

thus:

* remove the MUX from ALUInputStage (that means in the unit tests
  it has to be done there)

        with m.If(self.i.ctx.op.imm_data.imm_ok)
            comb += self.o.b.eq(self.i.ctx.op.imm_data.imm)
        with m.Else():
            comb += self.o.b.eq(self.i.b)

so, in set_alu_inputs it would be:

    inputs = []
    # C (or A?)

    reg3_ok = yield dec2.e.read_reg3.ok
    reg3_sel = 0
    if reg3_ok:
        reg3_sel = yield dec2.e.read_reg3.data
        reg3_sel = (sim.gpr(reg3_sel).value
    inputs.append(reg3_sel, reg3_ok)

    # B (or imm)
    with m.If(self.i.ctx.op.imm_data.imm_ok)
        reg2_sel = yield self.i.ctx.op.imm_data.imm)
    with m.Else():
         reg2_ok = yield dec2.e.read_reg2.ok
         reg2_sel = 0
         if reg2_ok:
            reg2_sel = yield dec2.e.read_reg2.data
            reg2_sel = (sim.gpr(reg2_sel).value
    inputs.append(reg2_sel, reg2_ok)
 
    # A (or C?)

and then still have 3 inputs, and in this way we save 64 bits
worth of wires because B is used to input the immediate.

one other key reason for doing it this way is because the FunctionUnit
front-end is where the decisions are made about reading from the
regfile (and acknowleding them).  so it has to decode the Operand, there,
anyway, because it must not even send a *request* for Register B if
it does not actually need it.

Comment 44 Luke Kenneth Casson Leighton 2020-05-11 19:23:29 BST

see this (yes it's LD/ST not an ALU)
https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

you see how Op2 (RB) is selected through a mux, based on the Imm/Opcode?
the ALU (the weird "boat" symbol) doesn't get to make the decision, it
gets "told".

however if you follow the "imm_ok" wire from the Imm/Opcode latch, you'll
see it PREVENTS REQ_RD2 from even being activated, when "imm_ok" is true.

this is the primary purpose (role) of the Computation Units: make requests
for Register File Ports *that are needed*.

Comment 45 Luke Kenneth Casson Leighton 2020-05-12 10:24:33 BST

Ran 1 test in 1.662s

that's a little bit better :)

Comment 46 Luke Kenneth Casson Leighton 2020-05-12 11:03:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #43)
> (In reply to Michael Nolan from comment #42)
> 
> > I went through the tables and double checked, the *only* ALU instructions
> > that have 3 inputs are 2 registers and an immediate. It does not *need* 3
> > register inputs. 
> 
> yehyeh.  ok.
> 
> really... normally, what would happen is that anything that's significantly
> different like this would get put into its own Function Unit (into its own
> ALU Pipeline).

https://libre-soc.org/openpower/isa/fixedshift/

... actually... you know what?  all of those are M-Form and MD-Form

whereas, quickly analysing these:
https://libre-soc.org/openpower/isa/fixedarith/

they're D-Form, DX-Form, XO-Form, and XO-Form.

some of the weird ones (addex) are Z23-Form, and the MAC ones are VA-Form

on balance therefore it kinda makes sense to split fixedshift out into
its own separate ALU / pipeline.  aside from anything, it may turn out
to be the case that whilst add, sub etc. can be done in a single clock,
shift has a strong possibility of needing 2 (in really high speed
designs).  that would be easier to do if they were separate pipelines.

Comment 47 Luke Kenneth Casson Leighton 2020-05-12 22:33:32 BST

ok let's put the forms side-by-side, from fields.txt

#1.6.4 D-FORM
   | OPCD |    RT      |   RA|   D        |

# V3.0B 1.6.6 DX-FORM
   | OPCD|  RT|   d1|   d0|   XO|d2

# 1.6.16 XO-FORM
   | OPCD |  RT|   RA|   RB  |OE |   XO |Rc  |


and the shift ones are:

# 1.6.18 M-FORM
   | OPCD |  RS |   RA |   RB |   MB |   ME |Rc|

# 1.6.19 MD-FORM
   | OPCD |  RS |   RA |   sh |   mb |XO|sh|Rc|

# 1.6.7 X-FORM
   | OPCD |       RT      |    RA       |     ///     |   XO |  / |


so... well... i'm not actually seeing any 3-operand-input forms, there.
they're all 2-operand or 1-operand-plus-immediate.

note: 3-operand *input* not 3-operand *total including output operand*.

so i'm now wondering what on earth is going on, because we shouldn't need
to be reading reg3 at all.

any clues?

Comment 48 Michael Nolan 2020-05-12 23:02:08 BST

(In reply to Luke Kenneth Casson Leighton from comment #47)
> ok let's put the forms side-by-side, from fields.txt
> 
> #1.6.4 D-FORM
>    | OPCD |    RT      |   RA|   D        |
> 
> # V3.0B 1.6.6 DX-FORM
>    | OPCD|  RT|   d1|   d0|   XO|d2
> 
> # 1.6.16 XO-FORM
>    | OPCD |  RT|   RA|   RB  |OE |   XO |Rc  |
> 
> 
> and the shift ones are:
> 
> # 1.6.18 M-FORM
>    | OPCD |  RS |   RA |   RB |   MB |   ME |Rc|
> 
> # 1.6.19 MD-FORM
>    | OPCD |  RS |   RA |   sh |   mb |XO|sh|Rc|
> 
> # 1.6.7 X-FORM
>    | OPCD |       RT      |    RA       |     ///     |   XO |  / |
> 
> 
> so... well... i'm not actually seeing any 3-operand-input forms, there.
> they're all 2-operand or 1-operand-plus-immediate.
> 
> note: 3-operand *input* not 3-operand *total including output operand*.
> 
> so i'm now wondering what on earth is going on, because we shouldn't need
> to be reading reg3 at all.
> 
> any clues?

We only need two register operands and an immediate for all ALU instructions. The only ones with 3 register operands are store indexed instructions

Comment 49 Luke Kenneth Casson Leighton 2020-05-13 02:05:04 BST

ok.  i think we need to look more closely at decoder2.py compared to decode2.vhdl
https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl

i'm not seeing any reason though why the operands should need "swapping over":
why RC is needing to go into a.

the alignment should definitively be:

* reg3_data into c
* reg2_data (or imm) into b
* reg1_data into a

ah hang on...

https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl

see lines 183-184:

    logical_0: entity work.logical
	port map (
	    rs => c_in,
	    rb => b_in,

and here, lines 166-169:

    rotator_0: entity work.rotator
	port map (
	    rs => c_in,
	    ra => a_in,

that's it.  that's what's going on.  the different pipelines have *different*
register map/alignment, *depending on function-group*.

the *adder* takes A as A, and B as B.  rotator crosses over, logical crosses
over.

what a mess.

ok what i suggest, then, here is: split out Logical ALU operations as well.
the ADD ALU will deal with OP_ADD and OP_CMP.  i don't yet know if we should
allow the ADD ALU to change the PC (handle OP_TRAP), i'm inclined to suggest
that that one be handled in the Branch Computation Unit.

by splitting out:

* ADD/CMP into one pipeline
* Logical into another
* Rotator into another

basically on the face of it we need to use the FunctionOp to determine the
re-routing of the operands.

but... it looks like you've discovered a correlation, where if reg_data3 is
available it can go *first* (into out.a), otherwise if reg_data1 is available, 
*that* goes first (into out.a)

however B i think definitely goes into out.b, or the immediate definitely
goes into out.b, and i think we'll find that there is no "immediate" form
with a register B.

fields.txt

    RB (16:20)
        Field used to specify a GPR to be used as a
        source.
        Formats: A, M, MDS, VA, X, XO

ok so strictly speaking MDS-Form has immediates in it, however if you look
at what anton's done, in minor-opcode table 30:
https://libre-soc.org/openpower/isatables/

he's marked rdlcr as *not* reading the immediate (bit 21, mb/me) as an
immediate, but instead operand2 is marked as "RB".  which explains why
we have to decode insn to get at dec2.mb and dec2.me.

so i think we're good to go with reg2_data (or imm_data) always going into out.b
and perhaps to have an assert that checks, for these 2-operand pipelines,
that reg3_ok and reg1_ok can never be set simultaneously, and to always put
reg3_data or reg1_data into out.a.  perhaps even as an OR rather than a Mux.

Comment 50 Michael Nolan 2020-05-13 17:46:39 BST

I tried using the rotator.py taken from microwatt, and it seems it doesn't work. After fixing some syntax errors, I set the inputs up for a left shift, and it did not output what I expected. I spent around 45 minutes setting up a testbench for microwatt's rotator.vhdl, and it behaves correctly for the same inputs.

Considering that I have already written a working shifter, is it worthwhile trying to get this one to work?

Comment 51 Luke Kenneth Casson Leighton 2020-05-13 18:22:53 BST

(In reply to Michael Nolan from comment #50)
> I tried using the rotator.py taken from microwatt, and it seems it doesn't
> work. After fixing some syntax errors, I set the inputs up for a left shift,
> and it did not output what I expected. I spent around 45 minutes setting up
> a testbench for microwatt's rotator.vhdl, and it behaves correctly for the
> same inputs.

i may have messed things up in the translation from VHDL to nmigen, i didn't
write a unit test for it.

> Considering that I have already written a working shifter, is it worthwhile
> trying to get this one to work?

there are some significant elegant aspects of the microwatt-based one, namely
that it covers *all* shift operations in that short and compact code.  it uses
the trick of having only 1 64-bit shifter (ROTL) instance, putting the 32-bit
data in *twice* - once into the top 32-bit and again into the bottom 32-bit -
reducing the amount of hardware needed to perform both 32 and 64 bit shifts
by at least 30%.

however if you feel that you can get something operational and compliant quicker
than by debugging rotator.py then go for it.  we can look at clean-up and merge,
later.

Comment 52 Luke Kenneth Casson Leighton 2020-05-13 18:29:36 BST

(In reply to Michael Nolan from comment #50)
> I tried using the rotator.py taken from microwatt, and it seems it doesn't
> work. After fixing some syntax errors, I set the inputs up for a left shift,
> and it did not output what I expected.

can you commit what you have, i'm background-curious to see the output.

Comment 53 Luke Kenneth Casson Leighton 2020-05-13 19:46:13 BST

you got it working, cool!

Comment 54 Luke Kenneth Casson Leighton 2020-05-13 19:48:32 BST

oops soc.shift_rotate.input_stage missing.  remember use "git status" to check :)

Comment 55 Luke Kenneth Casson Leighton 2020-05-13 19:57:54 BST

(In reply to Luke Kenneth Casson Leighton from comment #54)
> oops soc.shift_rotate.input_stage missing.  remember use "git status" to
> check :)

got it.... :)

Comment 56 Luke Kenneth Casson Leighton 2020-05-13 22:54:27 BST

you probably noticed, i split out Logical operations into a separate
pipeline, mirroring logical.vhdl
https://github.com/antonblanchard/microwatt/blob/master/logical.vhdl

popcount, parity, i've added TODO.

Comment 57 Luke Kenneth Casson Leighton 2020-05-14 01:20:45 BST

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=de6e2485dff057322f6686ddeba6458b63a47159

ah well spotted, the correct field

Comment 58 Luke Kenneth Casson Leighton 2020-05-14 15:48:35 BST

left some TODOs in the code this morning, UK time:

@@ -41,6 +41,12 @@ class LogicalMainStage(PipeModBase):
             # TODO with m.Case(InternalOp.OP_POPCNT):
             ###### parity #######
             # TODO with m.Case(InternalOp.OP_PRTY):
+            ###### cmpb #######
+            # TODO with m.Case(InternalOp.OP_CMPB):
+            ###### cntlz #######
+            # TODO with m.Case(InternalOp.OP_CNTZ):
+            ###### bpermd #######
+            # TODO with m.Case(InternalOp.OP_BPERM): - not in microwatt

and cmp because it is so similar to ADD should go in ALU

  37         with m.Switch(self.i.ctx.op.insn_type):
  38             #### CMP, CMPL ####
  39             # TODO with m.Case(InternalOp.OP_CMP):

i've made the corresponding changes to power_enum Function (and to
the CSV files) already.

Comment 59 Michael Nolan 2020-05-14 16:34:27 BST

*headdesk*
cmp is BACKWARDS! 

The following will run both a subtract from and cmp and save their respective updates to cr0 to r2 and r1 respectively:

	li 6, 0x10
	li 7, 0x05     # set up inputs
	li 1, 0
	mtcr 1         # clear cr
	subf. 5, 6, 7  
	mfcr 2         # save cr to r2
	mtcr 1         # clear cr again
	cmpl cr0, 1, 6, 7   # This *should* give the same result as the subf above
	mfcr 1         # save cr to r1


(gdb) p/x $r1
$1 = 0x40000000
(gdb) p/x $r2
$2 = 0x80000000

Why IBM, why did you do this?

Comment 60 Luke Kenneth Casson Leighton 2020-05-14 17:17:37 BST

sub pseudocode:

* subf RT,RA,RB (OE=0 Rc=0)

    RT <- ¬(RA) + (RB) + 1

which is... b-a

and in alu/output_stage:

comb += self.o.cr0.eq(Cat(so, is_zero, is_positive, is_negative))

so bit 0 of CR0 is "is zero"
bit 1 is "is +ve"
bit 2 is "is -ve"

cmp pseudocode:

    if L = 0 then
        a <- EXTS((RA)[32:63] )
        b <- EXTS((RB)[32:63])
    else
        a <- (RA)
        b <- (RB)
    if      a < b then c <-  0b100
    else if a > b then c <-  0b010
    else               c <-  0b001
    CR[4*BF+32:4*BF+35] <-  c || XER[SO]

and cmpi pseudocode:

    if L = 0 then a <-  EXTS((RA)[32:63])
    else a <-  (RA)
    if      a < EXTS(SI) then c <- 0b100
    else if a > EXTS(SI) then c <- 0b010
    else                      c <- 0b001
    CR[4*BF+32:4*BF+35] <- c || XER[SO]

so that is *a* less than B/immediate, bit 2 (-ve) is set)
*a* greater than B/immediate, bit 1 (+ve) is set)

which is the opposite of add/subtract.

i presume this is best implemented by calculating "a-b" (rather than b-a).  note
in microwatt, there's comments at line 540 about this:

https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl

we can't invert the operands outside of the pipeline: that means
that the register port allocation system needs to have a "decoder" in it.

the inversion needs to be done *inside* the pipeline, not outside...
ok just checking alu/input_stage.py, ok yes that looks good.

i say "good", i mean "functional but messy as hell" :)

i wonder... i wonder if, rather than mess up the input stage *and* the
main stage, which involves two 64-bit MUXes, if it would not be better
to swap is_positive and is_negative in the *output* stage?

this would only be QTY 2 single-bit MUXes

Comment 61 Luke Kenneth Casson Leighton 2020-05-14 17:21:38 BST

(In reply to Luke Kenneth Casson Leighton from comment #60)

> i wonder... i wonder if, rather than mess up the input stage *and* the
> main stage, which involves two 64-bit MUXes, if it would not be better
> to swap is_positive and is_negative in the *output* stage?
> 
> this would only be QTY 2 single-bit MUXes

i'll give this a shot, see what happens.

Comment 62 Luke Kenneth Casson Leighton 2020-05-14 17:35:21 BST

(In reply to Luke Kenneth Casson Leighton from comment #61)

> i'll give this a shot, see what happens.

it worked!  i'll do the formal proof output stage modifications

Comment 63 Luke Kenneth Casson Leighton 2020-05-14 17:54:25 BST

(In reply to Michael Nolan from comment #59)
> *headdesk*

move the keyboard to the left, first...

> cmp is BACKWARDS!

:)

> (gdb) p/x $r1
> $1 = 0x40000000
> (gdb) p/x $r2
> $2 = 0x80000000

are ya having fun, yet?
 
> Why IBM, why did you do this?

hm i dunno... i mean... it could be justified if it's less code to check
+ve condition quicker than it is to check -ve condition or vice-versa.

then you would use sub for one and cmp for the other.

honestly: am clueless :)

Comment 64 Luke Kenneth Casson Leighton 2020-05-14 18:14:34 BST

(In reply to Luke Kenneth Casson Leighton from comment #63)

> honestly: am clueless :)

ah i think i might know why.  as we found out in the PartitionedSignal
GT / LT (etc) code, a comparator is much simpler than a full adder.

it's only *because* we are using the output from the adder (actually,
subtracter) - purely because it's there - that we're noticing the
inversion.

if however we had a totally separate Function Unit for CMP, we'd use
the less-gates solution, and consequently could swap A and B over.
then you'd never notice the difference :)

Comment 65 Michael Nolan 2020-05-14 18:29:15 BST

(In reply to Luke Kenneth Casson Leighton from comment #64)
> (In reply to Luke Kenneth Casson Leighton from comment #63)
> 
> > honestly: am clueless :)
> 
> ah i think i might know why.  as we found out in the PartitionedSignal
> GT / LT (etc) code, a comparator is much simpler than a full adder.
> 
> it's only *because* we are using the output from the adder (actually,
> subtracter) - purely because it's there - that we're noticing the
> inversion.
> 
> if however we had a totally separate Function Unit for CMP, we'd use
> the less-gates solution, and consequently could swap A and B over.
> then you'd never notice the difference :)

That makes a little more sense. Though if they're going to the trouble of making subf the way it is, you'd think they'd make this the same way. 

Oh well

Comment 66 Luke Kenneth Casson Leighton 2020-05-14 19:42:58 BST

(In reply to Michael Nolan from comment #65)

> That makes a little more sense. Though if they're going to the trouble of
> making subf the way it is, you'd think they'd make this the same way. 
> 
> Oh well

sigh :)

i added popcount (because).  it looks like it works except test_cmpb is
not working (which stops the simulation early), and if i comment it out,
the simulation popcount returns an integer, not a SelectableInt, so goes
skew-y.  could you take a look?

Comment 67 Luke Kenneth Casson Leighton 2020-05-14 20:06:50 BST

(In reply to Luke Kenneth Casson Leighton from comment #66)

> i added popcount (because).  it looks like it works except test_cmpb is
> not working

after updating submodules (will commit in a sec)

  File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", line 193, in op_cmpb
    if eq(RS[8 * n:8 * n + 7 + 1], RB[8 * n:8 * n + 7 + 1]):
NameError: name 'RS' is not defined

Comment 68 Luke Kenneth Casson Leighton 2020-05-14 20:10:19 BST

  File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", line 223, in op_prtyd
    s = s / RS[i % 8 + 7]

erm... ermermerm... that doesn't look right.  divide?  instead of ^ (XOR)?

TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt'

i think, really, we need to do the same trick done with eq, ge etc, turn all
operations from using python "op1 + op2" into "add(op1, op2)" and sort out
the operands - convert to SelectableInt - in there.

Comment 69 Michael Nolan 2020-05-14 20:21:03 BST

(In reply to Luke Kenneth Casson Leighton from comment #67)
> (In reply to Luke Kenneth Casson Leighton from comment #66)
> 
> > i added popcount (because).  it looks like it works except test_cmpb is
> > not working
> 
> after updating submodules (will commit in a sec)
> 
>   File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py",
> line 193, in op_cmpb
>     if eq(RS[8 * n:8 * n + 7 + 1], RB[8 * n:8 * n + 7 + 1]):
> NameError: name 'RS' is not defined

Oops, needed a patch (or figuring out why the compiler didn't include RS)

Comment 70 Michael Nolan 2020-05-14 20:27:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #68)
>   File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py",
> line 223, in op_prtyd
>     s = s / RS[i % 8 + 7]
> 
> erm... ermermerm... that doesn't look right.  divide?  instead of ^ (XOR)?
> 
> TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt'
> 
> i think, really, we need to do the same trick done with eq, ge etc, turn all
> operations from using python "op1 + op2" into "add(op1, op2)" and sort out
> the operands - convert to SelectableInt - in there.

There's no way they actually mean division there... If so, then every time that bit is 0 you get a divide by 0...

Comment 71 Luke Kenneth Casson Leighton 2020-05-14 20:48:50 BST

(In reply to Michael Nolan from comment #70)
> (In reply to Luke Kenneth Casson Leighton from comment #68)
> >   File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py",
> > line 223, in op_prtyd
> >     s = s / RS[i % 8 + 7]
> > 
> > erm... ermermerm... that doesn't look right.  divide?  instead of ^ (XOR)?

> There's no way they actually mean division there... If so, then every time
> that bit is 0 you get a divide by 0...

yep it's a manual transcription error from when i extracted the text from the
PDF.  i've git pushed ^ instead of /, done the submodule update too.

Comment 72 Michael Nolan 2020-05-15 15:00:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #68)
>   File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py",
> line 223, in op_prtyd
>     s = s / RS[i % 8 + 7]
> 
> erm... ermermerm... that doesn't look right.  divide?  instead of ^ (XOR)?
> 
> TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt'
> 

The modulus doesn't fit with the description of what the instruction does or with the behavior on qemu. I think they meant for that to be a multiplication.

Should we report this to IBM? It looks like they corrected the xor part in the 3.1 spec, but they're still using modulus for the bit indices.

Comment 73 Luke Kenneth Casson Leighton 2020-05-15 15:13:18 BST

this is from V3.0B - page 98:

s <- 0
t <- 0
do i = 0 to 3
   s <- s / (RS)[i%8+7]
do i = 4 to 7
   t <- t / (RS)[i%8+7]

so yes that's wrong (divide).


microwatt ppc_fx_insns.vhdl has this:

    function ppc_prtyd (rs: std_ulogic_vector(63 downto 0)) return std_ulogic_vector is
        variable tmp : std_ulogic;
        variable ret : std_ulogic_vector(63 downto 0);
    begin
        ret := (others => '0');

        tmp := '0';
        for i in 0 to 7 loop
            tmp := tmp xor rs(i*8);
        end loop;

        ret(0) := tmp;
        return ret;
    end;

and that's definitely "times 8" there.

note that they didn't end up using that function, the loop was unrolled as
follows:

        par0 <= rs(0) xor rs(8) xor rs(16) xor rs(24);
        par1 <= rs(32) xor rs(40) xor rs(48) xor rs(56);

so yes, please do correct the pseudo-code and report it as a bug in
the spec to IBM (good idea to cross-reference this discussion)

Comment 74 Michael Nolan 2020-05-15 15:19:15 BST

(In reply to Luke Kenneth Casson Leighton from comment #73)
> 
> so yes, please do correct the pseudo-code and report it as a bug in
> the spec to IBM (good idea to cross-reference this discussion)

Cool. I've not interacted with IBM before, how do I go about reporting a bug?

Comment 75 Luke Kenneth Casson Leighton 2020-05-15 15:23:19 BST

(In reply to Michael Nolan from comment #74)
> (In reply to Luke Kenneth Casson Leighton from comment #73)
> > 
> > so yes, please do correct the pseudo-code and report it as a bug in
> > the spec to IBM (good idea to cross-reference this discussion)
> 
> Cool. I've not interacted with IBM before, how do I go about reporting a bug?

oh, sorry: just send a message to openhdl-cores... you may have to subscribe
http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/
http://lists.mailinglist.openpowerfoundation.org/mailman/listinfo/openpower-hdl-cores

i can do it if you like, so you can focus on coding, etc. up to you, i
don't mind.

Comment 76 Michael Nolan 2020-05-15 16:07:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #75)

> oh, sorry: just send a message to openhdl-cores... you may have to subscribe
> http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-
> cores/
> http://lists.mailinglist.openpowerfoundation.org/mailman/listinfo/openpower-
> hdl-cores
> 
> i can do it if you like, so you can focus on coding, etc. up to you, i
> don't mind.

Reported and the pseudocode has been fixed

Comment 77 Luke Kenneth Casson Leighton 2020-05-15 16:29:08 BST

(In reply to Michael Nolan from comment #76)

> Reported and the pseudocode has been fixed

great.  ran test_prty and it worked great.  hmm at some point we could do with
an option to run tests against qemu ppc[32/64][le/be]

next on the planning / TODO list, michael:

* OP_B* plus OP_TRAP, OP_RFID (anything that sets the PC by setting NIA)
* condition register stuff: OP_MFCR, OP_CROP, OP_MTCRF
* SPR stuff: OP_MTSPR and OP_MFSPR

probably good idea to raise separate bugreports for these.  i have no idea
where OP_ISEL or OP_MTMSRD would fit :)

Comment 78 Luke Kenneth Casson Leighton 2020-05-18 02:40:12 BST

michael i am going to do a codeshuffle moving all pipelines to a subdir called pipe.

common code can sit at that level including a README, as well as the multi inout construction (based on the ReservationStation class).

also at that level would be a specification object that constructs multiple Function Units just like in the diagram from Mitch's book.

some of these will be multiple pipes (the ones that only do one clock) and some will be N > 1 ReservationStations but only the one pipe (DIV in particular)

also then in front of those will be the relevant CompUnits, some of which we need to construct.  ALUCompUnit will be the basis for most of them.

it starts with moving the pipelines to a pipe subdir, the rest needs its iwn bugreport.

Comment 79 Luke Kenneth Casson Leighton 2020-05-18 05:07:13 BST

(In reply to Luke Kenneth Casson Leighton from comment #78)
> michael i am going to do a codeshuffle moving all pipelines to a subdir
> called pipe.

called it "fu" because strictly speaking these pipelines will be managed
by Function Unit front-ends.

Comment 80 Luke Kenneth Casson Leighton 2020-05-18 06:19:45 BST

https://libre-soc.org/openpower/isa/comparefixed/

hmm hmmm the cmpeqb code is stopping ALUIntermediateData from not having cr0 in it.

i suggest instead that we put the 8 bits into self.o.o, then special case cmpeqb to test o.any() and only set cr0[2]

also hmm, cmp* seem to want to store the comparison in a BR indexed register, not cr0.

which sigh tends to suggest we need to move cmp operations to their own pipeline, one that takes the full 32 bits of CR.

alternatives are that we expand ALU cr to the full bit width.

however if going that route i would like to suggest we pass an index (and mask) saying which CR(s) have been modified.

the mask would be sent as write enable lines to the CR regfile.

this would remove the need to pass *in* the full CR and stop an unnecessary read hazaelrd being created.

Comment 81 Luke Kenneth Casson Leighton 2020-05-20 15:01:33 BST

Created attachment 61 [details]
ALU main stage

hilarious.  the actual OP_ADD, OP_CMPEQB, OP_CMP and OP_EXTS operations
are ridiculously small compared to the support infrastructure: regfile,
DMs, everything basically.

i'd heard that routing of data is by far and above the majority of
a CPU: it's kind-of illuminating to actually see how small an ALU
can be.

Comment 82 Luke Kenneth Casson Leighton 2020-05-21 11:31:40 BST

apologies, michael, i've introduced a bug somewhere in ALU:

adde. 5, 6, 7
['RA', 'RB']
reading reg 6
reading reg 7
[SelectableInt(value=0xc97cd5181acb17fc, bits=64), SelectableInt(value=0x9de943b0d85daaf9, bits=64)]
(SelectableInt(value=0x676618c8f328c2f6, bits=64),)
concat 1 SelectableInt(value=0x4, bits=4)
[SelectableInt(value=0x1, bits=1), SelectableInt(value=0x1, bits=1)]
__eq__ SelectableInt(value=0x0, bits=1) SelectableInt(value=0x1, bits=1)
__eq__ SelectableInt(value=0x0, bits=1) SelectableInt(value=0x1, bits=1)
writing reg 5 SelectableInt(value=0x676618c8f328c2f6, bits=64)
expected 676618c8f328c2f6, actual: 676618c8f328c2f6

  File "fu/alu/test/test_pipe_caller.py", line 255, in check_extra_alu_outputs
    self.assertEqual(cr_expected, cr_actual, code)
AssertionError: 4 != 5 : adde. 5, 6, 7

Comment 83 Luke Kenneth Casson Leighton 2020-05-27 17:06:19 BST

um... um... it just occurred to me: if we detect that XER sticky overflow
is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.)
only then do we even request a write to the XER.so register, then because
it is only setting, i do not believe we even need to read the XER.so at all.

as in: i think we can actually *remove* XER.so from ALUInputData, entirely.

this is because we don't actually care what the former value was - at all -
we are merely setting it.  and because that new value is *not* dependent
on the old value - at all - we don't need to waste that read port.

does this sound like a reasonable analysis?

Comment 84 Jacob Lifshay 2020-05-27 18:46:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #83)
> um... um... it just occurred to me: if we detect that XER sticky overflow
> is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.)
> only then do we even request a write to the XER.so register, then because
> it is only setting, i do not believe we even need to read the XER.so at all.
> 
> as in: i think we can actually *remove* XER.so from ALUInputData, entirely.
> 
> this is because we don't actually care what the former value was - at all -
> we are merely setting it.  and because that new value is *not* dependent
> on the old value - at all - we don't need to waste that read port.
> 
> does this sound like a reasonable analysis?

sounds mostly correct to me, as long as the logic for setting CR0's 4th bit still reads the previous value when overflow isn't triggered. SO is a prime candidate for some extra logic to allow multiple instructions to modify it each cycle.

Comment 85 Luke Kenneth Casson Leighton 2020-05-27 18:56:04 BST

(In reply to Jacob Lifshay from comment #84)
> (In reply to Luke Kenneth Casson Leighton from comment #83)
> > um... um... it just occurred to me: if we detect that XER sticky overflow
> > is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.)
> > only then do we even request a write to the XER.so register, then because
> > it is only setting, i do not believe we even need to read the XER.so at all.
> > 
> > as in: i think we can actually *remove* XER.so from ALUInputData, entirely.
> > 
> > this is because we don't actually care what the former value was - at all -
> > we are merely setting it.  and because that new value is *not* dependent
> > on the old value - at all - we don't need to waste that read port.
> > 
> > does this sound like a reasonable analysis?
> 
> sounds mostly correct to me, as long as the logic for setting CR0's 4th bit
> still reads the previous value when overflow isn't triggered.

that's the fly in the ointment.  i started removing SO and found that because it is CR0's 4th bit, and because that 4th bit takes XER.SO as input unconditionally as far as the calculation is concerned, XER.SO cannot be removed from the input data structure for ALU.

now, it *can* be switched off by the *instruction* - Rc=1/0 - but not by the ALU itself.


> SO is a prime
> candidate for some extra logic to allow multiple instructions to modify it
> each cycle.

as SO is being treated like any other register (despite it being only 1 bit wide) it should be picked up by standard WaW (write after write) and unused values thrown away.

Comment 86 Luke Kenneth Casson Leighton 2020-05-27 19:24:09 BST

i think the best we can do is, when Rc=0, disable reading of XER.so

Rc=1 basically makes a hell of a mess of the dependencies.

Comment 87 Jacob Lifshay 2020-05-27 19:33:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #86)
> i think the best we can do is, when Rc=0, disable reading of XER.so
> 
> Rc=1 basically makes a hell of a mess of the dependencies.

sounds good. In my very limited experience with disassembling Power code, compilers don't usually enable Rc or overflow writing.

Comment 88 Luke Kenneth Casson Leighton 2020-05-27 19:38:26 BST

(In reply to Jacob Lifshay from comment #87)

> sounds good. In my very limited experience with disassembling Power code,
> compilers don't usually enable Rc or overflow writing.

that meshes with observations that the spec says performance is adversely affected.

Comment 89 Luke Kenneth Casson Leighton 2020-07-12 23:42:44 BST

this one is reasonably complete, unit tests ok, formal proof ok

Comment 90 Luke Kenneth Casson Leighton 2020-08-18 10:15:10 BST

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=29d0566fd104ab8b9ffd0c460fc7c588a72a340b

turns out we had missed 32-bit sign-extension from checking the "L" bit.

at some point it may be worthwhile replacing the manual decoding of "L" with a separate CSV file entry which sets the op.is_32bit flag.