Bug 314 - Create POWER9 Condition Register pipeline
Summary: Create POWER9 Condition Register pipeline
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Mac OS
: --- enhancement
Assignee: Michael Nolan
URL:
Depends on: 356 332 333 360
Blocks: 383
  Show dependency treegraph
 
Reported: 2020-05-15 16:33 BST by Luke Kenneth Casson Leighton
Modified: 2021-11-29 21:41 GMT (History)
2 users (show)

See Also:
NLnet milestone: NLNet.2019.10.043.Wishbone
total budget (EUR) for completion of task and all subtasks: 300
budget (EUR) for this task, excluding subtasks' budget: 300
parent task for budget allocation: 383
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl = { amount = 100, paid = 2020-08-21 } donated = { amount = 200, submitted = 2021-04-24, paid = 2021-05-01 }


Attachments
table of mtocrf mask, reg results from the talos server (7.30 KB, text/plain)
2020-05-23 21:30 BST, Michael Nolan
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-05-15 16:33:01 BST
we need a Condition Register pipeline

* https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl
* https://libre-soc.org/openpower/isa/condition/
* https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/fu/cr;hb=HEAD

* OP_MFCR                          done
* OP_CROP (which includes MCRF)    done
* OP_MTCRF                         done
* OP_ISEL                          done
* OP_SETB                          done
Comment 1 Michael Nolan 2020-05-15 22:53:18 BST
Are we planning to do crand, cror, crxor, and friends? They're not implemented in microwatt as far as I can tell
Comment 2 Luke Kenneth Casson Leighton 2020-05-15 23:05:12 BST
(In reply to Michael Nolan from comment #1)
> Are we planning to do crand, cror, crxor, and friends? They're not
> implemented in microwatt as far as I can tell

https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl
line 719.

oink.

am i reading this correctly: they actually use the cr operand as if
it was a *table*??  moo?

so the instruction field 0-15 is treated as a 4-bit number (a 4-bit truth-table)

you then take ba and bb and create a 2-bit index.

then use that 2-bit address to get one of the bits of the 4-bit
instruction field.

wooow.

i'm deeply impressed.

um... answer... yes they do implement crand, cror etc :)

this must be how FPGAs do things, and is probably why LUTs are 4-bit.
Comment 3 Michael Nolan 2020-05-16 19:00:05 BST
The hell? IBM separately defines mtcrf and mtocrf, and reserves a bit to differentiate between the two. Except there's no functional difference between the two that I can tell, and the assembler will even assemble some instances of mtcrf as mtocrf. 


Basically mtcrf fxm, reg uses fxm to construct a 32 bit mask, where each bit in fxm corresponds to 4 bits in the mask (so mtcrf 0xf1 would correspond to a mask of 0xffff000f). It then uses this mask to choose between bits of CR and the input register, and writes them back to CR

mtocrf fxm, reg assumes fxm is a one hot encoded index into the bits of reg. It grabs those bits from reg and inserts them into cr, leaving the rest untouched. If fxm is not onehot, then the result is undefined.

Why these are different, I don't know.
Comment 4 Luke Kenneth Casson Leighton 2020-05-16 19:27:04 BST
btw apologies i did a code-munge on cr/main_stage.py - it did work prior to
a rebase-conflict... am investigating, will keep in touch if i spot the
discrepancy.
Comment 5 Luke Kenneth Casson Leighton 2020-05-16 19:28:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> btw apologies i did a code-munge on cr/main_stage.py - it did work prior to
> a rebase-conflict... am investigating, will keep in touch if i spot the
> discrepancy.

oh: i think it's just that you added the unit test before adding the code.
Comment 6 Luke Kenneth Casson Leighton 2020-05-16 19:38:04 BST
(In reply to Michael Nolan from comment #3)
> The hell? IBM separately defines mtcrf and mtocrf, and reserves a bit to
> differentiate between the two. Except there's no functional difference
> between the two that I can tell, and the assembler will even assemble some
> instances of mtcrf as mtocrf. 

bizarre.

* mtcrf FXM,RS

    mask <- ([FXM[0]]*4 || [FXM[1]]*4 || [FXM[2]]*4 || [FXM[3]]*4 ||
             [FXM[4]]*4 || [FXM[5]]*4 || [FXM[6]]*4 || [FXM[7]]*4)
    CR <- ((RS)[32:63] & mask) | (CR & ¬mask)

* mtocrf FXM,RS

    count <- 0
    do i = 0 to 7
      if FXM[i] = 1 then
        n <- i
        count <- count + 1
    if count = 1 then
        CR[4*n+32:4*n+35] <- (RS)[4*n+32:4*n+35]
    else CR <- undefined


> Basically mtcrf fxm, reg uses fxm to construct a 32 bit mask, where each bit
> in fxm corresponds to 4 bits in the mask (so mtcrf 0xf1 would correspond to
> a mask of 0xffff000f). It then uses this mask to choose between bits of CR
> and the input register, and writes them back to CR
 
> mtocrf fxm, reg assumes fxm is a one hot encoded index into the bits of reg.
> It grabs those bits from reg and inserts them into cr, leaving the rest
> untouched. If fxm is not onehot, then the result is undefined.
> 
> Why these are different, I don't know.

clueless.  suggest to use nmutil PriorityPicker on FXM (see nmutil) because it
can be used do the one-hot detection very efficiently.

if the output is not exactly equal to the input from a PriorityPicker
(and is not zero) then this is the condition for detection that FXM
is one-hot.

p = PriorityPicker
p.i.eq(FXM)

is_onehot.eq(p.i == p.o & p.i != 0)
Comment 7 Luke Kenneth Casson Leighton 2020-05-16 19:38:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)

sorry:

p = PriorityPicker(8)
comb += p.i.eq(FXM)
comb += is_onehot.eq(p.i == p.o & p.i != 0)
Comment 8 Michael Nolan 2020-05-16 19:42:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)

> 
> clueless.  suggest to use nmutil PriorityPicker on FXM (see nmutil) because
> it
> can be used do the one-hot detection very efficiently.
> 
> if the output is not exactly equal to the input from a PriorityPicker
> (and is not zero) then this is the condition for detection that FXM
> is one-hot.
> 
> p = PriorityPicker
> p.i.eq(FXM)
> 
> is_onehot.eq(p.i == p.o & p.i != 0)

For mtocrf I don't think this is necessary. If the onehot condition is not satisfied, it lists the result as being undefined. This should mean that the current behavior is acceptable

For mfocrf, the situation is similar, the extra bits of the output register are undefined, unless the input is onehot, in which case they are set to 0. Setting them to 0 for all bits not selected by the mask should be acceptable here too
Comment 9 Luke Kenneth Casson Leighton 2020-05-16 19:46:01 BST
ugh, ugh.

i have a horrible feeling about "undefined" behaviour.  we need to properly
investigate and find out exactly what happens during that "undefined"
behaviour by running native assembler on the TALOS-II workstation, with
a full range of FXM bits (only 256 instructions), and a random set of
CR and RS (times 256).

this will tell us exactly what happens.

my concern is that if we rely on the "undefined" behaviour to say "ok we
choose to make mtcrf and mtocrf the same", some assembly writers may
have chosen *specifically* to assume that CR is set to ZERO (or some
other value) after noting that that behaviour occurs on POWER9...

you follow where that goes?

i really do not like "undefined" behaviour in specifications.  it's always
a problem.
Comment 10 Luke Kenneth Casson Leighton 2020-05-16 19:47:23 BST
(In reply to Michael Nolan from comment #8)

> For mtocrf I don't think this is necessary. If the onehot condition is not
> satisfied, it lists the result as being undefined. This should mean that the
> current behavior is acceptable
> 
> For mfocrf, the situation is similar, the extra bits of the output register
> are undefined, unless the input is onehot, in which case they are set to 0.
> Setting them to 0 for all bits not selected by the mask should be acceptable
> here too

except (cross-over) this may bite us due to incompatibility with the "undefined"
behaviour that assembly writers use these for - in *direct* but de-facto violation
of the standard.

royal nuisance.  let's not hyper-prioritise it though.
Comment 11 Michael Nolan 2020-05-16 20:30:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
> ugh, ugh.
> 
> i have a horrible feeling about "undefined" behaviour.  we need to properly
> investigate and find out exactly what happens during that "undefined"
> behaviour by running native assembler on the TALOS-II workstation, with
> a full range of FXM bits (only 256 instructions), and a random set of
> CR and RS (times 256).
> 
> this will tell us exactly what happens.


Could you send me the login for the talos server so I can test this?
Comment 12 Michael Nolan 2020-05-16 21:07:36 BST
# NOTE: we really should be doing the field decoding which
# selects which bits of CR are to be read / written, back in the
# decoder / insn-isue, have both self.i.cr and self.o.cr
# be broken down into 4-bit-wide "registers", with their
# own "Register File" (indexed by bt, ba and bb),
# exactly how INT regs are done (by RA, RB, RS and RT)
# however we are pushed for time so do it as *one* register.

We can do this for crand and friends, as well as mcrf, but it doesn't work for mtcrf or mfcr because they can read or write the whole cr register
Comment 13 Michael Nolan 2020-05-16 21:30:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
> ugh, ugh.
> 
> i have a horrible feeling about "undefined" behaviour.  we need to properly
> investigate and find out exactly what happens during that "undefined"
> behaviour by running native assembler on the TALOS-II workstation, with
> a full range of FXM bits (only 256 instructions), and a random set of
> CR and RS (times 256).
> 
> this will tell us exactly what happens.

Qemu seems to do nothing when it's not onehot
	lis 1, 0xdead
	ori 1, 1, 0xbeef # r1 = 0xdeadbeef
	lis 2, 0x1234 
	ori 2, 2, 0x5678 # r2 = 0x12345678
	mtcr 1           # cr = 0xdeadbeef
	mtcrf 0xf, 2     
	mfcr 3           # r3 should be 0xdead5678
	mtcr 1           # cr = 0xdeadbeef again
        # mtocrf 0xf, 2
	.byte 0x7c, 0x50, 0xf1, 0x20
	mfcr 4           # who knows

(gdb) p/x $r3
$1 = 0xdead5678
(gdb) p/x $r4
$2 = 0xdeadbeef
Comment 14 Luke Kenneth Casson Leighton 2020-05-16 23:06:29 BST
(In reply to Michael Nolan from comment #12)

> We can do this for crand and friends, as well as mcrf, but it doesn't work
> for mtcrf or mfcr because they can read or write the whole cr register

ah good catch.  updated.  yes you're right.  this would be where the "multi-bit
hazard" detection would come into play.  the bit-vector would be

0   1   ... 7   8
CR0 CR1 ... CR7 {ALLCRs}

for crand the decoding of BA (etc) would set the corresponding CR register
hazard *and* on vector bit 8 {ALLCRs}... but not the other bits.

thus it would be possible to have multiple parallel overlapping crand(s) (etc)
as long as the read/write dependencies on CR0-CR7 do not clash

however mtcrf would set bit 8 {ALLCRs} *AND* it would set *ALL* other bits 0 thru 7.

as it's an optimisation it's off the TODO list for now.
Comment 15 Luke Kenneth Casson Leighton 2020-05-16 23:14:54 BST
(In reply to Michael Nolan from comment #11)

> Could you send me the login for the talos server so I can test this?

oh - err, yes, sorry.  except my laptop segfaulted and although i have a
mosh-screen session running that will still be open, i can't remember the
server's IP address!  it doesn't have a domain name associated with it yet.
will find it tomorrow - 23:00 hrs here now.

(In reply to Michael Nolan from comment #13)

> Qemu seems to do nothing when it's not onehot

interesting.  i think i have a hypothesis that fits: it's possible
that an early version of POWERPC had one version, they realised they
needed the other, but couldn't really retire the old one as it was
too commonly used.
Comment 16 Luke Kenneth Casson Leighton 2020-05-23 02:21:18 BST
hmmmm there are not many good choices for putting this instruction in.  it requires RA RC INT regs, and full CR's BI field as input.

for output it needs INT RS

CR is the closest least disruptive candidate, if a 2nd INT reg is added: RB.



    when OP_ISEL =>
	crbit := to_integer(unsigned(insn_bc(e_in.insn)));
	if e_in.cr(31-crbit) = '1' then
	    result := a_in;
	else
            result := b_in;
        end if;
	result_en := '1';
Comment 17 Luke Kenneth Casson Leighton 2020-05-23 02:25:48 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/pipe_data.py;h=2894144fd814444a7bdb75ae6c695af8519ae084;hb=HEAD

can add b to CRInputData.

https://libre-soc.org/openpower/isa/fixedtrap/

what on earth isel is doing in the trap instruction group is beyond me.  could be i just put the pseudocode in the wrong file
Comment 18 Luke Kenneth Casson Leighton 2020-05-23 11:36:06 BST
i'm so excited, i'm so happy, i added an instruction (isel).
Comment 19 Michael Nolan 2020-05-23 21:30:18 BST
Created attachment 62 [details]
table of mtocrf mask, reg results from the talos server

So I decided to test mtocrf with non-onehot masks on the talos server, and it appears to me that it uses the most significant bit of the mask and ignores the rest (except for mask=0xf, dunno what's up there). Very interesting.
Comment 20 Luke Kenneth Casson Leighton 2020-05-23 22:01:34 BST
hm might be good to add the test assembly code to the repo, put it somewhere around soc.fu.cr
Comment 21 Luke Kenneth Casson Leighton 2020-05-24 14:11:48 BST
unit test is not testing reg output (MFCRF, ISEL)
Comment 22 Luke Kenneth Casson Leighton 2020-05-24 21:47:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #21)
> unit test is not testing reg output (MFCRF, ISEL)

now it is...

commit c2a94b3e640c6c3a8e33834f8d073c795ce66815 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun May 24 21:46:56 2020 +0100

    add test of reg output, for MFCRF and ISEL