Bug 533 - design new CR instructions suitable for predication
Summary: design new CR instructions suitable for predication
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL: https://libre-soc.org/openpower/sv/cr...
Depends on:
Blocks: 213
Reported: 2020-11-28 09:16 GMT by Luke Kenneth Casson Leighton
Modified: 2023-04-25 10:13 BST (History)

See Also:
NLnet milestone: NLNet.2019.10.046.Standards
total budget (EUR) for completion of task and all subtasks: 1000
budget (EUR) for this task, excluding subtasks' budget: 1000
parent task for budget allocation: 213
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
[jacob]
amount = 200
submitted = 2022-06-19
paid = 2022-07-21

[lkcl]
amount = 800
submitted = 2022-06-16
paid = 2022-07-21


Description Luke Kenneth Casson Leighton 2020-11-28 09:16:34 GMT
design page: https://libre-soc.org/openpower/sv/cr_int_predication/

initial discussion:
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001335.html

if using CRs, beq/bgt/blt makes decisions based on a single bit analysis (and inversion) of a CR, which frequently requires a crand/cror op prior to get branch conditions that otherwise do not exist in single-instruction form.

the idea here is to create a combined CR analysis operation that takes multiple bits [0-3] of a CR and creates a single bit yes/no, extending what crand/cror (etc) currently do.

whilst useful in its own right, vectorisation of the same gives something that is useful for SV Predication computation.
Comment 1 Luke Kenneth Casson Leighton 2020-11-28 18:54:04 GMT
hmm i am curious.  crand and cror etc. if they are used to target one bit of the same CR as src1 and another bit of the same CR as src2, one is the "eq zero" bit and the other say the "+ve" bit, and the dest is say the OV bit of the same CR, that computes GE in the case of cror and err... GT in the case of crand.

now if those are vectorised that would produce a vector of GT/GE operations, would it not?
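A minimal sketch of the cror case, assuming the Power ISA bit order within a CR field (LT, GT, EQ, SO) and modelling each field as a list of four bits. The helper names are hypothetical; the destination is taken to be the fourth bit of the same field, as suggested above.

```python
# Bit positions within one 4-bit CR field (Power ISA order).
LT, GT, EQ, SO = 0, 1, 2, 3

def cror_same_field(field, bt, ba, bb):
    """Model of cror where source and destination bits share one CR field."""
    out = list(field)
    out[bt] = field[ba] | field[bb]
    return out

def vector_ge(fields):
    """Apply the same cror to every CR field: EQ | GT lands in the SO slot."""
    return [cror_same_field(f, SO, EQ, GT) for f in fields]

# Three compare results: less-than, greater-than, equal.
fields = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
ge = [f[SO] for f in vector_ge(fields)]
```

Here `ge` holds one "greater or equal" bit per element, which is the vector-of-GE behaviour being asked about.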

jacob did you have any other type of CR operations in mind that do not fit into this category?
Comment 2 Jacob Lifshay 2020-11-29 03:11:40 GMT
What I was thinking of is more like an instruction that reads a single 4-bit CR field and writes a 1 or a 0 to an integer register based on if the 4-bit CR field matches some condition. That instruction when vectorized would produce a bit-vector in the integer register with 1 bit per element.

The instruction would have a 4-bit immediate with bits a, b, c, and d. The output of the instruction would be:
(a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_unordered)

(icr if that's the right order for cr bits, but you get the idea).

This allows producing any pattern of ones and zeros assuming the cr is set to the result of an integer or fp compare op. For int compares, we set d = 0. For int/fp compares, the rest of the bits select the output value that should be generated for a particular compare result:
lt -> output = a
eq -> output = b
gt -> output = c
unordered (fp only) -> output = d
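The proposal above can be sketched as follows. This is only an illustration: CR fields are modelled as (lt, eq, gt, unordered) tuples in the order used in this comment (not necessarily the ISA bit order), and placing element i in bit i of the result is an assumption.

```python
def cr_match(cr_field, imm):
    """Scalar form: cr_field and imm are (lt, eq, gt, un) tuples of 0/1 bits."""
    cr_lt, cr_eq, cr_gt, cr_un = cr_field
    a, b, c, d = imm
    return (a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_un)

def cr_match_vector(cr_fields, imm):
    """Vectorised form: one result bit per CR field (element i -> bit i, assumed)."""
    out = 0
    for i, field in enumerate(cr_fields):
        out |= cr_match(field, imm) << i
    return out

LT = (1, 0, 0, 0)
EQ = (0, 1, 0, 0)
GT = (0, 0, 1, 0)
GE_IMM = (0, 1, 1, 0)  # b and c set: output is 1 for eq or gt
mask = cr_match_vector([LT, EQ, GT, GT], GE_IMM)
```

With the "greater or equal" immediate, the first element (lt) yields 0 and the remaining three yield 1.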
Comment 3 Jacob Lifshay 2020-11-29 03:18:14 GMT
we could add another instruction to generate a cr vector from a bit-vector: we could set each cr to gt if the integer bit is 1 and to eq if the integer bit is 0. This should be sufficient for 99.9% of the cases where we just want values in the cr fields, but don't care too much what exact values are put there, the rest can use mtcr or this new instruction combined with cr-logic ops.
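A sketch of that reverse direction, under the same assumptions as before: CR fields are (lt, eq, gt, unordered) tuples, element i comes from bit i of the integer, and the function name is hypothetical.

```python
GT = (0, 0, 1, 0)
EQ = (0, 1, 0, 0)

def bits_to_cr_fields(value, vl):
    """Expand the low vl bits of an integer into a vector of CR fields:
    bit set -> gt, bit clear -> eq."""
    return [GT if (value >> i) & 1 else EQ for i in range(vl)]

fields = bits_to_cr_fields(0b0101, 4)
```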
Comment 4 Luke Kenneth Casson Leighton 2020-11-29 12:22:19 GMT
got it.  i had a feeling this might be what you meant.  ok so SV doesn't have bit-level element as a concept, appropriate as that might be for isel, setb and so on.

from an SV perspective it falls into the category of "weird" because it needs to be considered to be "scalar at the 32 bit CR level" i.e. treat the whole 32-bit CR as "the element" and consequently treat the int src/dest as 64-bit at the "element" level for
scalar purposes.

             op   op   op   op   op   op   op   op
    src1:    CR0  CR1  CR2  CR3  CR4  CR5  CR6  CR7
    src2:    CR0  CR1  CR2  CR3  CR4  CR5  CR6  CR7
    dest:    i[0] i[1] i[2] i[3] i[4] i[5] i[6] i[7]

even weirder is that it will need an implicit elwidth override on ints when SV vectorised, because strictly speaking it's a normal arithmetic op (2 CR src, 1 int dest), and normal arithmetic SV ops are treated as single-predicated and
single elwidth override.

if however it was *only* a 1-CR 1-int op then that is a standard SV twin-predication operation, and standard SV twin predication operations should take dual element-widths (one for src and one for dest)... argh except there's not enough room in SV Prefix, that was part of VBLOCK, argh, argh, argh.

so we'd have to treat this one as an implicit "integer gets a default
automatic elwidth override to 8-bit" operation which, actually, is quite
reasonable if the entire operation is sort-of treated as an
8 bit operation at the scalar level.  urr :)
Comment 5 Luke Kenneth Casson Leighton 2020-11-29 12:43:39 GMT
(In reply to Jacob Lifshay from comment #2)
> What I was thinking of is more like an instruction that reads a single 4-bit
> CR field and writes a 1 or a 0 to an integer register based on if the 4-bit
> CR field matches some condition. That instruction when vectorized would
> produce a bit-vector in the integer register with 1 bit per element.
> 
> The instruction would have a 4-bit immediate with bits a, b, c, and d. The
> output of the instruction would be:
> (a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_unordered)

an enhancement of that - taking cues from branch - is to use a mask-and-eq
(mask-and-xor)

   (mask0 & (a == cr_lt))

the only thing being it requires 8 bits, which would have to be checked
if that's ok.
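A sketch of the mask-and-eq idea with all four bits. Only the first term is given above, so OR-ing the four terms together is an assumption, as are the function name and the list-of-bits representation of the 4+4 bit immediate.

```python
def mask_and_eq(cr_field, mask, mode):
    """cr_field, mask and mode are lists of four 0/1 bits.
    A CR bit contributes when its mask bit is set and it equals its mode bit."""
    result = 0
    for i in range(4):
        result |= mask[i] & int(mode[i] == cr_field[i])
    return result
```

Compared to the plain AND-OR form, this can also test for a CR bit being clear, at the cost of four extra immediate bits.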

bear in mind that CRs are not just used for eq/lt/gt, they're used for
chains of complex bitwise operations.

however... question is: should this new instruction be made more complex,
substituting for multiple crand/or/other ops?

> (icr if that's the right order for cr bits, but you get the idea).

yehyeh.

> This allows producing any pattern of ones and zeros assuming the cr is set
> to the result of an integer or fp compare op. For int compares, we set d =
> 0. For int/fp compares, the rest of the bits select the output value that
> should be generated for a particular compare result:
> lt -> output = a
> eq -> output = b
> gt -> output = c
> unordered (fp only) -> output = d

(In reply to Luke Kenneth Casson Leighton from comment #4)

> SV vectorised, because strictly speaking it's a normal arithmetic op (2 CR
> src, 1 int dest), and normal arithmetic SV ops are treated as

correction: it is only read=1 CR, write=1 int.

what about the other way round?  what about when writing?  mfcr and mfocr
don't do "spreadout" like this.  what about using the same 4-bit (8-bit?)
mask to take 1 bit of int and target multiple bits in CR?

also, argh: it can't be done on the full 32-bit CR because it violates the
whole thing of SV Vector Length (targetting the full 32-bit CR means multiplying
by 8).
Comment 6 Luke Kenneth Casson Leighton 2020-11-29 13:02:20 GMT
i notice that the CR 10-bit XO field column has 8 slots free.
Appendix C Book I-III v3.0B table 20 EXT19 p1156.

although oddly encoded, that would give the 8 possible bits
(mask and eq compares).

    crweird.eq RT, BB, mask
    crweird.lt RT, BB, mask
    crweird.ge RT, BB, mask
    crweird.un RT, BB, mask

and the bits from 11 thru 15 would be "ignore" (i'm looking at p41 v3.0B,
seeing in the tables per instruction how the bitfields are laid out).

then, moving them back would... errr... would it be good to have the same
mask+op 8-bit?  and to do something similar to clear/set?

   mtcrweird.eq RA, BA, mask
   ...
   ...

    CR = CRfile[BA]
    if mask[0]:
        CR[0] = a ^ RA[0]
    if mask[1]:
        CR[1] = b ^ RA[0]
    if mask[2]:
        CR[2] = c ^ RA[0]
    if mask[3]:
        CR[3] = d ^ RA[0]
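The pseudocode above can be written as runnable Python roughly as follows. The function name and argument layout are hypothetical: `int_bit` stands for the single integer source bit, and `abcd` for the four op bits a, b, c, d.

```python
def mtcrweird(cr_field, int_bit, mask, abcd):
    """Return a new 4-bit CR field (list of bits).
    Each masked CR bit becomes the matching a/b/c/d bit XORed with int_bit;
    unmasked bits keep their old value."""
    out = list(cr_field)
    for i in range(4):
        if mask[i]:
            out[i] = abcd[i] ^ int_bit
    return out

new_field = mtcrweird([0, 0, 0, 0], 1, [1, 0, 1, 0], [0, 0, 1, 1])
```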

bit 11 could be used to indicate that the instruction is crweird
or mtcrweird.

thoughts?
Comment 7 Luke Kenneth Casson Leighton 2020-11-29 19:48:05 GMT
hmmm just running with this one, to see where it goes, comment 0 contains link to page with some pseudocode, also included some CR to CR ops as well
Comment 8 Luke Kenneth Casson Leighton 2020-11-30 00:26:12 GMT
jacob it occurred to me that if we either:
* use a different XO set for crops
* use bit 31 to indicate

then under these circumstances bits 6-10 could be taken to be RT (int regfile).

note that the current crops actually use 4 bits of XO as a truth table lookup, using 2 bits each from BA and BB. very smart, very elegant.
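A sketch of that truth-table reading of the existing CR ops. The XO values are the standard ones as far as I recall and are worth checking against the ISA tables; treating bits 5 to 8 of XO (LSB0) as a 4-entry lookup table indexed by the two source bits is the interpretation being illustrated.

```python
# 10-bit XO values of the existing CR logical ops (to be checked against the ISA).
CR_XO = {
    "crand": 257, "cror": 449, "crxor": 193, "crnand": 225,
    "crnor": 33, "creqv": 289, "crandc": 129, "crorc": 417,
}

def crop(name, a, b):
    """Evaluate a CR op by using 4 bits of its XO as a truth table."""
    lut = (CR_XO[name] >> 5) & 0xF   # 4-bit lookup table
    index = (a << 1) | b             # 2-bit index from the source bits
    return (lut >> index) & 1
```

For example `crop("crandc", 1, 0)` gives 1 (a AND NOT b), read directly from the opcode bits.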

then whenever crops are used the last one can be to target an integer.

alternative ideas: allow targetting of both int and CR.
Comment 9 Luke Kenneth Casson Leighton 2020-12-08 19:17:22 GMT
i'm also wondering:

src1 ← VSR[VRA+32]
src2 ← VSR[VRB+32]
mask ← VSR[VRC+32]
VSR[VRT+32] ← (src1 & ~mask) | (src2 & mask)

is there a way that, hypothetically, the new CR operations could be leveraged to achieve that style of mask operation (on integers) but using less bits in instructions to specify the required src regs?
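For reference, the select being described is the following on plain integers. This is just a model of the formula above, with a hypothetical function name and a 64-bit width assumed for illustration.

```python
MASK64 = (1 << 64) - 1

def bit_select(src1, src2, mask):
    """Per bit: take src2 where mask is 1, src1 where mask is 0."""
    return ((src1 & ~mask) | (src2 & mask)) & MASK64

selected = bit_select(0xAAAA, 0x5555, 0x00FF)
```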
Comment 10 Luke Kenneth Casson Leighton 2022-03-23 11:52:17 GMT
in order to match with the pmovmaskb concept i'm recommending that
sv.crweird and sv.crweirder be even less consistent with the usual
SV conformity/paradigm and allow the elwidth to specify that 1, 2, 4 or 8
bits of CR-based results be packed into INTs.  excluding zeroing of
upper bits, and using MSB0 numbering:

    for i in range(VL):
        result = some_function_of(CRfield[i])
        if RT.elwidth == 0b00:
            iregs[RT+i][63] = result # sets LSB to result
        if RT.elwidth == 0b01:
            iregs[RT+i//2][63-(i%2)] = result
        if RT.elwidth == 0b10:
            iregs[RT+i//4][63-(i%4)] = result
        if RT.elwidth == 0b11:
            iregs[RT+i//8][63-(i%8)] = result

in combination with sv.ori./ew=8,16,32 or grevlut it will be possible to
transfer (two-way) between integers and CRs

see grevlut https://libre-soc.org/openpower/sv/bitmanip/
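The packing pseudocode above can be made runnable roughly like this. MSB0 bit 63-(i%n) is the same as LSB0 bit i%n, so the sketch uses ordinary shifts. The function name is hypothetical and, as in the pseudocode, zeroing of the upper bits is left out.

```python
def pack_results(results, elwidth, iregs, rt):
    """Pack one result bit per element into iregs starting at register rt.
    elwidth 0b00/0b01/0b10/0b11 packs 1/2/4/8 bits per register."""
    per_reg = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[elwidth]
    for i, result in enumerate(results):
        reg = rt + i // per_reg
        shift = i % per_reg  # MSB0 bit 63-(i % per_reg) == LSB0 bit (i % per_reg)
        iregs[reg] = (iregs[reg] & ~(1 << shift)) | ((result & 1) << shift)
    return iregs

iregs = pack_results([1, 0, 1, 1, 0, 1], 0b10, [0] * 8, 2)
```

With elwidth 0b10 the first four results land in register 2 and the remaining two in register 3.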
Comment 11 Jacob Lifshay 2022-03-23 15:41:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #10)
> in order to match with the pmovmaskb concept i'm recommending that
> sv.crweird and sv.crweirder be even less consistent with the usual
> SV conformity/paradigm and allow the elwidth to specify that 1, 2, 4 or 8

do note that pmovmaskb actually copies 16 or 32 bits of MSBs to an integer register, rather than 1, 2, 4, or 8. (pmovmaskb technically has a version that does 8-bits, but nobody uses that anymore since the source is a MMX register -- shared with x87 registers in a painful manner)
Comment 12 Luke Kenneth Casson Leighton 2022-05-19 16:09:47 BST
(In reply to Jacob Lifshay from comment #11)
 
> do note that pmovmaskb actually copies 16 or 32 bits of MSBs to an integer
> register, rather than 1, 2, 4, or 8. (pmovmaskb technically has a version
> that does 8-bits, but nobody uses that anymore since the source is a MMX
> register -- shared with x87 registers in a painful manner)

the elwidth override on the source (which would be the CRs) is meaningless
and thus can be "repurposed".

by default however the single-bit result of testing each CR from
a vector of CRs will all go into the *one* integer register:

    for i in range(VL):
         GPR(RT)[i] = crrweird_test(CRfield[BA+i])

this is *very* different from normal SVP64 which would do:

    for i in range(VL):
         GPR(RT+i) = crrweird_test(CRfield[BA+i])
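The difference between the two loops can be seen in a small model. The test function here is only a stand-in (it returns the EQ bit, assuming LT, GT, EQ, SO order), and placing element i at LSB0 bit i of the GPR is an assumption, since the bit numbering is not spelled out above.

```python
def eq_test(field):
    """Stand-in for the CR test: returns the EQ bit of a [lt, gt, eq, so] field."""
    return field[2]

def pack_into_one_gpr(fields, test):
    """Proposed default: every element's result goes into one register."""
    gpr = 0
    for i, field in enumerate(fields):
        gpr |= test(field) << i
    return gpr

def one_result_per_gpr(fields, test):
    """Normal SVP64 behaviour: element i writes register RT+i."""
    return [test(field) for field in fields]

fields = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
packed = pack_into_one_gpr(fields, eq_test)
spread = one_result_per_gpr(fields, eq_test)
```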
Comment 13 Luke Kenneth Casson Leighton 2022-05-19 16:12:37 BST
i've added a couple extra instructions: these can be used to
pack *multiple selected* CR Fields into a GPR as 4-bit quantities.
mfcr and mfocr do 8 CR Fields but they do not take SVP64 Predication
into account.
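As a rough illustration only: the comment does not give the exact semantics, so the sketch below assumes that fields whose predicate bit is set are packed consecutively, lowest nibble first. The function name, the nibble order and the predicate layout are all assumptions, not the instructions' definition.

```python
def pack_selected_cr_fields(cr_fields, predicate):
    """cr_fields: list of 4-bit integers; predicate: bitmask, bit i selects field i.
    Selected fields are packed as consecutive 4-bit quantities (assumed order)."""
    out = 0
    n = 0
    for i, field in enumerate(cr_fields):
        if (predicate >> i) & 1:
            out |= (field & 0xF) << (4 * n)
            n += 1
    return out

packed = pack_selected_cr_fields([0x8, 0x4, 0x2, 0x1], 0b1010)
```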
Comment 14 Luke Kenneth Casson Leighton 2022-05-21 16:00:52 BST
i also removed crweird because its functionality is covered by
mcrfm