design page: https://libre-soc.org/openpower/sv/cr_int_predication/
initial discussion: http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001335.html

if using CRs, beq/bgt/blt makes decisions based on a single bit analysis (and inversion) of a CR, which frequently requires a crand/cror op beforehand to get branch conditions that otherwise do not exist in single-instruction form.

the idea here is to create a combined CR analysis operation that takes multiple bits [0-3] of a CR and creates a single bit yes/no, extending what crand/cror (etc) currently do. whilst useful in its own right, vectorisation of the same gives something that is useful for SV Predication computation.
hmm i am curious. crand and cror etc. if they are used to target one bit of the same CR as src1 and another bit of the same CR as src2, one is the "eq zero" bit and the other say the "+ve" bit, and the dest is say the OV bit of the same CR, that computes GE in the case of cror and err... GT in the case of crand. now if those are vectorised that would produce a vector of GT/GE operations, would it not? jacob did you have any other type of CR operations in mind that do not fit into this category?
What I was thinking of is more like an instruction that reads a single 4-bit CR field and writes a 1 or a 0 to an integer register based on if the 4-bit CR field matches some condition. That instruction when vectorized would produce a bit-vector in the integer register with 1 bit per element.

The instruction would have a 4-bit immediate with bits a, b, c, and d. The output of the instruction would be:

    (a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_unordered)

(icr if that's the right order for cr bits, but you get the idea).

This allows producing any pattern of ones and zeros assuming the cr is set to the result of an integer or fp compare op. For int compares, we set d = 0. For int/fp compares, the rest of the bits select the output value that should be generated for a particular compare result:

    lt -> output = a
    eq -> output = b
    gt -> output = c
    unordered (fp only) -> output = d
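A minimal Python sketch of that proposal, for illustration only. It assumes the CR field is held as a 4-bit integer in the order lt, eq, gt, unordered (most significant bit first), that the immediate uses the same order for a, b, c, d, and that element i lands in bit i (counting from the least significant bit) of the result. The names `cr_to_bit` and `vector_cr_to_mask` are hypothetical, not proposed mnemonics.

```python
# Assumed bit layout of a 4-bit CR field, MSB first: lt, eq, gt, unordered.
CR_LT, CR_EQ, CR_GT, CR_UN = 0b1000, 0b0100, 0b0010, 0b0001

def cr_to_bit(cr_field, imm):
    """Return (a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_unordered).

    imm holds a, b, c, d in the same bit positions as the CR field,
    so the OR of the four AND terms is simply "any common bit set".
    """
    return 1 if (cr_field & imm & 0xF) else 0

def vector_cr_to_mask(cr_fields, imm):
    """Vectorised form: one result bit per CR field, element i -> bit i."""
    mask = 0
    for i, cr in enumerate(cr_fields):
        mask |= cr_to_bit(cr, imm) << i
    return mask

# Example: a "greater than or equal" test selects eq and gt (b = c = 1).
GE_IMM = CR_EQ | CR_GT
ge_mask = vector_cr_to_mask([CR_LT, CR_EQ, CR_GT], GE_IMM)
```

With the three fields above (lt, eq, gt) the second and third elements pass the test, so bits 1 and 2 of `ge_mask` are set.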
we could add another instruction to generate a cr vector from a bit-vector: we could set each cr to gt if the integer bit is 1 and to eq if the integer bit is 0. This should be sufficient for 99.9% of the cases where we just want values in the cr fields, but don't care too much what exact values are put there, the rest can use mtcr or this new instruction combined with cr-logic ops.
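The reverse direction could be sketched in Python as follows. This is only an illustration under assumptions: the 4-bit field layout (lt, eq, gt, unordered, most significant bit first) and the choice that element i is taken from bit i of the integer are not stated in the comment above, and `mask_to_cr_vector` is a hypothetical name.

```python
# Assumed 4-bit CR field encodings (MSB first: lt, eq, gt, unordered).
CR_EQ = 0b0100
CR_GT = 0b0010

def mask_to_cr_vector(mask, vl):
    """Expand an integer bit-vector into vl CR fields.

    A 1 bit produces a field with only gt set, a 0 bit produces a
    field with only eq set, as suggested in the comment above.
    """
    return [CR_GT if (mask >> i) & 1 else CR_EQ for i in range(vl)]

cr_vector = mask_to_cr_vector(0b101, 3)
```

Here elements 0 and 2 become gt and element 1 becomes eq.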
got it. i had a feeling this might be what you meant.

ok so SV doesn't have bit-level element as a concept, appropriate as that might be for isel, setb and so on. from an SV perspective it falls into the category of "weird" because it needs to be considered to be "scalar at the 32 bit CR level" i.e. treat the whole 32-bit CR as "the element" and consequently treat the int src/dest as 64-bit at the "element" level for scalar purposes.

          op   op   op   op   op   op   op   op
    src1: CR0  CR1  CR2  CR3  CR4  CR5  CR6  CR7
    src2: CR0  CR1  CR2  CR3  CR4  CR5  CR6  CR7
    dest: i[0] i[1] i[2] i[3] i[4] i[5] i[6] i[7]

even weirder is that it will need an implicit elwidth override on ints when SV vectorised, because strictly speaking it's a normal arithmetic op (2 CR src, 1 int dest), and normal arithmetic SV ops are treated as single-predicated and single elwidth override.

if however it was *only* a 1-CR 1-int op then that is a standard SV twin-predication operation, and standard SV twin predication operations should take dual element-widths (one for src and one for dest)... argh except there's not enough room in SV Prefix, that was part of VBLOCK, argh, argh, argh.

so we'd have to treat this one as an implicit "integer gets a default automatic elwidth override to 8-bit" operation which, actually, is quite reasonable if the entire operation is sort-of treated as an 8 bit operation at the scalar level.

urr :)
(In reply to Jacob Lifshay from comment #2)
> What I was thinking of is more like an instruction that reads a single 4-bit
> CR field and writes a 1 or a 0 to an integer register based on if the 4-bit
> CR field matches some condition. That instruction when vectorized would
> produce a bit-vector in the integer register with 1 bit per element.
>
> The instruction would have a 4-bit immediate with bits a, b, c, and d. The
> output of the instruction would be:
> (a & cr_lt) | (b & cr_eq) | (c & cr_gt) | (d & cr_unordered)

an enhancement of that - taking cues from branch - is to use a mask-and-eq (mask-and-xor)

    (mask0 & (a == cr_lt))

the only thing being it requires 8 bits, which would have to be checked if that's ok.

bear in mind that CRs are not just used for eq/lt/gt, they're used for chains of complex bitwise operations.

however... question is: should this new instruction be made more complex, substituting for multiple crand/or/other ops?

> (icr if that's the right order for cr bits, but you get the idea).

yehyeh.

> This allows producing any pattern of ones and zeros assuming the cr is set
> to the result of an integer or fp compare op. For int compares, we set d =
> 0. For int/fp compares, the rest of the bits select the output value that
> should be generated for a particular compare result:
> lt -> output = a
> eq -> output = b
> gt -> output = c
> unordered (fp only) -> output = d

(In reply to Luke Kenneth Casson Leighton from comment #4)
> SV vectorised, because strictly speaking it's a normal arithmetic op (2 CR
> src, 1 int dest), and normal arithmetic SV ops are treated as

correction: it is only read=1 CR, write=1 int.

what about the other way round? what about when writing? mfcr and mfocr don't do "spreadout" like this. what about using the same 4-bit (8-bit?) mask to take 1 bit of int and target multiple bits in CR?
also, argh: it can't be done on the full 32-bit CR because it violates the whole thing of SV Vector Length (targeting the full 32-bit CR means multiplying by 8).
i notice that the CR 10-bit XO field column has 8 slots free. Appendix C Book I-III v3.0B table 20 EXT19 p1156. although oddly encoded, that would give the 8 possible bits (mask and eq compares).

    crweird.eq RT, BB, mask
    crweird.lt RT, BB, mask
    crweird.ge RT, BB, mask
    crweird.un RT, BB, mask

and the bits from 11 thru 15 would be "ignore" (i'm looking at p41 v3.0B, seeing in the tables per instruction how the bitfields are laid out).

then, moving them back would... errr... would it be good to have the same mask+op 8-bit? and to do something similar to clear/set?

    mtcrweird.eq RA, BA, mask
    ...
    ...
    CR = CRfile[BA]
    if mask[0]: CR[0] = a ^ RT[0]
    if mask[1]: CR[1] = b ^ RT[0]
    if mask[2]: CR[2] = c ^ RT[0]
    if mask[3]: CR[3] = d ^ RT[0]

bit 11 could be used to indicate that the instruction is crweird or mtcrweird.

thoughts?
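The mtcrweird pseudocode could be modelled in Python roughly as below. This is a sketch under assumptions: a, b, c, d are taken to be four per-bit "invert" flags (the other half of the 8-bit mask+op idea), the CR field is modelled as a list of four bits indexed as in the pseudocode, and the single source bit stands in for RT[0]. The function name and argument layout are hypothetical.

```python
def mtcrweird(cr_bits, src_bit, mask, invert):
    """Write one integer bit into selected bits of a 4-bit CR field.

    cr_bits : list of 4 bits, the current contents of CRfile[BA]
    src_bit : the single source bit (RT[0] in the pseudocode)
    mask    : list of 4 bits, 1 = overwrite that CR bit
    invert  : list of 4 bits standing in for a, b, c, d (assumed meaning:
              XOR the source bit before writing)
    """
    out = list(cr_bits)  # unmasked bits keep their old value
    for i in range(4):
        if mask[i]:
            out[i] = invert[i] ^ src_bit
    return out

# Overwrite bits 0 and 1 only; bit 1 receives the inverted source bit.
new_cr = mtcrweird([0, 0, 0, 0], 1, mask=[1, 1, 0, 0], invert=[0, 1, 0, 0])
```

Bits with a zero mask are left untouched, which is what makes this a clear/set style operation rather than a full field overwrite.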
hmmm just running with this one, to see where it goes, comment 0 contains link to page with some pseudocode, also included some CR to CR ops as well
jacob it occurred to me that if we either:

* use a different XO set for crops
* use bit 31 to indicate

then under these circumstances bits 6-10 could be taken to be RT (int regfile).

note that the current crops actually use 4 bits of XO as a truth table lookup, using 2 bits each from BA and BB. very smart, very elegant.

then whenever crops are used the last one can be to target an integer.

alternative ideas: allow targeting of both int and CR.
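The truth-table observation can be checked with a short Python sketch. The XO values below are the 10-bit extended opcodes of the existing CR logical instructions as recalled from the Power ISA opcode tables, so they should be verified against the spec before relying on them. The sketch assumes the 2-bit index is formed from the one CR bit selected by BA (high) and the one CR bit selected by BB (low), and that the table sits in bits 5 to 8 of XO counting from the least significant bit. The helper name `crop` is hypothetical.

```python
# 10-bit XO values of the CR logical ops (recalled from the Power ISA,
# verify against the opcode tables).
CROP_XO = {
    "crand": 257, "cror": 449, "crxor": 193, "crnand": 225,
    "crnor": 33, "creqv": 289, "crandc": 129, "crorc": 417,
}

def crop(name, a, b):
    """Evaluate a CR logical op on bits a and b via its XO truth table."""
    truth_table = (CROP_XO[name] >> 5) & 0xF  # 4 table bits inside XO
    index = (a << 1) | b                      # 2-bit lookup index
    return (truth_table >> index) & 1

# Compare the lookup against the plain boolean definitions.
reference = {
    "crand": lambda a, b: a & b,
    "cror": lambda a, b: a | b,
    "crxor": lambda a, b: a ^ b,
    "crnand": lambda a, b: 1 - (a & b),
    "crnor": lambda a, b: 1 - (a | b),
    "creqv": lambda a, b: 1 - (a ^ b),
    "crandc": lambda a, b: a & (1 - b),
    "crorc": lambda a, b: a | (1 - b),
}
all_match = all(
    crop(name, a, b) == fn(a, b)
    for name, fn in reference.items()
    for a in (0, 1) for b in (0, 1)
)
```

If `all_match` holds, the four XO bits really do act as a lookup table, which is what would let a variant of these ops redirect its result to an integer register without changing the logic.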
i'm also wondering:

    src1 ← VSR[VRA+32]
    src2 ← VSR[VRB+32]
    mask ← VSR[VRC+32]
    VSR[VRT+32] ← (src1 & ~mask) | (src2 & mask)

is there a way that, hypothetically, the new CR operations could be leveraged to achieve that style of mask operation (on integers) but using less bits in instructions to specify the required src regs?
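For reference, the select operation quoted above behaves like the following Python sketch on plain integers. The 128-bit default width is an assumption matching a VSR, and `bit_select` is a hypothetical name used only for illustration.

```python
def bit_select(src1, src2, mask, width=128):
    """Per-bit select: take src2 where mask is 1, src1 where mask is 0."""
    full = (1 << width) - 1  # clamp to the register width
    return ((src1 & ~mask) | (src2 & mask)) & full

# Low byte comes from src2, high byte from src1.
selected = bit_select(0xAAAA, 0x5555, 0x00FF)
```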
in order to match with the pmovmaskb concept i'm recommending that sv.crweird and sv.crweirder be even less consistent with the usual SV conformity/paradigm and allow the elwidth to specify that 1, 2, 4 or 8 bits of CR-based results be packed into INTs. excluding zeroing of upper bits, and using MSB0 numbering:

    for i in range(VL):
        result = some_function_of(CRfield[i])
        if RT.elwidth == 0b00:
            iregs[RT+i][63] = result # sets LSB to result
        if RT.elwidth == 0b01:
            iregs[RT+i//2][63-(i%2)] = result
        if RT.elwidth == 0b10:
            iregs[RT+i//4][63-(i%4)] = result
        if RT.elwidth == 0b11:
            iregs[RT+i//8][63-(i%8)] = result

in combination with sv.ori./ew=8,16,32 or grevlut it will be possible to transfer (two-way) between integers and CRs

see grevlut https://libre-soc.org/openpower/sv/bitmanip/
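The packing rule above can be written as a small runnable Python model. It is a sketch, not the spec: registers are modelled as a list of Python integers starting at RT, the per-CR test is assumed to have been done already (the input is a list of single-bit results), and MSB0 bit 63-(i%n) is translated to conventional bit i%n counting from the least significant bit. `pack_cr_results` is a hypothetical name.

```python
def pack_cr_results(results, elwidth, nregs=8):
    """Pack single-bit results into integer registers.

    elwidth 0b00, 0b01, 0b10, 0b11 packs 1, 2, 4 or 8 results per
    register. Upper bits are simply left at zero in this model.
    """
    per_reg = 1 << elwidth          # results packed into each register
    regs = [0] * nregs              # regs[0] models iregs[RT]
    for i, result in enumerate(results):
        # MSB0 bit 63-(i % per_reg) is LSB0 bit (i % per_reg)
        regs[i // per_reg] |= (result & 1) << (i % per_reg)
    return regs

one_per_reg = pack_cr_results([1, 0, 1, 1], 0b00, nregs=4)
two_per_reg = pack_cr_results([1, 0, 1, 1], 0b01, nregs=4)
eight_per_reg = pack_cr_results([1, 0, 1, 1, 0, 0, 0, 1], 0b11, nregs=1)
```

With elwidth 0b00 each register receives one result in its LSB; with 0b01 the same four results occupy only two registers; with 0b11 eight results fit in the low byte of a single register.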
(In reply to Luke Kenneth Casson Leighton from comment #10)
> in order to match with the pmovmaskb concept i'm recommending that
> sv.crweird and sv.crweirder be even less consistent with the usual
> SV conformity/paradigm and allow the elwidth to specify that 1, 2, 4 or 8

do note that pmovmaskb actually copies 16 or 32 bits of MSBs to an integer register, rather than 1, 2, 4, or 8. (pmovmaskb technically has a version that does 8-bits, but nobody uses that anymore since the source is a MMX register -- shared with x87 registers in a painful manner)
(In reply to Jacob Lifshay from comment #11)
> do note that pmovmaskb actually copies 16 or 32 bits of MSBs to an integer
> register, rather than 1, 2, 4, or 8. (pmovmaskb technically has a version
> that does 8-bits, but nobody uses that anymore since the source is a MMX
> register -- shared with x87 registers in a painful manner)

the elwidth override on the source (which would be the CRs) is meaningless and thus can be "repurposed". by default however the single-bit result of testing each CR from a vector of CRs will all go into the *one* integer register:

    for i in range(VL):
        GPR(RT)[i] = crrweird_test(CRfield[BA+i])

this is *very* different from normal SVP64 which would do:

    for i in range(VL):
        GPR(RT+i) = crrweird_test(CRfield[BA+i])
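The difference between the two loops can be made concrete with a Python sketch. Everything here is an assumption for illustration: `eq_bit_test` is a hypothetical stand-in for crrweird_test that just extracts the eq bit, CR fields are 4-bit integers in lt, eq, gt, unordered order (most significant bit first), and bit i of GPR(RT) is counted from the least significant bit.

```python
def eq_bit_test(cr_field):
    """Hypothetical stand-in for crrweird_test: returns the eq bit."""
    return (cr_field >> 2) & 1

def packed_into_one_gpr(cr_fields, ba, vl, test):
    """Proposed behaviour: every result lands in one integer register."""
    gpr = 0
    for i in range(vl):
        gpr |= test(cr_fields[ba + i]) << i
    return gpr

def one_gpr_per_element(cr_fields, ba, vl, test):
    """Normal SVP64 behaviour: element i writes register RT+i."""
    return [test(cr_fields[ba + i]) for i in range(vl)]

crs = [0b0100, 0b0010, 0b0100, 0b1000]  # eq, gt, eq, lt
packed = packed_into_one_gpr(crs, 0, 4, eq_bit_test)
spread = one_gpr_per_element(crs, 0, 4, eq_bit_test)
```

The packed form yields a single predicate-style mask, while the normal form consumes VL registers to hold the same information.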
i've added a couple extra instructions: these can be used to pack *multiple selected* CR Fields into a GPR as 4-bit quantities. mfcr and mfocr do 8 CR Fields but they do not take SVP64 Predication into account.
i also removed crweird because its functionality is covered by mcrfm