Bug 527 - SV Predication needs to be defined
Summary: SV Predication needs to be defined
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
  Show dependency treegraph
 
Reported: 2020-11-06 10:36 GMT by Luke Kenneth Casson Leighton
Modified: 2022-09-30 21:01 BST (History)
2 users (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-11-06 10:36:57 GMT
Predication needs to be defined for inclusion in the ISA Standard to be
proposed to OPF.

https://libre-soc.org/openpower/sv/predication/
Comment 1 Luke Kenneth Casson Leighton 2020-11-15 22:42:11 GMT
continuing from https://bugs.libre-soc.org/show_bug.cgi?id=238#c12

PowerISA decision-making is fundamentally designed around Condition registers (with outliers such as the ternary FP select instruction).

cmp, all Rc=1 instructions: all of them create CRs which give zero, nonzero, GE, LT, GT, LE at the time that the result is created.  CR operations allow those to be combined to produce more compkex comparisions. Branch and isel can then be used to make decisions.

the first observation: to try to work "outside" of this 25 year established framework seems... unwise.

the simplest logical way forward would be to simply extend the pre-existing use of CR - with little to no modification - into the parallel (vector) domain, esp. given the use of a multi-issue OOO Engine.

* vectorisation of crand (etc) provides a means to not have to flip to INT regs, saving on instruction count.
* NOT vectorising CR operations at all would actually require special-casing in the decoder!
* vectorisation of cmp produces a vector of CRs: conceptually this is a no-brainer and fits perfectly with the SV paradigm.
* vectorisation of isel likewise fits perfectly.  a vector of CRs chooses elements of a vector of INTs.

by contrast getting CR results out of a vector of 64 CRs and into an integer y4 bitpredicate is actually quite complex: the data routing has to involve up to 64x4 bits of data, selecting every 4th bit and putting them into one int:

* VL=4 cmp ra, rb, cr8 produces 4 CRs
* 4 sets of CRs contain 4 sets of eq/+ve/-ve
* to turn that into a scalar int predicate requires selecting every 4th bit
  (but which bit? bit 0 of every CR? bit 3?)
* performed multiple times if vectorised LE/GT/GE/LT is needed, which because Rc=1 has been destroyed and replaced with a meagre 1-bit CR per element effectively means that the arithmetic op has to be repeated just to get a 2nd bit

modifying instructions to abandon 25 years of Rc=1 and try to force-fit one bit per element into a scalar integer is a nonstarter (which of the 4 bits? bear in mind that 2 bits are used to discern "LE" from "LT")

everything about trying to abandon CRs and shove things into a scalar int reg (then specialcase that int as a new regtype) adds extra work, has problems, unanswered questions and introduces data bottlenecks that get worse and worse the more they are investigated.

by total contrast the use of crand etc which are *specifically* designed for the purpose of manipulating CRs at the bitlevel are dead easy to vectorise and dead easy to use to produce "combined" (complex) vector decisions. 

by using those same CRs as predicates where one bit of each CR is applied (bit 0) per element operation, crand etc. can be used as-is: zero modifications or additions required other than SV vector for-looping.

CRs are so obviously easy to fit onto predication (in the easy case) it's actually difficult to justify any other option.

this just leaves leveraging of operations suxh as cntlz and other bitmanip operations, which is where things come unstuck, no matter what scheme is used.

* standard Cray-style predication requires new opcodes
* RVV style predication (LSB of standard vectors) is very wasteful and also requires new opcodes
* no opcodes exist in PowerISA to select individual CR bits to jam into int scalar regs (or back): mfocr is mask-based (FXM)

even if opcodes are created which transfer every 4th bit of the main CR reg (its full vector version) into a scalar INT (and back) the use of popcount (etc) bitmanip scalar opcodes prevents Vector Chaining
(although to be honest this is probably true in other Vector ISAs as well)

two hypothetical new instructions wouls do this job:

   (VL=5) mfcr4 r1, CR8, bit0
   =>
      r1[0] <= CR8[0]
      r1[1] <= CR9[0]
      r1[2] <= CR10[0]
      r1[3] <= CR11[0]
      r1[4] <= CR12[0]

i.e. up to 64 CR bit0s get copied into the one scalar intreg (and mtcr4 copies them back).

implementation-wise the full 32x CR regfile read/write ports would be used (just like they are for mfcr/mtcr).  it would be a variant of mfocr / mtocr using an alternative meaning for the FXM field.

let us imagine that there is a scenario where complex multi-CR data selection is required.  first element must be selected from a vector within a certain range:

* vector compare greater than 50 produces CR vector 1
* vector compare less than 100 produces CR vector 2
* vector crand (and a few other cr ops) on CRV1 and CRV2 produces CRV3

then to find which is the first of those:

* CRV3 is transferred into a scalar int using mfcr4
* scalar INT "findfirst" locates the 1st bit set to 1 in that intreg
* mtcr4 transfers the scalar int back to CRs, spreading the bits out to be applied as predicate bits (one CR, one element).

not that i actually like the number of instructions needed here, to the extent where i am seriously considering proposing mask operations based on CRs.  which i like even less due to their serious specialisation and that they would still be operating on every 4th bit.

bitmanip "bitselect" operations (shuffle/bperm) hypothetically could help here although even efficient implementations are known to be as expensive as shift/rot.

bottom line is, there really are no good options here.