https://bugs.libre-soc.org/show_bug.cgi?id=1092#c18

the idea is that any sequence of immediates may be repeated *as a vector*
in the instruction stream. some serious caveats are needed as this is
right at the Decode Phase.

task list:

* DONE: basic concept (comment #6)
* TODO: specification writeup (Andrey to find if there is a suitable bug
  under the OPF-ISA-WG Grant, "stage 2" qns&feedback)
* TODO: add the relevant SVSTATE bit to openpower-isa
* TODO: add sv.bc to svbranch.mdwn (which implicitly sets the relevant
  SVSTATE bit), see link below
* TODO: add "redirect" similar to sv.bcctr
* TODO: add a read-cache of immediates when sv.bc is called: a
  "special-case" in ISACaller to read all 16-bit blocks into a python list
* TODO: when the relevant SVSTATE bit is set, get the immediate inside the
  loop: if index=0 the Suffix-immediate is used, otherwise it comes
  from the python list. read BEFORE the REMAP engine is active; sort out
  REMAP later. WARNING: THIS REQUIRES AN OVERRIDE PARAMETER IN
  PowerDecode2. an additional "const" source is needed where the D DS SI
  UI constant sources from DecodeAImm are all OVERRIDDEN if a bitflag is
  set HIGH. ISACaller needs to "yield eq" that flag from SVSTATE.
* TODO: when the Loop ends (PC moves on, in HF; reset in VF) the relevant
  SVSTATE bit is CLEARED. see the persistent SVSTATE bit for an inkling
  of how that works.
* TODO: Horizontal-First Unit test
* TODO: Vertical-First Unit test

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/svbranch.mdwn;hb=HEAD
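the read-cache and index-0-from-suffix TODOs above can be sketched in
python (the project language). names here (read_mem16, the helper
functions) are illustrative stand-ins, *not* the actual ISACaller API:

```python
# Sketch of the immediate read-cache idea: on sv.bc, read the 16-bit
# blocks that follow the 64-bit prefixed instruction into a python
# list, then select per-element inside the loop.
def read_imm_cache(read_mem16, pc, maxvl):
    """read the (MAXVL-1) extra 16-bit blocks following the 8-byte
    prefixed instruction at pc.  read_mem16 is a hypothetical
    16-bit-granular memory read callback."""
    return [read_mem16(pc + 8 + 2 * i) for i in range(maxvl - 1)]

def get_element_imm(index, suffix_imm, imm_cache):
    """element 0 uses the immediate already encoded in the suffix;
    subsequent elements come from the cached list."""
    if index == 0:
        return suffix_imm
    return imm_cache[index - 1]
```

this is only a model of the data flow described above; the real
implementation additionally has to override the D/DS/SI/UI sources in
PowerDecode2 via the bitflag.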
(In reply to Luke Kenneth Casson Leighton from
https://bugs.libre-soc.org/show_bug.cgi?id=1092#c18 )

> https://libre-soc.org/openpower/sv/normal/
>
> | 0-1    |  2  | 3   4   | description                      |
> | ------ | --- |---------|----------------------------------|
> | 0 0    |  0  | dz sz   | simple mode                      |
> | 0 0    |  1  | RG 0    | scalar reduce mode (mapreduce)   |
> | 0 0    |  1  | /  1    | reserved                         |
> | 1 0    |  N  | dz sz   | sat mode: N=0/1 u/s              |
> | VLi 1  | inv | CR-bit  | Rc=1: ffirst CR sel              |
> | VLi 1  | inv | zz RC1  | Rc=0: ffirst z/nonz              |
>
> there's room in that (just) for a bit that says
> "immediates are Vectorised".

ok: using mode[4] says "immediates are Vectorised". and given that no
immediates are greater than 16-bit, it is possible to just ignore
elwidth overrides here.

that still leaves mode[3] for some sort of decision, or another mode in
future. best to have mode[3:4]=0b01 and reserve the other combinations.

> the neat thing about this is that even sv.addi can load
> an array of immediates.

oris as well. the *entire pattern* of 5 instructions to load a 64-bit
immediate can be Vectorised:

    addi rt,0,#nnnn
    addis rt,0,#nnnn
    rldicl rt, 32
    ori rt,0,#nnnn
    oris rt,0,#nnnn

becomes:

    sv.addi/vi rt,0,#nnnn ...

for sv.fli/vi (see https://bugs.libre-soc.org/show_bug.cgi?id=1092#c19)
it is a simple matter of inlining multiple instructions. i would
strongly suggest though *not* trying to piss about with binutils syntax,
just have ".long 0xnnnnnnnn" after it.

> as we discussed yesterday it requires an "Unconditional
> Branch" effect, and i'd recommend it be on MAXVL not VL.
> also to round-up to the nearest 4-bytes.

MAXVL allows dynamic code to *change the number of immediates loaded*,
which is extremely important given that this is compile-time static.

> if RM."immediate-mode":
>
>     NIA = CIA + CEIL(MAXVL * sizeof(immediate), 4)

forgot that of course the 1st immediate is already in the instruction.
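the 5-instruction load pattern boils down to shift-and-or arithmetic. a
python sketch of the data flow, one 16-bit chunk per instruction; note
this ignores addi's sign-extension (it assumes the second chunk is below
0x8000 - real codegen compensates for that) and uses the accumulating
register forms (rt as both source and destination) where needed:

```python
MASK64 = (1 << 64) - 1

def load_64bit_constant(hi_hi, hi_lo, lo_hi, lo_lo):
    # data-flow model of the 5-instruction 64-bit constant load,
    # taking the four 16-bit chunks most-significant first.
    rt = hi_lo                               # addi   rt, 0, hi_lo
    rt = (rt + (hi_hi << 16)) & MASK64       # addis  rt, rt, hi_hi
    rt = ((rt << 32) | (rt >> 32)) & MASK64  # rldicl rt, rt, 32 (rotate)
    rt |= lo_lo                              # ori    rt, rt, lo_lo
    rt |= lo_hi << 16                        # oris   rt, rt, lo_hi
    return rt
```

with the proposed sv.addi/vi the same four chunks would instead sit as a
vector of immediates directly in the instruction stream.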
and set hardcoded to 16:

    if RM.normal."vector-immediate-mode":
        NIA = CIA + CEIL((MAXVL-1) * 16, 4)

i think not having to read elwidth here will be *really* important,
otherwise the Decoder has a hell of a job. it is going to be tough
enough to identify that this is an "Unconditional Branch": not only does
the suffix need identifying (to find out if it is RM.normal) but the
"vector-immediate-mode" itself needs decoding...

... oh and *then* the new PC can be calculated. to that end this is
DEFINITELY something that goes into the "Upper" Compliancy Levels.

> jacob you mentioned during the meeting that this would
> be "slow" i.e. dependent on Architectural State (SVSTATE),
> if someone modified SVSTATE with mtspr then things get
> slow:

this is *already* in the spec. it's just that some implementations will
have caches of where SVSTATE was set, but others will not.
(In reply to Luke Kenneth Casson Leighton from comment #1)
> NIA = CIA + CEIL((MAXVL-1) * 16, 4)

that's wrong. if you want a vector of 16-bit immediates you want:

    extra_immediates = MAXVL - 1
    extra_bytes = extra_immediates * 2
    extra_words = -((-extra_bytes) // 4)  # ceil div
    NIA = CIA + 8 + 4 * extra_words

that said, if we're going to have vector immediates at all, they should
also account for subvl. it would also be very nice to account for elwid
too, since you have to decode a bunch of the prefix and suffix anyway
(see note below):

    XLEN = max(sw, dw)  # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit
    bytes = (XLEN // 8) * subvl * MAXVL
    bytes -= 2  # first 2 bytes potentially encoded in 32-bit insn
    bytes += 8  # sv prefix + 32-bit insn
    bytes = (bytes + 3) & ~3  # round up to words
    NIA = CIA + bytes

this allows trivially loading a vector of 64-bit immediates in one
instruction -- better than any fli proposed so far.

decoding note: i expect cpus to generally treat a vector load-immediate
as an unconditional jump -- this means they don't try to read
instructions after the load-immediate in the same cycle as the
load-immediate, so taking longer to decode the length is perfectly fine:
the instruction-start prefix-sum tree can just treat it as a 64-bit
instruction and clear out all attempted instructions after it, leaving
time for the full decoder to decode the correct length and redirect
fetch to the correct location.

it can be treated like a jump, so the next instruction address gets
added to the branch target buffer and the next-pc logic will
speculatively fetch from the correct location on the next cycle, even
before decoding has started.

demo program:

    0x08: ori r10, r10, 5
    0x0c: and r10, r11, r10
    0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0]  # vector immediate
    0x20: sv.add/w=32 *r3, *r3, *r3
    ...
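the elwid-aware byte-count above can be transcribed directly as a python
function and checked against the demo program (the function name is
illustrative; the f16/bf16 TODO is left unaddressed, as in the
original):

```python
def vector_imm_nia(CIA, MAXVL, subvl=1, sw=64, dw=64):
    """next instruction address after a vector-immediate instruction,
    per the elwid-aware byte-count: sw/dw are source/dest element
    widths in bits."""
    XLEN = max(sw, dw)
    nbytes = (XLEN // 8) * subvl * MAXVL
    nbytes -= 2               # first 2 bytes potentially encoded in 32-bit insn
    nbytes += 8               # sv prefix + 32-bit insn
    nbytes = (nbytes + 3) & ~3  # round up to words
    return CIA + nbytes
```

checking against the demo: the sv.addi/w=32 at 0x10 with two 32-bit
immediates gives NIA 0x20, matching the address of the following sv.add.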
demo pipeline with 64-bit fetch width:

| cycle | next-pc/BTB  | fetch     | len-decode/tree | decode                |
|-------|--------------|-----------|-----------------|-----------------------|
| 0     | 0x08         |           |                 |                       |
| 1     | 0x10         | 0x08 ori  |                 |                       |
|       | BTB has 0x20 | 0x0c and  |                 |                       |
| 2     | 0x20         | 0x10 sv.  | 0x08 ori len=4  |                       |
|       |              | 0x14 addi | 0x0c and len=4  |                       |
| 3     | 0x28         | 0x20 sv.  | 0x10 sv. len=8  | 0x08 ori NIA=0x0c     |
|       |              | 0x24 add  | 0x14 addi len=4 | 0x0c and NIA=0x10     |
| 4     | ...          | ...       | 0x20 sv. len=8  | 0x10 sv.addi NIA=0x20 |
|       |              |           | 0x24 add len=4  |                       |
| 5     | ...          | ...       | ...             | 0x20 sv.add NIA=0x28  |
(In reply to Jacob Lifshay from comment #2)
> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.

actually, that doesn't work, since it needs to fetch all the immediate
bytes too, not just skip over them.
(In reply to Jacob Lifshay from comment #2)
> that said, if we're going to have vector immediates at all, they should also
> account for subvl.

*click* yes of course. hmmm, is that in 2 bits that are not affected by
anything? (as in: can it be picked up from the prefix and *guaranteed*
to be easy to get? yes it can!)

> it would also be very nice to account for elwid too since
> you have to decode a bunch of the prefix and suffix anyway (see note below):

that's exactly why i'm *not* recommending elwidth be part of it,
precisely because it requires the prefix-suffix combination. the only
"decode" needed is "is this instruction Arithmetic type", and a special
(small) PowerDecode can be used (in our implementation). it's dead-easy
to do: just put filters onto a PowerDecode instance (in this case "unit"
from the CSV files) and voila, "if unit == ALU/LOGICAL" gives you the
information needed (right at the critical point), at which point some
*further* decode of elwidth would be needed. subvl on the other hand is
dead-easy.

in the interests of not hampering max CPU speed i'm quite happy for
space to be "wasted" here. which would make it:

    extra_immediates = MAXVL - 1
    extra_bytes = extra_immediates * 2 * subvl
    extra_words = -((-extra_bytes) // 4)  # ceil div
    NIA = CIA + 8 + 4 * extra_words

> XLEN = max(sw, dw) # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit

exactly the kind of nightmare that will punish multi-issue :) that would
need *even more* decoding - now detecting FP-Arithmetic as separate from
Logical/ALU - just to work out how to get the elwidth. there is enough
dependency already between prefix and suffix, making both me (and the
ISA WG) jittery.

> this allows trivially loading a vector of 64-bit immediates in one
> instruction -- better than any fli proposed so far.

remember that it is absolutely critical that the scalar instructions
remain orthogonal to "when Vectorised". we *cannot* have "if Scalar then
instruction means something else, if Vector it's different".
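the fixed-16-bit variant above, as a python function (name illustrative)
so the arithmetic can be sanity-checked:

```python
def vector_imm_nia_16bit(CIA, MAXVL, subvl=1):
    """NIA when every extra immediate is a fixed 16 bits regardless of
    elwidth.  the constant 8 is the sv prefix plus the 32-bit suffix;
    the first immediate is already inside the suffix."""
    extra_immediates = MAXVL - 1
    extra_bytes = extra_immediates * 2 * subvl
    extra_words = -((-extra_bytes) // 4)  # ceil div
    return CIA + 8 + 4 * extra_words
```

note that this deliberately ignores elwidth: for MAXVL=1 it degenerates
to a plain 8-byte prefixed instruction, and each group of up to two
extra 16-bit immediates costs one more word.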
this is a HARD inviolate rule (where sv.bc is seriously pushing our luck
on that one, and the only way i can think to sell it is that bc is
"subset" behaviour of sv.bc).

what you are suggesting would involve *different* pseudocode for *all*
impacted instructions:

    if "sv.addi" then
        do something different from addi
        because the immediate is bigger
    else
        the v3.0/v3.1 addi pseudocode here

i just went through that with Paul; it took ages to work out what i
meant: https://bugs.libre-soc.org/show_bug.cgi?id=1056#c69

changes to the "meaning" of an instruction - requiring "if sv.xxx else" -
i am putting my foot down HARD on that. the consequence is that with
neither operands nor scalar-instruction being any different, the *same
Decode* may be used for both scalar and vector, and that's absolutely
critical when it comes to high-performance (speculative) parallel
decode.

> decoding note: i expect cpus to generally treat a vector load immediate as a
> unconditional jump --

yyep. which means it has to be REALLY simple.

> this means they don't try to read instructions after
> the load immediate in the same cycle as the load immediate so taking longer
> to decode the length is perfectly fine since the instruction start
> prefix-sum tree can just treat it as a 64-bit instruction and clear out all
> attempted instructions after it, leaving time for the full decoder to decode
> the correct length and redirect fetch to the correct location.

a neat trick: parallel speculative decode can be carried out, and if
constants are misinterpreted as "instructions" they are skipped-over
once they are identified. everything can be done in parallel and the
actual decision deferred.

if fetch is in 64-byte aligned chunks and performs some parallel decode,
then we have to be careful crossing that boundary. Power v3.1 public
Book I Section 1.6 p11:

    Prefixed instructions do not cross 64-byte instruction address
    boundaries.
    When a prefixed instruction crosses a 64-byte boundary, the system
    alignment error handler is invoked.

so, assuming that the vector-immediate instruction stays within such
blocks, if VL is ever greater than 31 we're "in trouble", and at *that*
point the scheme you describe would be activated, but otherwise some
speculative decode is perfectly fine.

> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.

awesome :)

> demo program:
> 0x08: ori r10, r10, 5
> 0x0c: and r10, r11, r10
> 0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0] # vector immediate

sv.addi/vi/w=32 ...

> 0x20: sv.add/w=32 *r3, *r3, *r3
> ...

this is a really nice illustrative example. it needs expanding so that
it's clear that the 2nd immediate is at 0x18. and setvl MAXVL=4? 8?

    0x10: PO9 sv prefix
    0x14: addi (prefixed, contains SI=0x12345678)
    0x18: 0x0000_0000 0x0000_0000 0x0000_0000 0x9abcdef0
    0x1c: 0x0000_0000 0x0000_0000 0x0000_0000 0x0000_0000
    0x20: PO9 sv prefix
    0x24: add *r3, *r3, *r3

ok so there's room there for up to 8 additional constants, so MAXVL=9 is
perfectly fine (in this example).

> demo pipeline with 64-bit fetch width

IBM has been doing 64 *byte* wide decode!! (likely a clean aligned chunk
of an L1 cache line.) fetch-width will be mad: the POWER9 and POWER10
have those OpenCAPI 25-gigabit SERDES (quantity: lots!)
(In reply to Jacob Lifshay from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > it can be treated like a jump so the next instruction address gets added to
> > the branch target buffer and the next-pc logic will speculatively fetch from
> > the correct location on the next cycle, even before decoding has started.
>
> actually, that doesn't work since it needs to fetch all the immediate bytes
> too, not just skip over them.

hence why i'm scared of introducing extra dependencies, increasing
gate-latency, in parallel-decode (such as elwidth, and *especially*
BF16/FP16 differences). and why i said that this should only be in
"advanced" implementations, and/or in lower-clock-rate machines such as
3D GPUs, where a ~1.5 ghz max clock rate will not be a problem.
i have some ideas on this to reduce decode complexity.

1. overload sv.b (yes, really, sv.b) so that it sets a bit in SVSTATE:
   "if the next instruction is prefixed then get the immediate-constants
   from CIA-MAXVL//2"

2. if that bit is set (it should be transient) then the next instruction
   clears it.

a small unconditional relative-branch would leave a hole in the
instruction stream, and the nice thing is, it's an *actual* branch, not
an implicit one.

widths of immediates are fixed at 16-bit and any unused bits are
IGNORED. no compression or attempts at variable-length are made.
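the transient-bit behaviour in ideas 1 and 2 can be sketched in python.
everything here is illustrative (the actual SVSTATE bit position and
names are not yet specified):

```python
# Minimal model of the transient "vector-immediate" SVSTATE bit:
# sv.b sets it, and the very next instruction consumes and clears it.
class SVState:
    def __init__(self):
        self.vec_imm = False   # hypothetical transient bit

def exec_sv_b(svstate):
    # idea 1: the overloaded sv.b sets the transient bit
    svstate.vec_imm = True

def exec_next_insn(svstate):
    # idea 2: whatever instruction follows consumes the bit and
    # clears it, so exactly one instruction is affected
    use_vector_immediates = svstate.vec_imm
    svstate.vec_imm = False
    return use_vector_immediates
```

the point of the model is that the bit never survives past one
instruction, which is what keeps the decode dependency short.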
jacob=1000 markos=300 vantosh=300 andrey=200 lkcl=200

these budgets are not balanced, given the amount of input i will likely
need into this task. the budget was set at EUR 2,000 at the time for one
person (me) to complete this task relatively quickly.

there is to be no further discussion of potential ideas or alternatives:
this is an *implementation and delivery* task, best done by one person
and reviewed by others. i have updated the TODO list in comment #0; you
can see how much is involved. best done as a branch.

do NOT attempt to suggest how REMAP should work here. that is a BIG task
on its own that needs significant thought AT THE RIGHT TIME, which is
NOT NOW.
(In reply to Luke Kenneth Casson Leighton from comment #1)
> sv.addi/vi rt,0,#nnnn

NOT to be implemented this way. see comment #6, which **IS** the
implementation.

the bit that is added to SVSTATE is to be added similar to the
"persistence" bit. the FOLLOWING INSTRUCTION, and the following
instruction ONLY, shall have its immediates extended to a Vector,
implicitly. there shall be **NO** changes to Normal/LDST/CRops Modes.
application of REMAP to Vector-Immediates shall be prohibited at this
time (but allowed as usual on the registers used).
on second thoughts there *may* be a place to squeeze in documentation
budget from the OPF-ISA-WG Grant, leaving the "implementation" side here
as outlined at the top of comment #17.