we completely missed that bc uses CR fields, and thus could be SVP64 Vectorised. http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html
* design pretty much done
* added SVP64 24-bit RM decoder
* added sv_analysis to include bc and bclr SVP64 EXTRA modes
commit 4ad3c373c7fbd826447de0c64b558eeb2b53d174 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun Aug 8 22:06:45 2021 +0100

    add start of SVP64ASM encoder for sv.bc and sv.bclr
    TODO, sv.bca, sv.bclrl etc.
svstep mode looks like it is too "CISC-like". although remarkably similar to CTR auto-decrement, svstep auto-increment involves predicate skipping as well as REMAP. realistically this is too much, unfortunately.

currently implementing sv.bc in ISACaller. this is the first time that CR fields have been involved. it would have been much better to have started with sv.crand (etc) before trying to do sv.bc, because then the infrastructure for read/write of CR Fields would already be in place.

realistically, the pseudocode needs to change from CR[BI+32] to CRF(BI[0:2])[BI[3:4]]
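to illustrate the pseudocode change just mentioned, here is a minimal Python sketch (not ISACaller code) showing that the flat CR[BI+32] lookup and the per-field CRF(BI[0:2])[BI[3:4]] lookup select the same bit. the function names are hypothetical, and the CR is modelled as a plain 32-bit integer with MSB0 numbering.

```python
def cr_bit_flat(cr, bi):
    """CR[BI+32]: bit BI of the 32-bit CR, counted from the MSB (MSB0)."""
    return (cr >> (31 - bi)) & 1


def crf(cr, n):
    """CR field n (0..7) as a 4-bit value; field 0 is the most significant."""
    return (cr >> (4 * (7 - n))) & 0xF


def cr_bit_by_field(cr, bi):
    """CRF(BI[0:2])[BI[3:4]]: select the field, then the bit inside it."""
    field = crf(cr, bi >> 2)   # BI[0:2] is the upper three bits of BI
    idx = bi & 0b11            # BI[3:4] is the lower two bits of BI
    return (field >> (3 - idx)) & 1   # MSB0 index within the 4-bit field
```

the per-field form is the one that matters for SVP64, because it is the field number (not the flat bit index) that can be stepped by the Vector loop.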
i've added a mode LRu (Link Register Update) which only gets LR updated if the branch condition succeeds. actually there is also the option to only set LR if the branch condition *fails*.

however the reason for today's thought is inspired from "why does LR even exist in the first place", and it is down to making function calls and returns from function calls possible.

in core.py there were three pieces of info identified:

* PC
* MSR
* SVSTATE

these three are what get saved/restored on context-switch, which i mention because it helps mentally get a handle on the execution context. MSR is global and not involved in functions. PC gets saved in LR on function calls... but what about SVSTATE? that needs saving as well.

therefore, the idea is:

* to add an extra SPR, LSVR (Link SVState Register)
* to extend svp64-branches to include save/swap of SVSTATE into/from LSVR
* to utilise more bits from RM (some of RM.EXTRA) to add an equivalent of the LK field, for LSVR, and a corresponding variant of LRu.

if not included then any function call or loop involving SVP64 will have to perform explicit backups of SVSTATE (mtspr, mfspr) which will get very tedious very quickly. this starts to matter a lot more in VerticalFirst Mode than in Horizontal because there is the potential for calling functions from within VerticalFirst loops.
Pseudo-code:

    if AA then NIA <-iea EXTS(LI || 0b00)
    else       NIA <-iea CIA + EXTS(LI || 0b00)
    if LK then LR <-iea CIA + 4

becomes:

    if AA then NIA <-iea EXTS(LI || 0b00)
    else       NIA <-iea CIA + EXTS(LI || 0b00)
    if LK then LR <-iea CIA + 4
    if LSV then LSVR <- SVSTATE
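as a rough executable model of the extended unconditional branch, the following Python sketch mirrors the pseudocode line by line. it is only an illustration: the `State` class, `exts26` and `branch` are hypothetical names, registers are plain integers, and `lsv` stands in for whichever RM bit ends up carrying the LSV flag.

```python
MASK64 = (1 << 64) - 1


class State:
    """Hypothetical register bundle: CIA, LR, SVSTATE and the proposed LSVR."""
    def __init__(self, cia=0, lr=0, svstate=0, lsvr=0):
        self.cia, self.lr, self.svstate, self.lsvr = cia, lr, svstate, lsvr


def exts26(li):
    """Sign-extend the 26-bit value LI || 0b00 (LI is the 24-bit field)."""
    v = (li << 2) & 0x3FFFFFF
    return v - (1 << 26) if v & (1 << 25) else v


def branch(st, li, aa=False, lk=False, lsv=False):
    """Return NIA, updating LR and LSVR as in the pseudocode above."""
    target = exts26(li)
    nia = target if aa else st.cia + target
    if lk:
        st.lr = (st.cia + 4) & MASK64   # usual link: return address
    if lsv:
        st.lsvr = st.svstate            # proposed: link SVSTATE as well
    return nia & MASK64
```

with both `lk` and `lsv` set, a call saves the return address and the caller's SVSTATE in one instruction, which is the point of the proposal.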
bclr pseudocode:

    if (mode_is_64bit) then M <- 0
    else M <- 32
    if ¬BO[2] then CTR <- CTR - 1
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
    if LK then LR <-iea CIA + 4

has to become a little more sophisticated:

    lr_ok <- LK
    lsv_ok <- LSVR
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LSV then SVSTATE.next <- LSVR
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.LSVu then lsv_ok <- ¬lsv_ok
    if lr_ok then LR <-iea CIA + 4
    if lsv_ok then LSVR <- SVSTATE

* SVSTATE as well as LR is being captured in LR and LSVR
* SVSTATE as well as LR has the option to be updated if the condition test succeeds
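a Python sketch of the extended bclr may help when checking the LRu/LSVu inversion logic. it is a model under stated assumptions, not the ISACaller implementation: `sv_bclr` and the dict keys are hypothetical, the nesting under "if ctr_ok & cond_ok" is assumed from context, `lsv_ok` is assumed to start from an LSV mode bit (the pseudocode writes LSVR there), and the CR bit is passed in already selected.

```python
MASK64 = (1 << 64) - 1


def sv_bclr(st, bo, cond_bit, lk=False, lsv=False, lru=False, lsvu=False,
            mode_is_64bit=True):
    """st: dict with keys cia, lr, ctr, svstate, lsvr. Returns NIA."""
    def bo_bit(i):
        return (bo >> (4 - i)) & 1      # BO is 5 bits, MSB0 numbering

    if not bo_bit(2):
        st["ctr"] = (st["ctr"] - 1) & MASK64
    ctr = st["ctr"] if mode_is_64bit else st["ctr"] & 0xFFFFFFFF
    ctr_ok = bool(bo_bit(2)) or ((ctr != 0) != bool(bo_bit(3)))
    cond_ok = bool(bo_bit(0)) or (cond_bit == bo_bit(1))

    lr_ok, lsv_ok = lk, lsv             # assumption: lsv_ok starts from LSV
    nia = (st["cia"] + 4) & MASK64      # fall through when not taken
    svstate_next = st["svstate"]
    if ctr_ok and cond_ok:
        nia = st["lr"] & ~0b11 & MASK64     # LR[0:61] || 0b00
        if lsv:
            svstate_next = st["lsvr"]       # SVSTATE.next <- LSVR
        if lru:
            lr_ok = not lr_ok
        if lsvu:
            lsv_ok = not lsv_ok
    if lr_ok:
        st["lr"] = (st["cia"] + 4) & MASK64
    if lsv_ok:
        st["lsvr"] = st["svstate"]          # captures the *old* SVSTATE
    st["svstate"] = svstate_next
    return nia
```

note that in this reading LK=1 with LRu=1 updates LR only when the branch is *not* taken, while LK=0 with LRu=1 updates it only when the branch *is* taken.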
imho adding sv context switching within a sv vertical loop is too complex...
(In reply to Jacob Lifshay from comment #7)
> imho adding sv context switching within a sv vertical loop is too complex.

my feeling here is that it's going to be attempted whether we like it or not. therefore, it's down to us to explore that space, in advance, and to make it a smooth rather than a rough ride.

if a lot of mfspr/mtspr instructions on SVSTATE have to be used, that's a rough ride. mfspr SVSTATE bclr mtspr SVSTATE repeat repeat repeat repeat.

the concept comes from the ZOLC "stack", and with the low-level HDL blocks being literally identical in ZOLC as they are for Matrix REMAP it's conceptually not that hard.

full ZOLC on the other hand: *that's* hard, and i definitely don't want to go that route, it'll take months to understand (just ZOLC alone).

saving of SVSTATE for stack/nesting purposes during a Vertical-First function, in order to perform a localised Horizontal-First piece of work, was always inevitable.

normally it's "only" the PC that's saved on-stack, because that's the primary context. now SVSTATE is part of that primary context.

it's not that hard to conceptualise, especially given that SVSTATE is effectively a sub-program-counter: this has been in the documentation and in presentations on SV for over two years as an introductory statement.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #7)
> > imho adding sv context switching within a sv vertical loop is too complex.
>
> my feeling here it's going to be attempted whether we like it or not.
> therefore, it's down to us to explore that space, in advance, and to
> make it a smooth rather than a rough ride.
>
> if a lot of mfspr/mtspr instructions on SVSTATE have to be used, that's
> a rough ride. mfspr SVSTATE bclr mtspr SVSTATE repeat repeat repeat repeat.

wait, so you mean a function call inside a sv vertical-first loop? hmm...

anyway:
imho we should limit SV vertical mode to just single loops, otherwise we'll gain more features and end up with something that takes 500k gates just in the sv decoder...we need to stop adding every feature we can think of to SV otherwise x86 will look at us and be proud that their spec is *only* 5000 pages... (idk the real number, but you get the idea)
(In reply to Jacob Lifshay from comment #9)
> wait, so you mean a function call inside a sv vertical-first loop? hmm...

uh-huhn :) at least, if not a public one (ABI compliant) a static one. bare-minimum just above "some stuff went on the stack"

> anyway:
> imho we should limit SV vertical mode to just single loops,

ironically and hilariously doing so is more costly. this comes down to the strict insistence of being barely above scalar and on having context-switching rules.

MyISA 66000 VVM on the other hand, written by Mitch Alsup, that *is* a single ZOLC Vertical-First explicit concept. it's an explicit instruction (a pair), only LD/STs may be Vectorised, the LOOP instruction tags the counter, loop invariants, and the registers to be used for LD/ST.

once in-flight the hw is permitted to analyse the RS allocations and perform automatic SIMD merging of multiple Vertically-allocated loops. that can only be safely done thanks to the tagging.

now *that's* complicated (and awesome, despite the limitations) :)

SVP64 is a completely different paradigm, with the absolute bare minimum in hardware. Vertical-First in SVP64 is literally one bit that says "no" to incrementing srcstep/dststep, and lets just one instruction get issued with GPR(RA+srcstep) rather than GPR(RA).

context-switching, whether it be a function or it be an interrupt, is literally a matter of saving PC and SVSTATE. (a bit more if REMAP is engaged, obviously, but that's software's problem to solve)

> otherwise we'll
> gain more features and end up with something that takes 500k gates just in
> the sv decoder...

surprisingly adding LSVR is not that gate-hungry. it is completely independent, and already the branch instructions have a different RM Decode path, but the bit that's equivalent to LK is not tied *to* LK, it's independent.

> we need to stop adding every feature we can think of to SV
> otherwise x86 will look at us and be proud that their spec is *only* 5000
> pages... (idk the real number, but you get the idea)

120 for everything including 32-bit instructions from bitmanip etc.

most of the effort involved here is stopping brain-melt. Cray (Horizontal-First) *nobody* does function calls, it's too much. Vertical-First is so close to scalar it's actually harder to prevent fn calls, and would need explicit hardware and pretty much a redesign to prevent.

now, ZOLC, that *is* going to be a whole new bundle-o-fun. ZOLC is definitely on the other side of a "large funding" fence.
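the difference between the two modes can be shown with a toy issue-order trace in Python. this is a deliberately simplified sketch: the function names are hypothetical, and predication, REMAP and a separate dststep are all ignored. only the order in which (instruction, element) pairs are issued is modelled.

```python
def horizontal_first(program, vl):
    """Each instruction runs over all VL elements before the next one issues."""
    return [(insn, step) for insn in program for step in range(vl)]


def vertical_first(program, vl):
    """Each instruction issues once at the current srcstep; srcstep only
    advances at the end of the loop body (standing in for svstep + branch)."""
    trace = []
    srcstep = 0
    while srcstep < vl:
        for insn in program:
            trace.append((insn, srcstep))   # i.e. operates on GPR(RA + srcstep)
        srcstep += 1
    return trace
```

because the Vertical-First body is ordinary scalar-looking code between steps, nothing stops one of those instructions from being a function call, which is why saving SVSTATE alongside PC comes up.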
(In reply to Luke Kenneth Casson Leighton from comment #10)
> surprisingly adding LSVR is not that gate-hungry. it is completely
> independent, and already the branch instructions have a different RM
> Decode path,

there are 4 different decode paths where the 24-bit RM field is interpreted differently:

1) arithmetic
2) CRs
3) LDST
4) branches.

yes it is annoying that partial decode of the suffix is required in order to begin decoding the prefix, but like in v3.1 prefix more bits would be required to provide the type identification (MTRR, 8LS etc) and we would need to take yet another major opcode to do it (26 bits needed)