Parallel prefix sum (also known as `scan`) is a commonly useful operation. Since there's space in the REMAP modes (mode 0b11) now that some were removed, I think we should add it.

Some of the applications of parallel prefix sum (https://en.wikipedia.org/wiki/Prefix_sum#Applications):

* Counting sort/Radix sort (the most common sorting algorithm used on GPUs)
* Decoding Gray code
* Parallel polynomial interpolation
* Computing factorials (prefix-product of 1, 2, 3, 4, ...)

Task list:

* WIP: jacob: add SVSHAPE.skip-driven update of remap_preduce_yield.py.
  done, except as a new function iterate_indices2; the only thing left is to
  merge into the original function after deciding whether we want to delete
  `steps` reversing or not. See comment #8.
  DECISION MADE: NO. REVERSING STAYS. END OF DISCUSSION.
* DONE: jacob: add a *really* short demo2() bare-minimum test
* DONE: jacob: add unit tests verifying that remap_preduce_yield.py's
  prefix-sum matches the known-working code in nmutil.prefix_sum.
* DONE: jacob: integrate into svshape pseudocode
* DONE: luke: update wiki
* DONE: lkcl: add into ISACaller svshape.py
* DONE: jacob: add some unit tests into test_caller_svp64_preduce.py.
  these should be "showcaseable", i.e. if presented at a conference they
  should be easy to understand.
* TODO: jacob: put pseudocode into the spec/RFC
  https://libre-soc.org/openpower/sv/remap/appendix/
* TODO: jacob: restore reversing mode.
We can derive the algorithm from the already-known-working nmutil prefix_sum.py, which implements both the work-efficient and the low-depth algorithms:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35
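For reference, here is a minimal standalone sketch of the work-efficient (Brent-Kung-style) algorithm in plain Python. This is only an illustration of the idea, not the actual nmutil.prefix_sum code (which yields an operation schedule rather than mutating a list):

    # work-efficient prefix-sum sketch; op defaults to addition
    def work_efficient_prefix_sum(items, op=lambda a, b: a + b):
        items = list(items)
        n = len(items)
        # first loop ("up-sweep"): on its own this is the parallel
        # reduction tree, with the full reduction landing in items[n-1]
        dist = 1
        while dist < n:
            for i in range(dist * 2 - 1, n, dist * 2):
                items[i] = op(items[i - dist], items[i])
            dist *= 2
        # second loop ("down-sweep"): fills in the remaining prefixes,
        # turning the reduction into a full scan
        dist //= 2
        while dist >= 1:
            for i in range(dist * 3 - 1, n, dist * 2):
                items[i] = op(items[i - dist], items[i])
            dist //= 2
        return items

    assert work_efficient_prefix_sum([1, 2, 3, 4, 5]) == [1, 3, 6, 10, 15]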
how is it different from parallel prefix reduction bug #864?

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;hb=HEAD
(In reply to Luke Kenneth Casson Leighton from comment #2)
> how is it different from parallel prefix reduction bug #864?
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/remap_preduce_yield.py;hb=HEAD

parallel reduction is a reduction, i.e. it has only one final output; the
rest of the output vector is intermediate results that aren't a prefix-sum.
parallel reduction is basically half of a prefix-sum operation.
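a tiny worked example of the difference (plain Python, nothing SVP64-specific):

    from functools import reduce
    from itertools import accumulate

    data = [1, 2, 3, 4]
    # reduction: one final output
    assert reduce(lambda a, b: a + b, data) == 10
    # prefix-sum (scan): every partial sum is an output
    assert list(accumulate(data)) == [1, 3, 6, 10]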
https://libre-soc.org/openpower/sv/remap/

3.1 Parallel Reduction Mode

Creates the Schedules for Parallel Tree Reduction.

submode=0b00 selects the left operand index
submode=0b01 selects the right operand index

it's *just* about possible to squeeze in alternative modes without going
overboard - svshape's 3rd dimension (zdim) could in theory be used to
indicate an alternative reduction algorithm.
basically the parallel reduction algorithm we use is similar to only the
first loop in the work-efficient prefix-sum; the second loop is missing:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71
(In reply to Jacob Lifshay from comment #5)
> basically the parallel reduction algorithm we use is similar to only the
> first loop in the work-efficient prefix-sum; the second loop is missing
> https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;
> h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71

for a visual diagram, see:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l164
(In reply to Jacob Lifshay from comment #3)
> parallel reduction is a reduction, i.e. it has only one final output; the
> rest of the output vector is intermediate results that aren't a prefix-sum.
> parallel reduction is basically half of a prefix-sum operation.

ahh, the second half of the tree in the image on the wikipedia page.
okaaay. right. yes, it's doable. it's pretty seriously late in the day
but i like it. ok.

so if SVyd is free... which it is...

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=c350b3cf86fadcfdd22663c7119249e00328ff4a;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l281

then at lines 301 and 305: "if SVyd == 1 then set submode=0b10 / 0b11
instead of submode=0b00 / 0b01 in SVSHAPE0 and SVSHAPE1 respectively".

then in svshape.py:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/svshape.py;h=8b4533755a214d243b509ca20994a7b3ff651cdf;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l166

line 170 needs sub-mode detection... or does it... no.
iterate_preduce_indices() is where sub-mode would be looked at and the
second half done.

do you want to have a go at that? keep it *real* simple - one step at a
time. can you add support for:

* SVSHAPE.skip == 0b10 -> submode 10 (LH argument) parallel-prefix
* SVSHAPE.skip == 0b11 -> submode 11 (RH argument) parallel-prefix

at line 30:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=a8093803da46ef81a1254b2b2c5b00f153568578;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l34

you *should* just be able to add the 2nd half onto the end (line 43
onwards) and do

    "if SVSHAPE.skip in [0b00, 0b01]: return"

and that should basically be it. add a *real* simple 2nd "demo()"
function; please *don't* go massively overboard, or make significant
changes to this code, ok? one step at a time, remap_preduce_yield.py first.
as mentioned on irc, please add a todo list in the top comment.

(In reply to Luke Kenneth Casson Leighton from comment #7)
> iterate_preduce_indices() is where sub-mode would be looked at and the
> second half done.
>
> do you want to have a go at that? keep it *real* simple

yeah, i can do that, probably later today
(In reply to Luke Kenneth Casson Leighton from comment #7)
> and you *should* just be able to add the 2nd half onto the end (line 43
> onwards) and do
>
> "if SVSHAPE.skip in [0b00, 0b01]: return"

turns out that reduce and prefix-sum are different enough that merging
their loops would be pretty confusing, so I just stuck all the prefix-sum
code in a big if. if/when we really need to merge them, we can refactor it
then. I think the pseudo-code will also benefit from having prefix-sum
completely separate from reduction, since it makes the loops much easier to
understand.

> and that should basically be it. add a *real* simple 2nd "demo()"
> function; please *don't* go massively overboard, or make significant
> changes to this code, ok?

while I was reading the existing code, I noticed you have:

    if SVSHAPE.invxyz[1]:
        steps.reverse()

which makes the function much more complex and isn't actually useful (it
flips the reduction tree vertically, exchanging the root end with the
leaves end), so I think that should be removed, and steps merged into the
main loop rather than being a separate loop generating a list.

therefore, rather than modifying the existing function, I copied the
function to iterate_indices2, removed steps reversing, and added prefix-sum
to the copy. since the prefix-sum is in an if all by itself, it can easily
be copied into the existing code if we don't want to remove steps reversing.

I added demo_prefix_sum as requested. I also added
test_remap_reduce_yield.py to verify that the prefix-sum functionality
matches nmutil.prefix_sum.
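for anyone following along, here's a rough sketch (hypothetical names, plain Python) of the kind of (left, right, result) schedule such an iterator produces; it is not the actual iterate_indices2 code, which also handles SVSHAPE masks, submodes and dimension inversion:

    # hypothetical sketch: yield (left_idx, right_idx, result_idx) triples
    # for a work-efficient prefix-sum over vl elements
    def prefix_sum_schedule(vl):
        # first phase: identical to the parallel-reduction tree
        dist = 1
        while dist < vl:
            for i in range(dist * 2 - 1, vl, dist * 2):
                yield i - dist, i, i   # regs[i] = op(regs[i - dist], regs[i])
            dist *= 2
        # second phase: the extra half that turns a reduction into a scan
        dist //= 2
        while dist >= 1:
            for i in range(dist * 3 - 1, vl, dist * 2):
                yield i - dist, i, i
            dist //= 2

    # e.g. for 4 elements: [(0, 1, 1), (2, 3, 3), (1, 3, 3), (1, 2, 2)]
    print(list(prefix_sum_schedule(4)))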
oh, I forgot to link to the commit:

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0b6592c574f814d81cfede4c74c50b583590db13

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Apr 26 22:09:19 2023 -0700

    add scan/prefix-sum support to copy of parallel-reduce remap iterator
commit d40763cd6e186ad9b17ce6f974a38b4c4877965e (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Thu Apr 27 10:23:27 2023 +0100

    link in new parallel-prefix REMAP schedule

so that's now linked in to SVSHAPE
other thing I noticed: we need an svshape instruction that can take a
dynamic length and produce the prefix-sum or reduce schedules; currently
svshape only supports constant lengths.
(In reply to Jacob Lifshay from comment #12)
> other thing I noticed: we need an svshape instruction that can take a
> dynamic length and produce the prefix-sum or reduce schedules; currently
> svshape only supports constant lengths.

they're all very deliberately immediate-based, because it otherwise means
that the establishment of REMAP Schedules is dependent on GPR registers.
the sole exception is Indexed REMAP, and even that is an "indicator to
trigger the process of caching GPRs" rather than an "actual Hazard".

what's the story there? why is dynamic length needed?
(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #12)
> > other thing I noticed: we need an svshape instruction that can take a
> > dynamic length and produce the prefix-sum or reduce schedules; currently
> > svshape only supports constant lengths.
>
> they're all very deliberately immediate-based, because it otherwise means
> that the establishment of REMAP Schedules is dependent on GPR registers.
> the sole exception is Indexed REMAP, and even that is an "indicator to
> trigger the process of caching GPRs" rather than an "actual Hazard".

I was expecting the input to be VL rather than a GPR...

> what's the story there? why is dynamic length needed?

because, for code that is trying to reduce or prefix-sum a large array, it
needs to be able to have a dynamic length for the vector tail (or similar
scenarios). (not the most efficient algorithm, but it needs to be possible
for non-associative ops):

    chunk_size = 64
    sum = 0
    for i in range(0, length, chunk_size):
        sum += reduce_chunk(VL=chunk_size)
    sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

or, for prefix-sum:

    chunk_size = 64
    sum = 0
    for i in range(0, len(array), chunk_size):
        vec = load(array[i:][:chunk_size], VL=chunk_size)
        vec = vec.prefix_sum(length=chunk_size)
        vec += splat(sum)
        sum = vec[-1]
        store(vec, array[i:][:chunk_size], VL=chunk_size)
    remainder = len(array) % chunk_size
    vec = load(array[i:][:remainder], VL=remainder)
    vec = vec.prefix_sum(length=remainder)  # dynamic VL here
    vec += splat(sum)
    store(vec, array[i:][:remainder], VL=remainder)
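for clarity, the same chunk-and-carry pattern as a runnable pure-software model (no SVP64 involved; itertools.accumulate stands in for the vector prefix-sum), showing where the data-dependent tail length comes in:

    from itertools import accumulate

    def chunked_prefix_sum(array, chunk_size=4):
        out = []
        carry = 0
        full = len(array) - len(array) % chunk_size
        for i in range(0, full, chunk_size):
            vec = list(accumulate(array[i:i + chunk_size]))  # VL = chunk_size
            vec = [v + carry for v in vec]
            carry = vec[-1]
            out.extend(vec)
        if len(array) % chunk_size:
            vec = list(accumulate(array[full:]))             # dynamic VL here
            out.extend(v + carry for v in vec)
        return out

    data = list(range(1, 11))   # length 10: two full chunks plus a tail of 2
    assert chunked_prefix_sum(data) == list(accumulate(data))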
(In reply to Jacob Lifshay from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > what's the story there? why is dynamic length needed?
>
> because, for code that is trying to reduce or prefix-sum a large array, it
> needs to be able to have a dynamic length for the vector tail (or similar
> scenarios):
>
>     for i in range(0, length, chunk_size):
>         sum += reduce_chunk(VL=chunk_size)
>     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

drat, you're right.

of course. okaaay, sigh. so this involves creating a new Form, with Parallel
Reduction similar to svshape2, "carving out" its own niche... urrr...
(In reply to Luke Kenneth Casson Leighton from comment #15)
> drat, you're right.

:)

> of course. okaaay, sigh. so this involves creating a new Form, with Parallel
> Reduction similar to svshape2, "carving out" its own niche... urrr...

as mentioned on irc, I think we should use one more XO bit and support a
GPR for X (not Y or Z) for all svshape modes.

SV Shape Dynamic: svshaped RA, SVyd, SVzd, SVrm, vf

    if RA = 0 then x_len <- VL
    else x_len <- (RA)
    VL <- remapped_len(x_len, ...)
    # since we don't want to set MAXVL from a GPR:
    MAXVL <- remapped_len(x_len=MAXVL, ...)
I finished adding prefix-sum to svshape, and added some prefix-sum tests
which pass! (both for add and sub)

The tests I added are pretty short and to-the-point, but not quite what you
were asking for, luke, so if you're happy with their showcase-ability we can
mark that as DONE.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b6a45579add1491f23a06ca5437c5e5e5989e75e

commit b6a45579add1491f23a06ca5437c5e5e5989e75e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:49:30 2023 -0700

    prefix-sum remap works!

commit 1f451e7846d4561a36a67d38814d814ea59f3802
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:48:03 2023 -0700

    change order to tuple in remap preduce tests/demos to match rest of simulator

commit 7facebc376274d533c49f34f0b296154ab30f044
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:47:12 2023 -0700

    fix <u and >u with int arguments
(In reply to Jacob Lifshay from comment #17)
> I finished adding prefix-sum to svshape, and added some prefix-sum tests
> which pass! (both for add and sub)
>
> The tests I added are pretty short and to-the-point, but not quite what you
> were asking for, luke, so if you're happy with their showcase-ability we can
> mark that as DONE.

yes, looks great. can you cut/paste the pseudocode into the Appendix of
REMAP? https://libre-soc.org/openpower/sv/remap/appendix/

it goes into the RFC and constitutes "questions feedback", so i can justify
putting budget your way for it.
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
> >
> >     for i in range(0, length, chunk_size):
> >         sum += reduce_chunk(VL=chunk_size)
> >     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
>
> drat, you're right.
>
> of course. okaaay, sigh. so this involves creating a new Form, with
> Parallel Reduction similar to svshape2, "carving out" its own niche...
> urrr...

ah. realised there's a complication. VL and MAXVL determine the number of
operations carried out, not the width of the reduction.

for now the tail will have to be done by recursive macros using immediates
manually.
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #15)
> > (In reply to Jacob Lifshay from comment #14)
> > >
> > >     for i in range(0, length, chunk_size):
> > >         sum += reduce_chunk(VL=chunk_size)
> > >     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
> >
> > drat, you're right.
> >
> > of course. okaaay, sigh. so this involves creating a new Form, with
> > Parallel Reduction similar to svshape2, "carving out" its own niche...
> > urrr...
>
> ah. realised there's a complication. VL and MAXVL determine the number of
> operations carried out, not the width of the reduction.
>
> for now the tail will have to be done by recursive macros using immediates
> manually.

that's why I'm proposing an svshape with GPR/VL input, so it can calculate
the proper VL (which is a little less than 2x bigger, but doesn't follow
any simple formula I could find).

MAXVL would be adjusted following the same algorithm, but using MAXVL as
the input rather than VL or a GPR.
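rough numbers to illustrate the "a little less than 2x" point, computed from the hypothetical work-efficient schedule sketched earlier in this thread (the actual SVP64 REMAP schedule's counts may differ slightly):

    def scan_op_count(n):
        ops, dist = 0, 1
        while dist < n:                       # reduction-tree half
            ops += len(range(dist * 2 - 1, n, dist * 2))
            dist *= 2
        dist //= 2
        while dist >= 1:                      # second half of the scan
            ops += len(range(dist * 3 - 1, n, dist * 2))
            dist //= 2
        return ops

    for n in range(2, 11):
        print(n, scan_op_count(n))            # e.g. 8 elements -> 11 operations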
(In reply to Jacob Lifshay from comment #20)
> that's why I'm proposing an svshape with GPR/VL input, so it can calculate
> the proper VL (which is a little less than 2x bigger, but doesn't follow
> any simple formula I could find).
>
> MAXVL would be adjusted following the same algorithm, but using MAXVL as
> the input rather than VL or a GPR.

for details, see https://bugs.libre-soc.org/show_bug.cgi?id=1071#c16
(In reply to Jacob Lifshay from comment #16)
>     if RA = 0 then x_len <- VL
>     else x_len <- (RA)
>     VL <- remapped_len(x_len, ...)
>     # since we don't want to set MAXVL from a GPR:
>     MAXVL <- remapped_len(x_len=MAXVL, ...)

this will fail on the next iteration unless preceded by a
"setvl MAXVL=n" instruction.

my feeling is that this should be done after the first round, when
investigating EXT1xx variants of SVP64 Management instructions.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #16)
> >     if RA = 0 then x_len <- VL
> >     else x_len <- (RA)
> >     VL <- remapped_len(x_len, ...)
> >     # since we don't want to set MAXVL from a GPR:
> >     MAXVL <- remapped_len(x_len=MAXVL, ...)
>
> this will fail on the next iteration unless preceded by a
> "setvl MAXVL=n" instruction.

yup, that's the idea -- run setvl first -- hence why I was initially
thinking it would only read from VL and not a GPR. maybe just reserve one
uncommon SVxd immediate (59?) to mean "read SVxd from VL instead"? that way
you don't need 2x the encoding space.

though i also just now thought of a way around needing setvl -- have 2
MAXVL fields or SPRs, tentatively named OP_MAXVL (used as a limit on
element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL
and for register offsetting like RS=RT+MAXVL, since there we care about
register counts, not how many operations our random svshape needs). both
are set by setvl when setting MAXVL, but svshape and friends only set
OP_MAXVL, computing it from REG_MAXVL or the immediates. it may also be
useful to likewise split VL into two, but that's not as necessary.
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > (In reply to Jacob Lifshay from comment #16)
> >
> > this will fail on the next iteration unless preceded by a
> > "setvl MAXVL=n" instruction.
>
> yup, that's the idea -- run setvl first -- hence why I was initially
> thinking it would only read from VL and not a GPR. maybe just reserve one
> uncommon SVxd immediate (59?) to mean "read SVxd from VL instead"?

maaybeee... although realistically this is an entirely new instruction so
in theory it could do anything.

> that way you don't need 2x the encoding space.

irony: 64-bit instruction encodings (for e.g. svshape) are just as long as
2 32-bit instructions. sigh.

> though i also just now thought of a way around needing setvl -- have 2
> MAXVL fields or SPRs, tentatively named OP_MAXVL (used as a limit on
> element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL
> and for register offsetting like RS=RT+MAXVL,

DCT and FFT rely on RS=RT+MAXVL being separate.

> since there we care about register counts, not how many operations our
> random svshape needs). both are set by setvl when setting MAXVL, but
> svshape and friends only set OP_MAXVL, computing it from REG_MAXVL or the
> immediates. it may also be useful to likewise split VL into two, but
> that's not as necessary.

very reluctant to go to a 2nd SPR for SV state. especially right now.
(In reply to Luke Kenneth Casson Leighton from bug #1155 comment #43)
> not happening, i have said this before and it is inviolate and not
> to be discussed further under this bugreport as it is firmly out of
> scope. they can write directly to SVSHAPEs and take the consequences
> after reading the spec's warnings.

by choosing not to allow a dynamic size on svshape instructions, you are
forcing parallel reduction to be much less powerful than RVV (RVV supports
dynamic reduction sizes). computing the right value of VL from the length
is complex enough that it really shouldn't be done by software (it would be
maybe 10-20 instructions); hardware can be substantially more efficient,
since computing the length is basically a sum of several simple-ish terms,
which is where carry-save adders shine.

also, setvl reads from registers; if svshape reads from registers it
behaves very similarly in how the sv-prefix loop expander could have to
wait for the value to be computed, and you had no problems with setvl...
(In reply to Jacob Lifshay from comment #25)
> (In reply to Luke Kenneth Casson Leighton from bug #1155 comment #43)
> > not happening, i have said this before and it is inviolate and not
> > to be discussed further under this bugreport as it is firmly out of
> > scope. they can write directly to SVSHAPEs and take the consequences
> > after reading the spec's warnings.
>
> by choosing not to allow a dynamic size on svshape instructions

READ WHAT I WROTE. the consequences are too damaging on the hardware,
making it impractical.

> you are forcing parallel reduction to be much less powerful than RVV
> (RVV supports dynamic reduction sizes),

i don't care what RVV does, here. they do not have precise interrupt
guarantees as a hard requirement, for a start.

> computing the right value of VL from the length is complex enough that it
> really shouldn't be done by software (it would be maybe 10-20
> instructions),

then you should explain and illustrate that, instead of saying "it MUST be
dynamic" with no justification or explanation, given that you have no idea
of how to properly think through the consequences.

svindex and svoffset already compute a 2nd dimension based on VL, however
it is done from immediates (not GPR Hazards).

> hardware can be substantially more efficient, since computing the length
> is basically a sum of several simple-ish terms, which is where carry-save
> adders shine.

then, once you have explained the problem, of which i have no idea of its
scope, potential solutions can be investigated *without* catastrophically
damaging SV.

> also, setvl reads from registers; if svshape reads from registers it
> behaves very similarly in how the sv-prefix loop expander could have to
> wait for the value to be computed, and you had no problems with setvl...

it's the one and only entrypoint, and as such is limited in Register
Hazards. svshape has FIVE register hazards as outputs already. the
absolute last thing we need is to make that SIX, one of which is a GPR as
input. the consequences would be catastrophic for the Hazard Management.

HARD NO means HARD NO, jacob.

so. please describe *exactly* the algorithm, or illustrate it clearly, so
that i can assess it properly.
ok, so jacob, you need to examine the registers in and out on all of the sv
management instructions, memorise them, and start to understand - and
accept - fully, the consequences for implementability.

* setvl has GPR in, GPR out, SVSTATE in and out, and CTR in and out. both
  SVSTATE and CTR are "state" at the same peer level as PC and MSR, so are
  exceptional: they are not entirely normal Hazards (just like PC is not a
  normal hazard). *and* Rc=1. *SEVEN* registers. that's insane. some
  implementations will just have a hard time keeping the IPC at 1, here.
  it is what it is.
* svstep is GPR out and SVSTATE in and out (i added GPR in recently but
  will likely back it out), *and* Rc=1.
* with the exception of svindex, the REMAP instructions ONLY modify "state"
  at the level of PC and MSR: SVSTATE and SVSHAPE*. svindex had to have
  some very special conditions placed on it to not cause catastrophic
  damage to implementability (read the spec on this please).
(In reply to Luke Kenneth Casson Leighton from comment #26)
> the consequences are too damaging on the hardware, making it impractical.

I have read what you wrote and I disagree.

> > you are forcing parallel reduction to be much less powerful than RVV
> > (RVV supports dynamic reduction sizes),
>
> i don't care what RVV does, here. they do not have precise interrupt
> guarantees as a hard requirement, for a start.

if we choose to have a dynamic svshape instruction, it shouldn't affect
precise interrupt guarantees, since either the svshape instruction is
finished or it's not and the outputs aren't yet written. the hard part is
the SV loop afterwards, which will have precise interrupts regardless of
whether the svshape registers are set from an svshape instruction's
immediates, from a dynamic svshape instruction, or from a mtspr.

> > computing the right value of VL from the length is complex enough that
> > it really shouldn't be done by software (it would be maybe 10-20
> > instructions),
>
> then you should explain and illustrate that, instead of saying "it MUST be
> dynamic" with no justification or explanation, given that you have no idea
> of how to properly think through the consequences.

the current algorithm:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=33a02e6612065f290d840e15a596dfc2177de5e5;hb=fa603a1e9f2259d86acf4e9451937a000d099289#l309

    step <- 0b0000001
    i <- 0b0000000
    do while step <u itercount
        newstep <- step[1:6] || 0b0
        j[0:6] <- 0b0000000
        do while (j+step <u itercount)
            j <- j + newstep
            i <- i + 1
        step <- newstep
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i

now, obviously that's not very suited to hardware in that form, so we
refactor it.

replace the inner loop with an expression:

    i <- i + (itercount + step - 1) / newstep

replace step and newstep with shifting (since they're always powers of 2):

    shift <- 0
    i <- 0b0000000
    do while (1 << shift) <u itercount
        i <- i + (itercount + (1 << shift) - 1) >> (shift + 1)
        shift <- shift + 1
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i

assume itercount is in [0, 127], and change the loop to a fixed number of
iterations (all extra iterations add 0, so are no-ops):

    i <- 0b0000000
    do shift = 0 to 6
        i <- i + (itercount + (1 << shift) - 1) >> (shift + 1)
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i

unroll:

    i <- itercount >> 1
    i <- i + (itercount + 1) >> 2
    i <- i + (itercount + 3) >> 3
    i <- i + (itercount + 7) >> 4
    i <- i + (itercount + 15) >> 5
    i <- i + (itercount + 31) >> 6
    i <- i + (itercount + 63) >> 7
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i

now we finally have something suitable for HW! in software, this would be
12 adds and 7 shifts, so 19 instructions. in HW, using carry-save adders
for the final sum, this could likely be done in 1 cycle, 2 at the very
most.

> svshape has FIVE register hazards as outputs already. the absolute
> last thing we need is to make that SIX, one of which is a GPR as input.
> the consequences would be catastrophic for the Hazard Management.

3 points:

1. for prefix-sum/reduction, it is sufficient to read from VL, which makes
   hazard-tracking that much simpler, since it's one possible register and
   not 128.
2. I see no reason why all SVSTATE registers can't be hazard-tracked as one
   big register (since they're almost always used together anyway), which
   reduces the number of register hazards for dynamic svshape to 2-in (VL
   and MAXVL) and 3-out (VL, MAXVL and the SVSTATE* register).
3.
as you mentioned while I was writing this, SVSTATE, VL, and MAXVL are
similar to PC, so, assuming we're talking about a big OoO core, they don't
need a huge hazard-tracking matrix; instead it's basically like a branch:
cancel all following speculative instructions and then re-issue them with
the correct state.
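going back to the vlen refactoring above: as a quick sanity check, here's a throwaway Python script (plain integers, assuming <u is ordinary unsigned comparison) confirming that the unrolled form matches the original double loop for every itercount in [0, 127]:

    def vlen_loop(itercount):
        step, i = 1, 0
        while step < itercount:          # step <u itercount
            newstep = step << 1          # step[1:6] || 0b0
            j = 0
            while j + step < itercount:
                j += newstep
                i += 1
            step = newstep
        return i

    def vlen_unrolled(itercount):
        i = itercount >> 1
        for shift in range(1, 7):        # the six remaining unrolled terms
            i += (itercount + (1 << shift) - 1) >> (shift + 1)
        return i

    assert all(vlen_loop(n) == vlen_unrolled(n) for n in range(128))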
(In reply to Jacob Lifshay from comment #28)
> 2. I see no reason why all SVSTATE registers can't be hazard-tracked as one
>    big register (since they're almost always used together anyway), which
>    reduces the number of register hazards for dynamic svshape to 2-in (VL
>    and MAXVL) and 3-out (VL, MAXVL and the SVSTATE* register).

I meant SVSHAPE registers, sorry.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> svindex had to have some very special conditions placed on it to not cause
> catastrophic damage to implementability (read the spec on this please).

I have; recall that I tried to convince you to add out-of-range zeroing...
(In reply to Jacob Lifshay from comment #28)
> in software, this would be 12 adds and 7 shifts, so 19 instructions. in HW,
> using carry-save adders for the final sum, this could likely be done in 1
> cycle, 2 at the very most.

note, the 19 instructions don't even include the extra instructions to set
SVSHAPE*.
(In reply to Jacob Lifshay from comment #28)
> (In reply to Luke Kenneth Casson Leighton from comment #26)
> > the consequences are too damaging on the hardware, making it impractical.
>
> I have read what you wrote and I disagree.

you can disagree all you like on this one: it's not happening. any *other*
way is perfectly fine, but adding GPR dependencies to REMAP SVSTATE/SVSHAPE
is an absolute, inviolate, hard NO, and that is the end of the discussion,
no matter how much you disagree, or do not like it, or how much it degrades
"performance".

a separate instruction that carries out the job of computing the contents
of SVSHAPE, such that it can be moved in with mtspr? not a problem at all.
changing sv* REMAP instructions to add a GPR as a Hazard? get it and accept
it: the answer is NO.

> if we choose to have a dynamic svshape instruction, it shouldn't affect
> precise interrupt guarantees,

it destroys implementability by creating critical linkage between GPRs at
Decode and Issue. it is not going to happen. get this through your head,
STOP ARGUING, and start questioning WHY until you do actually understand
why, ok?

> the hard part is the SV loop afterwards, which will have precise interrupts
> regardless of whether the svshape registers are set from an svshape
> instruction's immediates, from a dynamic svshape instruction, or from a
> mtspr.

and dozens of other instructions with outstanding GPR Hazards. if there
were a separate register file for Indices this would not be a problem:
transfer from GPRs to a "REMAP-Indices-regfile" would be a single 1-in
1-out instruction. we have been through this already with the CR regfile
and with the GPR-FPR instructions. GPR-FPR instructions are *very
deliberately* only 1-in 1-out, as anything else is catastrophic for the
size of the Hazard Matrices.

> the current algorithm:
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/
> simplev.mdwn;h=33a02e6612065f290d840e15a596dfc2177de5e5;
> hb=fa603a1e9f2259d86acf4e9451937a000d099289#l309

ok, let me take a look and try to understand what is going on.