Parallel prefix sum (also known as `scan`) is a commonly useful operation. Since there's space in the REMAP modes (mode 0b11) now that some were removed, I think we should add it. Some of the applications of parallel prefix sum: https://en.wikipedia.org/wiki/Prefix_sum#Applications

* Counting sort/Radix sort (the most common sorting algorithm used on GPUs)
* Decoding Gray code
* Parallel polynomial interpolation
* Computing factorials (prefix-product of 1, 2, 3, 4, ...)

Task list:

* WIP: jacob: add SVSHAPE.skip-driven update of remap_preduce_yield.py. done, except as a new function iterate_indices2; the only thing left is to merge it into the original function after deciding whether we want to delete the `steps` reversing or not. See comment #8
* DONE: jacob: add a *really* short demo2() bare-minimum test
* DONE: jacob: add unit tests verifying that remap_preduce_yield.py's prefix-sum matches the known-working code in nmutil.prefix_sum
* DONE: jacob: integrate into the svshape pseudocode
* DONE: luke: update the wiki
* DONE: lkcl: add into ISACaller svshape.py
* DONE: jacob: add some unit tests into test_caller_svp64_preduce.py; these should be "showcaseable", i.e. if presented at a conference they should be easy to understand
* TODO: jacob: put the pseudocode into the spec/RFC https://libre-soc.org/openpower/sv/remap/appendix/
we can derive the algorithm from the already-known-working nmutil prefix_sum.py, which implements both the work-efficient and the low-depth algorithms:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35
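As a rough sketch of what the work-efficient (two-phase) algorithm does, here is a minimal self-contained Python version. The function name and loop structure are illustrative, not a copy of nmutil's actual code:

```python
def prefix_sum_work_efficient(items):
    """Inclusive prefix sum via the work-efficient two-phase algorithm
    (illustrative sketch, not the nmutil implementation).
    Phase 1 (up-sweep) is the same tree as parallel reduction;
    phase 2 (down-sweep) fills in the remaining partial sums."""
    items = list(items)
    n = len(items)
    # up-sweep: combine pairs at doubling strides; after this,
    # power-of-two-boundary slots hold partial sums
    dist = 1
    while dist < n:
        for i in range(dist * 2 - 1, n, dist * 2):
            items[i] += items[i - dist]
        dist *= 2
    # down-sweep: propagate partial sums to the slots the
    # up-sweep skipped, at halving strides
    dist //= 2
    while dist >= 1:
        for i in range(dist * 3 - 1, n, dist * 2):
            items[i] += items[i - dist]
        dist //= 2
    return items
```

For example, `prefix_sum_work_efficient([1, 1, 1, 1, 1, 1, 1, 1])` yields `[1, 2, 3, 4, 5, 6, 7, 8]`.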
how is it different from parallel prefix reduction bug #864? https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;hb=HEAD
(In reply to Luke Kenneth Casson Leighton from comment #2)
> how is it different from parallel prefix reduction bug #864?
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;hb=HEAD

parallel reduction is a reduction, i.e. it has only one output; the rest of the output vector is intermediate results that aren't a prefix-sum. parallel reduction is basically half of a prefix-sum operation.
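To make the "half a prefix-sum" point concrete, here is an illustrative sketch (names are made up): running only the up-sweep half of the tree gives the full sum in the last slot, but the other slots hold intermediates, not the prefix-sum:

```python
def tree_reduce_in_place(items):
    """Up-sweep half of the scan tree only, i.e. parallel reduction
    (illustrative sketch). After this, items[-1] holds the total;
    the other slots are intermediates, NOT a prefix-sum."""
    items = list(items)
    dist = 1
    while dist < len(items):
        for i in range(dist * 2 - 1, len(items), dist * 2):
            items[i] += items[i - dist]
        dist *= 2
    return items
```

For example, `tree_reduce_in_place([1, 2, 3, 4])` gives `[1, 3, 3, 10]`: the total (10) is correct, but index 2 holds 3 rather than the prefix-sum value 6, which is what the missing down-sweep half would fix.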
https://libre-soc.org/openpower/sv/remap/

    3.1 Parallel Reduction Mode

    Creates the Schedules for Parallel Tree Reduction.

    submode=0b00 selects the left operand index
    submode=0b01 selects the right operand index

it's *just* about possible to squeeze in alternative modes without going overboard: svshape's 3rd dimension (zdim) could in theory be used to indicate an alternative reduction algorithm.
basically, the parallel reduction algorithm we use is similar to only the first loop in the work-efficient prefix-sum; the second loop is missing:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71
(In reply to Jacob Lifshay from comment #5)
> basically the parallel reduction algorithm we use is similar to only the
> first loop in work-efficient prefix-sum, the second loop is missing
> https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71

for a visual diagram, see:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l164
(In reply to Jacob Lifshay from comment #3)
> parallel reduction is a reduction aka has only one output, the rest of the
> output vector is intermediate results that aren't a prefix-sum. parallel
> reduction is basically 1/2 a prefix-sum operation.

ahh, that's the second half of the tree in the image on the wikipedia page. okaaay. right. yes, it's doable. it's pretty seriously late in the day but i like it. ok.

so if SVyd is free... which it is...
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=c350b3cf86fadcfdd22663c7119249e00328ff4a;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l281

then at lines 301 and 305, if SVyd == 1 then set submode=0b10/0b11 instead of submode=0b00/0b01 in SVSHAPE0 and SVSHAPE1 respectively.

then in svshape.py:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/svshape.py;h=8b4533755a214d243b509ca20994a7b3ff651cdf;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l166

line 170 needs sub-mode detection... or does it... no. iterate_preduce_indices() is where the sub-mode would be looked at and the second half done.

do you want to have a go at that? keep it *real* simple, one step at a time. can you add support for:

* SVSHAPE.skip == 0b10 -> submode 10 (LH argument) parallel-prefix
* SVSHAPE.skip == 0b11 -> submode 11 (RH argument) parallel-prefix

line 30
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=a8093803da46ef81a1254b2b2c5b00f153568578;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l34

and you *should* just be able to add the 2nd half onto the end (line 43 onwards) and do

    if SVSHAPE.skip in [0b00, 0b01]: return

and that should basically be it. add a *real* simple 2nd "demo()" function, please *don't* go massively overboard, or make significant changes to this code, ok? one step at a time, remap_preduce_yield.py first.
as mentioned on irc, please add a todo list in the top comment.

(In reply to Luke Kenneth Casson Leighton from comment #7)
> iterate_preduce_indices() is where sub-mode would be
> looked at and the second half done.
>
> do you want to have a go at that? keep it *real* simple

yeah, i can do that, probably later today
(In reply to Luke Kenneth Casson Leighton from comment #7)
> and you *should* just be able to add the 2nd half onto the end (line 43
> onwards) and do
>
> "if SVSHAPE.skip in [0b00, 0b01]: return"

turns out that reduce and prefix-sum are different enough that merging their loops would be pretty confusing, so I just stuck all the prefix-sum code in a big if. If/when we really need to merge them, we can refactor it then. I think the pseudo-code will also benefit from having prefix-sum completely separate from reduction, since it makes the loops much easier to understand.

> and that should basically be it. add a *real* simple 2nd "demo()" function,
> please *don't* go massively overboard, or make significant changes to this
> code ok?

While I was reading the existing code, I noticed you have:

    if SVSHAPE.invxyz[1]:
        steps.reverse()

which makes the function much more complex and isn't actually useful (it flips the reduction tree vertically, exchanging the root end with the leaves end), so I think that should be removed and `steps` merged into the loop rather than being a separate loop generating a list.

Therefore, rather than modifying the existing function, I copied the function to iterate_indices2, removed the `steps` reversing, and added prefix-sum to the copy. Since prefix-sum is in an if all by itself, it can easily be copied into the existing code if we don't want to remove the `steps` reversing.

I added demo_prefix_sum as requested. I also added test_remap_reduce_yield.py to verify that the prefix-sum functionality matches nmutil.prefix_sum.
oh, I forgot to link to the commit:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0b6592c574f814d81cfede4c74c50b583590db13

commit 0b6592c574f814d81cfede4c74c50b583590db13
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Apr 26 22:09:19 2023 -0700

    add scan/prefix-sum support to copy of parallel-reduce remap iterator
commit d40763cd6e186ad9b17ce6f974a38b4c4877965e (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Thu Apr 27 10:23:27 2023 +0100

    link in new parallel-prefix REMAP schedule

so that's now linked in to SVSHAPE
Another thing I noticed: we need an svshape instruction that can take a dynamic length and produce the prefix-sum or reduce schedules; currently svshape only supports constant lengths.
(In reply to Jacob Lifshay from comment #12)
> other thing I noticed, we need a svshape instruction that can take a dynamic
> length and produce the prefix-sum or reduce schedules, currently svshape
> only supports constant lengths.

they're all very deliberately immediate-based, because it otherwise means that the establishment of REMAP Schedules is dependent on GPR registers. the sole exception is Indexed REMAP, and even that is an "indicator to trigger the process of cacheing GPRs" rather than an "actual Hazard".

what's the story there? why is dynamic length needed?
(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #12)
> > other thing I noticed, we need a svshape instruction that can take a dynamic
> > length and produce the prefix-sum or reduce schedules, currently svshape
> > only supports constant lengths.
>
> they're all very deliberately immediate-based because it otherwise
> means that the establishment of REMAP Schedules is dependent on
> GPR registers. the sole exception is Indexed REMAP and even that
> is an "indicator to trigger the process of cacheing GPRs" rather than
> "actual Hazard"

I was expecting the input to be VL rather than a GPR...

> what's the story there? why is dynamic length needed?

because code that is trying to reduce or prefix-sum a large array needs to be able to have a dynamic length for the vector tail (or similar scenarios). not the most efficient algorithm, but this needs to be possible for non-associative ops:

    chunk_size = 64
    sum = 0
    for i in range(0, length, chunk_size):
        sum += reduce_chunk(VL=chunk_size)
    sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

or, for prefix-sum:

    chunk_size = 64
    sum = 0
    for i in range(0, len(array), chunk_size):
        vec = load(array[i:][:chunk_size], VL=chunk_size)
        vec = vec.prefix_sum(length=chunk_size)
        vec += splat(sum)
        sum = vec[-1]
        store(vec, array[i:][:chunk_size], VL=chunk_size)
    remainder = len(array) % chunk_size
    vec = load(array[i:][:remainder], VL=remainder)
    vec = vec.prefix_sum(length=remainder)  # dynamic VL here
    vec += splat(sum)
    store(vec, array[i:][:remainder], VL=remainder)
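As a runnable model of the chunked prefix-sum loop above (plain Python standing in for the vector loads, stores and splat; `chunk_size` and the sequential inner scan are illustrative, not how the hardware would do it):

```python
def chunked_prefix_sum(array, chunk_size=4):
    """Model of the chunked prefix-sum sketch: scan each fixed-size
    chunk, add the running total from previous chunks (the "splat"),
    and carry the last element forward. The final, shorter chunk is
    where a dynamic VL would be needed on real hardware."""
    out = []
    total = 0
    for i in range(0, len(array), chunk_size):
        chunk = array[i:i + chunk_size]  # tail chunk may be shorter
        # stand-in for the vectorized per-chunk prefix-sum
        acc = 0
        vec = []
        for x in chunk:
            acc += x
            vec.append(acc)
        # vec += splat(total), then remember the new running total
        out.extend(v + total for v in vec)
        total = out[-1]
    return out
```

For example, `chunked_prefix_sum([1, 1, 1, 1, 1, 1, 1], chunk_size=3)` gives `[1, 2, 3, 4, 5, 6, 7]`, with the last chunk of length 1 playing the role of the dynamic-VL tail.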
(In reply to Jacob Lifshay from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > what's the story there? why is dynamic length needed?
>
> because, for code that is trying to reduce or prefix-sum a large array, it
> needs to be able to have a dynamic length for the vector tail (or similar
> scenarios):
>
>     for i in range(0, length, chunk_size):
>         sum += reduce_chunk(VL=chunk_size)
>     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

drat, you're right. of course. okaaay sigh. so this involves creating a new Form, with Parallel Reduction similar to svshape2 "carving out" its own niche... urrr...
(In reply to Luke Kenneth Casson Leighton from comment #15)
> drat, you're right.

:)

> of course. okaaay sigh so this involves creating a new Form, with Parallel
> Reduction similar to svshape2 "carving out" its own niche... urrr...

as mentioned on irc, I think we should use one more XO bit and support a GPR for X (not Y or Z) for all svshape modes.

SV Shape Dynamic:

    svshaped RA, SVyd, SVzd, SVrm, vf

    if RA = 0 then x_len <- VL
    else x_len <- (RA)
    VL <- remapped_len(x_len, ...)
    # since we don't want to set MAXVL from a GPR:
    MAXVL <- remapped_len(x_len=MAXVL, ...)
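A tiny Python model of the proposed semantics may help; `remapped_len` is a stand-in for the (as yet unspecified) schedule-length calculation, and registers are modeled as plain values. This is a sketch of the proposal, not existing simulator code:

```python
def svshaped_model(RA_value, VL, MAXVL, remapped_len):
    """Model of the proposed svshaped: the schedule length comes from
    VL (when RA=0, modeled here as RA_value=None) or a GPR, while
    MAXVL is recomputed from its own previous value, never from a GPR.
    remapped_len is a hypothetical stand-in for the schedule-length
    function."""
    x_len = VL if RA_value is None else RA_value
    new_VL = remapped_len(x_len)
    new_MAXVL = remapped_len(MAXVL)  # deliberately not GPR-derived
    return new_VL, new_MAXVL
```

For instance, with a toy `remapped_len` of `lambda n: 2 * n - 1`, `svshaped_model(None, 4, 8, ...)` returns `(7, 15)`: VL is remapped from VL itself, MAXVL from MAXVL.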
I finished adding prefix-sum to svshape, and added some prefix-sum tests which pass! (both for add and sub)

The tests I added are pretty short and to-the-point, but not quite what you were asking for, luke, so if you're happy with their showcase-ability we can mark that as DONE.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b6a45579add1491f23a06ca5437c5e5e5989e75e

commit b6a45579add1491f23a06ca5437c5e5e5989e75e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:49:30 2023 -0700

    prefix-sum remap works!

commit 1f451e7846d4561a36a67d38814d814ea59f3802
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:48:03 2023 -0700

    change order to tuple in remap preduce tests/demos to match rest of simulator

commit 7facebc376274d533c49f34f0b296154ab30f044
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:47:12 2023 -0700

    fix <u and >u with int arguments
(In reply to Jacob Lifshay from comment #17)
> I finished adding prefix-sum to svshape, and added some prefix-sum tests
> which pass! (both for add and sub)
>
> The tests I added are pretty short and to-the-point, but not quite what you
> were asking for, luke, so if you're happy with their showcase-ability we can
> mark that as DONE.

yes, looks great. can you cut/paste the pseudocode into the Appendix of REMAP?
https://libre-soc.org/openpower/sv/remap/appendix/

it goes into the RFC and constitutes "questions feedback", so i can justify putting budget your way for it.
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
>
>     for i in range(0, length, chunk_size):
>         sum += reduce_chunk(VL=chunk_size)
>     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
>
> drat, you're right.
>
> of course. okaaay sigh so this involves creating a new Form, with Parallel
> Reduction similar to svshape2 "carving out" its own niche... urrr...

ah. realised there's a complication. VL and MAXVL determine the number of operations carried out, not the width of the reduction.

for now, the tail will have to be done by recursive macros using immediates manually.
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #15)
> > (In reply to Jacob Lifshay from comment #14)
> >
> >     for i in range(0, length, chunk_size):
> >         sum += reduce_chunk(VL=chunk_size)
> >     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
>
> ah. realised there's a complication. VL and MAXVL determines the number
> of operations carried out, not the width of the reduction.
>
> for now the tail will have to be done by recursive macros using
> immediates manually.

that's why I'm proposing an svshape with GPR/VL input, so it can calculate the proper VL (which is a little less than 2x the element count, but doesn't follow any simple formula I could find). MAXVL would be adjusted following the same algorithm, but using MAXVL as the input rather than VL or a GPR.
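The "a little less than 2x" observation can be checked by counting the add operations the standard two-phase work-efficient scan issues for a given element count. This counter is an illustrative model mirroring that algorithm's loop bounds, not the actual remap iterator:

```python
def prefix_sum_op_count(n):
    """Count the add operations a work-efficient two-phase inclusive
    scan issues for n elements -- i.e. the VL the schedule would need
    (illustrative model, not the actual remap code)."""
    count = 0
    # up-sweep passes
    dist = 1
    while dist < n:
        count += len(range(dist * 2 - 1, n, dist * 2))
        dist *= 2
    # down-sweep passes
    dist //= 2
    while dist >= 1:
        count += len(range(dist * 3 - 1, n, dist * 2))
        dist //= 2
    return count
```

For example, 8 elements need 11 operations: below 2x8 = 16, but not related to the element count by any obvious closed form.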
(In reply to Jacob Lifshay from comment #20)
> that's why I'm proposing a svshape with GPR/VL input, so it can calculate
> the proper VL (which is a little less than 2x bigger but doesn't follow any
> simple formula I could find).
>
> MAXVL would be adjusted following the same algorithm but using MAXVL as the
> input rather than VL or a GPR.

for details, see https://bugs.libre-soc.org/show_bug.cgi?id=1071#c16
(In reply to Jacob Lifshay from comment #16)
>     if RA = 0 then x_len <- VL
>     else x_len <- (RA)
>     VL <- remapped_len(x_len, ...)
>     # since we don't want to set MAXVL from a GPR:
>     MAXVL <- remapped_len(x_len=MAXVL, ...)

this will fail on the next iteration unless preceded by a setvl MAXVL=n instruction.

my feeling is that this should be done after the first round, investigating EXT1xx variants of SVP64 Management instructions.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #16)
>
> >     if RA = 0 then x_len <- VL
> >     else x_len <- (RA)
> >     VL <- remapped_len(x_len, ...)
> >     # since we don't want to set MAXVL from a GPR:
> >     MAXVL <- remapped_len(x_len=MAXVL, ...)
>
> this will fail on the next iteration unless preceded by a
> setvl MAXVL=n instruction.

yup, that's the idea -- run setvl first -- hence why I was initially thinking it would only read from VL and not a GPR. maybe just reserve one uncommon SVxd immediate (59?) to mean "read SVxd from VL instead"? that way you don't need 2x the encoding space.

though i also just now thought of a way around needing setvl -- have 2 MAXVL fields or SPRs, tentatively named OP_MAXVL (used as a limit on element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL, and for whenever doing register offsetting like RS=RT+MAXVL, since there we care about register counts, not how many operations our random svshape needs). both would be set by setvl when setting MAXVL, but svshape and friends would only set OP_MAXVL, computing it from REG_MAXVL or the immediates. it may also be useful to likewise split VL into two, but that's not as necessary.
(In reply to Jacob Lifshay from comment #23)
> yup, that's the idea -- run setvl first -- hence why I was initially
> thinking it would only read from VL and not a GPR. maybe just reserve one
> uncommon SVxd immediate (59?) to mean to instead read SVxd from VL?

maaybeee... although realistically this is an entirely new instruction, so in theory it could do anything.

> that way you don't need 2x the encoding space.

irony: 64-bit instruction encodings (for e.g. svshape) are just as long as 2 32-bit instructions. sigh.

> though i also just now thought of a way around needing setvl -- have 2 MAXVL
> fields or SPRs, tentatively named OP_MAXVL (used as a limit on
> element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL
> and for whenever doing register offsetting like RS=RT+MAXVL,

DCT and FFT rely on RS=RT+MAXVL being separate.

> since there we
> care about register counts not how many operations our random svshape needs)
> both set by setvl when setting MAXVL, but svshape and friends only set
> OP_MAXVL, computing it from REG_MAXVL or the immediates. it may be also
> useful to likewise split VL into two, but not as necessary.

very reluctant to go to a 2nd SPR for SV state. especially right now.