https://groups.google.com/g/comp.arch/c/5h6jzyESg5s create example SVP64 assembly for: int m2(int * const restrict a, int n) { int m, nm; int i; m = INT_MIN; nm = -1; for (i=0; i<n; i++) { if (a[i] > m) { m = a[i]; nm = i; } } return nm; }
* a data-dependent fail-first sv.cmp tests against a[i] <= m. * SVP64 Branch new CTR Mode would loop back over all elements that failed, decrementing CTR in the process. * a data-dependent ffirst sv.cmp a[i] > m creates a predicate mask where VL is truncated to exclude the first failed element.
int m2(int * const restrict a, int n) { int m, nm; int i; m = INT_MIN; nm = -1; i=0; while (i<n) { while (i<n && a[i]<=m) i++; if (a[i] > m) { m = a[i]; nm = i; } i++; } return nm; } bug https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/FiMO3falAgAJ while (i<n) { while (i<n && a[i]<=m) i++; if (i < n) { m = a[i]; nm = i; } i++; } alternative? while (i<n) { // skip up to first max while (i<n && a[i]<=m) i++; // continue as long as picking new m while (i<n && a[i]>m) { m = a[i]; nm = i; i++; } }
changing milestone to match parent task for budget allocation
(In reply to Luke Kenneth Casson Leighton from comment #0) this one is remarkably simple: use the characteristics of mapreduce mode to overwrite nm from a parallel comparison > int m2(int * const restrict a, int n) > { > int m, nm; > int i; > > m = INT_MIN; > nm = -1; first prepare offsets setvl vl=16 sv.svstep *offs, srcstep # 0 1 2 ... vl-1 and nm-potential li nm, -1 li nmbase, 0 next use CTR mode > for (i=0; i<n; i++) > { setvl r1,0,16,0,1,1 # VL=r1=MIN(MAXVL,CTR) perform the load of the vector data, post-increment update: sv.ldux/pi *a, 8(a_ptr) > if (a[i] > m) > { get max comparison, overlap-loop into m where reg a=m+1 sv.max. *m, *a, *m next, these two both use mapreduce mode with predication > m = a[i]; sv.ori/mr/m=gt m,*a,0 > nm = i; add-overwriting base with vector-offset into nm, yes *scalar* nm, the "last winner" is the largest index sv.add/mr/m=gt nm, nmbase, *offs > } nmbase must be incremented by vl add nmbase, nmbase, r1 # r1 contains copy of vl branch and subtract VL from CTR if not end of loop sv.bc/all 16, *0, loop > } > return nm; mr r3, nmbase blr the trick here is the multiple overwrites of nm and m, even though they are scalar, the usual "termination of looping because scalar result" is switched *off* in mapreduce mode. thus due to the predication the last successful conditional overwrite "wins" the second trick was to use a base nm which increments by VL, from a vector of sequential constants. there will be better/other methods, using data-dependent failfirst will stop at the first compare-fail but will need "continuation" (2 nested loops) whereas the mapreduce one is really simple but relies on WaW hazards to be eliminated
we now have "sv.maxs." which should do nicely to get the maximum *and* notify (Rc=1 vector) whether the max occurred (combined cmp function). see bug #1057 and https://libre-soc.org/openpower/sv/rfc/ls013/ if predicated with a prior scalar-vector cmp, although there would be two vector compares it would reduce the amount of interdependencies (except on a pre-sorted list) see bug #915 about combined cmp function
also from strncpy, load-post-increment 48 # load VL bytes (update r10 addr) 49 "sv.lbzup *16, 1(10)",
there is another technique which will work here: parallel-reduction schedule in vertical-first mode. the reason that vertical-first is needed is because in the middle of the reduction the index has to be updated: it is 2 bits of work not 1.
continuing https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/SNXgLyP8AgAJ
You can find the compiler options for SFFS PowerISA, in the Makefile: https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world;h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace
(In reply to Andrey Miroshnikov from comment #9) > You can find the compiler options for SFFS PowerISA, in the Makefile: > https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world; > h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace looks like this is exactly what shriya needs, she asked me in a private email (conversation moved here) how to compile a C version of MAXLOC for SFFS. so, basically, copy the hello_world directory, modify the C source to have the MAXLOC algorithm, and then run the Makefile, the Makefile already has all the -mno-vsx and similar options needed.
(In reply to Jacob Lifshay from comment #10) > (In reply to Andrey Miroshnikov from comment #9) > > You can find the compiler options for SFFS PowerISA, in the Makefile: > > https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world; > > h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace total overkill andrey. shriya will not know a fraction of these. > looks like this is exactly what shriya needs, she asked me in a private > email (conversation moved here) how to compile a C version of MAXLOC for > SFFS. > > so, basically, copy the hello_world directory, too complex. > modify the C source to have > the MAXLOC algorithm, and then run the Makefile, the Makefile already has > all the -mno-vsx and similar options needed. waaay too complex. like the knuth examples. standalone command. no Makefile. gcc -Os -mno-vsx -mno-altivec maxloc.c jacob can you make sure i got that right. bare minimum options. no warnings no "pick this standard", nothing. absolute bare minimum options.
(In reply to Luke Kenneth Casson Leighton from comment #11) > like the knuth examples. honestly those need Makefiles (and comments saying which one we use), since they have been mis-interpreted before when people were trying to figure out which #define is used (since there's like 10 options). We're using the last one, but they read an earlier one and started basing instruction proposals off it... https://bugs.libre-soc.org/show_bug.cgi?id=1192#c1 Also because I had to spend several minutes trying to figure out how to build them with the right #defines since I forgot. > standalone command. no Makefile. I think it should have a Makefile even if it's very simple, since that tells others how to build it and they don't have to type or copy/paste in the full command every time. the following should be sufficient (make sure indentation is tabs, not spaces, otherwise it won't work): .PHONY: all clean all: maxloc maxloc: maxloc.c gcc -Os -mno-vsx -mno-altivec maxloc.c -o maxloc clean: rm -rf maxloc > > gcc -Os -mno-vsx -mno-altivec maxloc.c > > jacob can you make sure i got that right. that looks right, assuming you want a binary named a.out.
(In reply to Jacob Lifshay from comment #12) > (In reply to Luke Kenneth Casson Leighton from comment #11) > > like the knuth examples. > > honestly those need Makefiles (and comments saying which one we use), since > they have been mis-interpreted before when people were trying to figure out > which #define is used (since there's like 10 options). We're using the last > one, but they read an earlier one and started basing instruction proposals > off it... > https://bugs.libre-soc.org/show_bug.cgi?id=1192#c1 > > Also because I had to spend several minutes trying to figure out how to > build them with the right #defines since I forgot. > > > standalone command. no Makefile. > > I think it should have a Makefile even if it's very simple the command is going in the tutorial, but yes good idea. > .PHONY: all clean > > all: maxloc too complex. too many lines and targets. this is intended as a REAL drastic simple tutorial. shriya only use these two lines in a file named "Makefile". as jacob says make sure the indent is a tab not spaces. maxloc: maxloc.c gcc -Os -mno-vsx -mno-altivec maxloc.c -o maxloc > > gcc -Os -mno-vsx -mno-altivec maxloc.c > > > > jacob can you make sure i got that right. > > that looks right, assuming you want a binary named a.out. yes. short. simple.
Alright. Thanks
(In reply to Jacob Lifshay from comment #10) > (In reply to Andrey Miroshnikov from comment #9) > > You can find the compiler options for SFFS PowerISA, in the Makefile: > > https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world; > > h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace > > looks like this is exactly what shriya needs, she asked me in a private > email (conversation moved here) how to compile a C version of MAXLOC for > SFFS. > > so, basically, copy the hello_world directory, modify the C source to have > the MAXLOC algorithm, and then run the Makefile, the Makefile already has > all the -mno-vsx and similar options needed. Okay, Thanks Jacob.
(In reply to shriya.sharma from comment #15) > Okay, Thanks Jacob. (shriya, reminder: please trim all excess context. we can all see and follow and read the entire thread: repetitions of exactly what is in the comment above is just noise. i have edited comment #15 to demonstrate what you should do, *and* have trimmed excess context above as well. can you please read this and absorb it, but pay special attention to the "trim" section https://people.kernel.org/tglx/notes-about-netiquette ). now, please re-read comment #11, i appreciate jacob is helping but he (sorry referring to you 3rd peron here jacob) tends to make things much more complex - and robust - which would be extremely good *if* that were necessary (which it often will be in many cases throughout the project) but not in this case. this is a tutorial/demo and the rules for a tutorial/demo are to keep it STRICTLY to the bare minimum in order to not have a high noise-to-signal ratio. * we want to demonstrate the algorithm in SVP64 assembler vs c and python. * the algorithm itself is only 12 lines long. * the Makefile that jacob wrote is *longer than the algorithm*! * the one that Andrey referred to is even longer and more complex. we are not demonstrating "how to write perfect Makefiles"!
link to the standalone c program for FORTRAN MAXLOC: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=852dde97c78508d0ebf6f30a9ceed7b1bc8dc9f1
(In reply to shriya.sharma from comment #17) > link to the standalone c program for FORTRAN MAXLOC: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=514a48499f05e1cbf444d35f1ab7f7fd863c5e9f Forgot to remove unused include. We were getting a random error where the output was: (ls2)am@tpam430:~/src/openpower-isa/maxloc$ ./maxloc Index of the maximum value in an array is: 8 Instead of: (ls2)am@tpam430:~/src/openpower-isa/maxloc$ ./maxloc Index of the maximum value in an array is: 6 (Which is the correct result) I only managed to get the incorrect result of 8 once; subsequent clean and compilation fixed the issue.
i got rid of a *lot* of unnecessary crap. to repeat, and please listen: this is intended as a *short* demonstration. const: crap. limits.h: crap. long "perfect" Makefile: i *already made it clear* to keep it to the bare minimum. i really, *really* do not like being ignored, it is very disrespectful.
Maxloc description: https://www.cenapad.unicamp.br/parque/manuais/Xlf/lr318.HTM#:~:text=Locates%20the%20first%20element%20of,element%20using%20a%20positive%20integer.
i noticed that changing the initial value of m to 0 instead of INT_MIN is incorrect, since the algorithm uses signed integers. m starting at 0 (or actually any int at all) is incorrect since the algorithm ignores all array elements <= m, thereby giving the wrong answer for input [-5, -7, -6] or even [-1, 0, -2] -- both of those return -1 even though the correct answer is 0 and 1 respectively. https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=maxloc/maxloc.c;h=6becf3ef26fb8f71e50dbd0bcb774c6d49d307c2;hb=f9d2e36493632a179e08f5eb3569b1f9f05d05d1
(In reply to Jacob Lifshay from comment #21) > i noticed that changing the initial value of m to 0 instead of INT_MIN is > incorrect, since the algorithm uses signed integers. m starting at 0 (or > actually any int at all) a correct implementation starts with: if(n == 0) return -1; m = a[0]; nm = 0; if you assume the input isn't empty, you can remove the if(n == 0) return -1;
(In reply to Jacob Lifshay from comment #21) > i noticed that changing the initial value of m to 0 instead of INT_MIN is > incorrect, since the algorithm uses signed integers. not in the least bit bothered. this is a demo not a hardened library implementation. removing INT_MIN which should not have been added reduces the complexity (In reply to Jacob Lifshay from comment #22) > a correct implementation starts with: again not relevant in the least bit: it changes the scope which is not up for negotiation. i repeat again this a *demo*. please listen. there is no point in you providing "enhancements" as all they do is increase the number of lines which reduces readability FOR A DEMO. please LISTEN when i state very clearly what the scope is. if you are tempted to make suggestions that increase the number of lines *please refrain from writing anything at all* as all it does is aggravate. if on the other hand you find bugs or wish to help move this forward then you are most welcome.
intel example paper https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
(In reply to Luke Kenneth Casson Leighton from comment #2) in-register version first prepare offsets setvl vl=8 sv.svstep *offs, srcstep # 0 1 2 ... vl-1 and nm-potential li nm, -1 li nmbase, 0 li idx, 0 # start at first > while (i<n) { > // skip up to first max > while (i<n && a[i]<=m) i++; get max comparison, overlap-loop into m where reg a=m+1, and terminate at the first sv.max./ff=gt/vli *m, *a, idx now the search will terminate at the first failure, truncating VL to point at it. next we must discard the numbers already analysed: sv.addi/sm=lt *m, *m, 0 next, these two both use mapreduce mode with predication. > // continue as long as picking new m > while (i<n && a[i]>m) { use the same max-trick, terminating this time at the greatest first number sv.max./ff=lt/vli *m, *a, idx > m = a[i]; > nm = i; add-overwriting base with vector-offset into nm, yes *scalar* nm, the "last winner" is the largest index. reverse-gear would make this easier (pick the first success) sv.add/mr/m=gt nm, nmbase, *offs > i++; > } > }
(In reply to Luke Kenneth Casson Leighton from comment #25) > use the same max-trick, terminating this time at the greatest first number > > sv.max./ff=lt *m, *a, idx arse, i just realised, this needs to be: sv.max./ff=lt idx, *a, idx and, jacob, just like what you spotted earlier, with the other algorithm, this will terminate at the first write into RT. bug #1183 https://bugs.libre-soc.org/show_bug.cgi?id=1183#c3
first unit test added which demonstrates new "single" loop capability on dd-ffirst https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b077590
first cut/paste of maxloc test from ddffirst one https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=465b96f13b6
https://www.intel.com/content/www/us/en/developer/articles/technical/the-maxloc-reduction-in-oneapi.html https://www.intel.com/content/dam/developer/articles/technical/the-maxloc-reduction-in-oneapi/the-maxloc-reduction-in-oneap-figure-4.png these references need adding, showing how parallel reduction is one standard technique. involves recursion.
maxloc has been merged to master, but whoever merged it forgot to check that it didn't break CI: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226784 =========================== short test summary info ============================ [case_divmod_knuth_algorithm_d_512x256_to_256x256] (d='0x1_0000_0000', n='0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_0000_0000', q='0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff', r='0x0') SUBFAIL src/openpower/decoder/isa/test_aaa_caller_svp64_powmod.py::TestPowMod15::test - TypeError: '<=' not supported between instances of 'NoneType' and 'int' <another 9 instances of basically the same thing> !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 10 failures !!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 10 failures !!!!!!!!!!!! = 10 failed, 87 passed, 1 skipped, 2 xfailed, 170 warnings in 445.69s (0:07:25)
# low 2 LSBs (CR field selector) remain same, CR num extended > assert regnum <= 7, "sigh, TODO, 128 CR fields" E TypeError: '<=' not supported between instances of 'NoneType' and 'int' src/openpower/decoder/isa/caller.py:1666: TypeError
(In reply to Jacob Lifshay from comment #30) > maxloc has been merged to master, but whoever merged it forgot to check that > it didn't break CI: yes i couldn't. complicated. i needed work from another branch. rebase screwed up. a *lot*.
(In reply to Luke Kenneth Casson Leighton from comment #32) > (In reply to Jacob Lifshay from comment #30) > > maxloc has been merged to master, but whoever merged it forgot to check that > > it didn't break CI: > > yes i couldn't. yes you can, i'm not sure if you realized it but CI tests all branches and tags, so just push the changes to a branch and check it once it runs. e.g. CI on the maxloc_676 branch with essentially the same errors: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226783
(In reply to Luke Kenneth Casson Leighton from comment #32) > i needed work from another branch. rebase one branch on the other, no need to mess with master. > rebase screwed up. a *lot*. I basically always run git log before pushing, it helps catch stuff like that.
(In reply to Jacob Lifshay from comment #34) > rebase one branch on the other, no need to mess with master. not when there are inter-related developments and conflicts in all three. > > rebase screwed up. a *lot*. > > I basically always run git log before pushing, it helps catch stuff like > that. long story. half a dozen conflicts spread across both branches.
(In reply to Jacob Lifshay from comment #30) > https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226784 An error occurred while fetching the job. i'll have a look tomorrow
(In reply to Luke Kenneth Casson Leighton from comment #36) > i'll have a look tomorrow sorted a lot of them - running on my laptop drained 80% of the inverter/battery as it was *over two hours* so i will not be running the powmod tests. jacob when adding unit tests you really do need to be mindful that i am in severely resource-limited circumstances? (no, remote-server-login is not "the solution")
hooray! used new crternlogi to reduce two cr field instructions down to one. https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0ec8094
bit of a cleanup, but basically this is all good now. 11 instructions. not bad. https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=1f5b60672
starting to update page, added assembler https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=986e7d0
done writeup of maxloc. pretty efficient inner loop, 11 instructions. outer loop would be another... 4? https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0bd5a7b00ec
found a way to reduce the number of instructions. down to 10. https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=7f45d3