a demo implementation of a modular exponentiation function is needed
as a unit test to be run by pypowersim and/or ISACaller. primary uses
are for Diffie-Hellman and RSA but there are many more:
https://github.com/TOPDapp/py-diffie-hellman/blob/master/diffiehellman/diffie_hellman.py

* DONE: raise bignum REMAP bugreport bug #1155 and link here, move the
  following into the bugreport:
  - cross-reference to the svshape table "12 REMAP modes" so i can begin
    advising where to put it
    https://libre-soc.org/openpower/sv/remap/#svshape
    https://bugs.libre-soc.org/show_bug.cgi?id=1155
* DONE: extremely simple version as test_caller_svp64_pow.py
* DONE: merge branch into master.
* TODO: bug #1183 comment #2, the assembler is non-optimal (and the
  whole point of the grant exercise is to identify opportunities *to*
  optimise, not just blithely sit back and go "oh well this sub-optimal
  bunch of ops will do, ho hum" :) but we *identified* an opportunity,
  so it must be documented at the very least, as a new bugreport

see comment #78 for the test results.
this is already partially completed by my work on bigint-presentation-code (bug #942), which is currently blocked on compiler support for svp64 since luke correctly observed that the diy compiler backend in there is a major distraction. the code can be easily converted to emitting llvm ir or similar once svp64 is added to llvm. bug #942 needs to have the subtasks of this bug split out.
rsa uses a non-prime modulus, changing title
(In reply to Jacob Lifshay from comment #2)
> rsa uses a non-prime modulus, changing title

ah good catch i totally missed that.

need to find algorithms now: shorter, more compact, and fitting into
as few instructions as possible. as long as completion time is not
overly long, that is fine.
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > rsa uses a non-prime modulus, changing title
>
> ah good catch i totally missed that.
>
> need to find algorithms now: shorter, more compact, and fitting into
> as few instructions as possible. as long as completion time is not
> overly long, that is fine.

how long is overly long? the reason i'm using fast mul algorithms is
because they are closer to O(n^(1+ε)) rather than O(n^2): e.g. even
toom-2 (karatsuba) multiplication is O(n^1.5...), so a 4096-bit mul is
very approximately 512 64-bit mul ops, whereas naive O(n^2) is about
4096 64-bit mul ops. toom-3 and higher are even less than that.

if we use an algorithm as our demo that is known to be >4x slower,
even if our cpu has a 256-bit simd mul unit, it will take
approximately 4x as much power for the same speed, or 4x as long for
the same power without the simd unit, so everyone will see our
proposal as some kind of bad joke, or think we're just noobs who don't
know what they're doing.

the approximately 4000 instruction version of fast mul is what you get
when you recursively inline the functions; if we instead leave them as
uninlined recursive functions the amount of code used is substantially
smaller.

in any case we'll still want montgomery multiplication (which is a
relatively simple algorithm that we could just code in c to call our
bigint mul algorithm), which transforms modular exponentiation from
needing an 8192-bit bigint divmod at each step to only needing it once
at the end, saving a substantial amount of time.
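since montgomery multiplication is central to that plan, a minimal
Python sketch of the idea may help (illustrative only -- the function
names and pure-bigint style here are my own, not the repository's
code): the REDC step folds the reduction into mul/add/shift, so only
the final conversion out of the Montgomery domain touches a real
modular reduction.

```python
def montgomery_mul(a_m, b_m, n, r_bits, n_prime):
    """One Montgomery multiply: given a*R mod n and b*R mod n,
    returns a*b*R mod n using only mul/add/shift -- no divmod."""
    r_mask = (1 << r_bits) - 1
    t = a_m * b_m
    m = ((t & r_mask) * n_prime) & r_mask  # m = t * (-n^-1) mod R
    u = (t + m * n) >> r_bits              # t + m*n is divisible by R
    return u - n if u >= n else u

def powmod_montgomery(base, exp, n, r_bits):
    """Modular exponentiation needing only one final domain exit."""
    assert n & 1, "Montgomery form needs an odd modulus"
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r
    x_m = (base << r_bits) % n     # convert into Montgomery domain
    acc_m = r % n                  # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc_m = montgomery_mul(acc_m, x_m, n, r_bits, n_prime)
        x_m = montgomery_mul(x_m, x_m, n, r_bits, n_prime)
        exp >>= 1
    return montgomery_mul(acc_m, 1, n, r_bits, n_prime)  # leave domain
```

note R must be a power of two larger than the (odd) modulus; in the
real 4096-bit RSA case R would be 2^4096 and the limbs 64-bit.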
(In reply to Jacob Lifshay from comment #4) > how long is overly long? greater than fits into 3-4 cache lines, absolute maximum. the goal here is to eliminate TLB and L1 cache lookups. this is for an embedded scenario not a supercomputing scenario.
(In reply to Luke Kenneth Casson Leighton from comment #5) > (In reply to Jacob Lifshay from comment #4) > > > how long is overly long? > > greater than fits into 3-4 cache lines, absolute maximum. > the goal here is to eliminate TLB and L1 cache lookups. I assume you mean misses rather than lookups. > this is for an embedded scenario not a supercomputing > scenario. ok, using O(n^2) multiplication could work since embedded cares less about speed and we can assume rsa is infrequently run so power doesn't really matter. I also already have code for O(n^2) mul in bigint-presentation-code.
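for reference, the O(n^2) schoolbook multiply being agreed on here
looks like this as a Python sketch over 64-bit limbs (my own
illustration, not the bigint-presentation-code version): each partial
product is a 64x64->128 multiply-and-accumulate, i.e. exactly the
shape a maddedu-style operation provides.

```python
MASK64 = (1 << 64) - 1

def mul_n2(a, b):
    """Schoolbook O(n^2) multiply: a, b are little-endian lists of
    64-bit limbs; returns len(a)+len(b) result limbs."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # 64x64->128 partial product, accumulated with carry
            t = out[i + j] + ai * bj + carry
            out[i + j] = t & MASK64
            carry = t >> 64
        out[i + len(b)] = carry  # this limb is still zero at pass i
    return out
```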
*** Bug 1153 has been marked as a duplicate of this bug. ***
jacob i am assigning this to you, the REMAP bignum should be set up
as a sub-bug.

shriya apologies i have to reassign this as the cryptoprimitives grant
is under pressure of ending, and you are finishing your dissertation.
(In reply to Luke Kenneth Casson Leighton from comment #8) > jacob i am assigning this to you, the REMAP bignum should > be set up as a sub-bug > > shriya apologies i have to reassign this as the cryptoprimitives > grant is under pressure of ending, and you are finishing your > dissertation. Nw. Thankyou.
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e044403f4c243b3248df0872a661756b6cc8a984

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:24:12 2023 -0700

    add SVP64 256x256->512-bit multiply

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ec1196cb3abedd28492be29546e52959dee1a030

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:22:33 2023 -0700

    generalize assemble() fn so other test cases can easily use it
(In reply to Jacob Lifshay from comment #10)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=e044403f4c243b3248df0872a661756b6cc8a984
>
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:24:12 2023 -0700
>
>     add SVP64 256x256->512-bit multiply
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=ec1196cb3abedd28492be29546e52959dee1a030

all of these can go in a vertical-first loop, using predication
0b1000100010001000 and sv.addi/pm=nn a,b,c to "pretend" there is a
pair of nested loops.

    30  "sv.adde *7, *7, *21",
    31  "addi 24, 0, 0",
    32  "sv.maddedu *20, *12, 19, 24",  # final partial-product a * b[3]
    33  "addc 7, 7, 20",
    34  "sv.adde *8, *8, *21",

i.e. sv.addi with scalars does exactly the same thing as the
non-prefixed equivalent except you get predication and elwidth
overrides. the extra sv.adde can add a zero'd-out source, making the
first iteration orthogonal with all other iterations, and consequently
the VF loop can be applied. and you can use Indexed REMAP on the
sv.maddedu and sv.adde. the only pain being, you have to
enable/disable svremap before each, so i suggest using
"non-persistent" svremap.

also same as Andrey, all files require a copyright notice, you know
this: https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13

> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:22:33 2023 -0700
>
>     generalize assemble() fn so other test cases can easily use it

like it.
sigh, would be great to investigate this in a future grant:
https://digital.library.adelaide.edu.au/dspace/bitstream/2440/101502/2/02whole.pdf
I started changing the register assignments to conform to the ABI
regarding volatile/nonvolatile registers, since I want the modular
exponentiation algorithm to be able to store temporaries in registers
while calling the multiply algorithm. however, I discovered a scalar
EXTRA2 encoding bug:
https://bugs.libre-soc.org/show_bug.cgi?id=1161
I spent the rest of the day trying to figure that bug out.

(In reply to Luke Kenneth Casson Leighton from comment #11)
> all of these can go in a vertical-first loop, using
> predication 0b1000100010001000 and sv.addi/pm=nn a,b,c
> to "pretend" there is a pair of nested loops.

I'll work on figuring that out and implementing it when I start
working on the REMAP bug. other than changing the registers used, the
current multiply algorithm is imho good enough for the purposes of
this bug.

> also same as Andrey, all files require a copyright notice, you
> know this https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13

added.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=ca22d9687c91e764bb463ee776fb2f6efbf6eae9

commit ca22d9687c91e764bb463ee776fb2f6efbf6eae9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:07:18 2023 -0700

    change registers used to avoid r13-31 which are reserved/nonvolatile

    broken -- blocked on https://bugs.libre-soc.org/show_bug.cgi?id=1161

commit 2bfd298a3844af197d265b74c38094a5edc397b1
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:06:59 2023 -0700

    pass in stack pointer

commit e329df6cbcda7752ab68db8c2713bbc795aa03ef
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:04:55 2023 -0700

    log more register read/writes to LogKind.InstrInOuts

commit 4996bc893db92aa80cefb8cb603739b8c5b9b859
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 17:28:42 2023 -0700

    add copyright stuff [skip ci]
I started adding the divmod algorithm: "extremely slow and simplistic
shift and subtract algorithm". I completed the algorithm, but there
are some bugs I haven't fixed yet.

WIP (edit: add link)
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=8444282d4c29bbeb19e84612c5f075d078d27f3b

commit 8444282d4c29bbeb19e84612c5f075d078d27f3b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:41:58 2023 -0700

    working on adding divmod 512x256 to 256x256

commit 27a6ab7218e0347107cef53ab2badf21bcd14572
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:39:31 2023 -0700

    log writing CA[32]/OV[32] for OP_ADD
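the shift-and-subtract approach, sketched in plain Python for
orientation (the repository version works on 64-bit limb vectors; this
collapses it to Python ints to show just the control flow -- one
compare/subtract per dividend bit, which is where the slowness comes
from):

```python
def divmod_shift_sub(n, d, n_bits=512):
    """Restoring shift-and-subtract division: for each bit of the
    dividend, shift the partial remainder left by one, bring in the
    next dividend bit, and subtract the divisor if it fits."""
    assert d != 0
    q, r = 0, 0
    for i in reversed(range(n_bits)):
        r = (r << 1) | ((n >> i) & 1)
        q <<= 1
        if r >= d:
            r -= d
            q |= 1
    return q, r
```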
(In reply to Luke Kenneth Casson Leighton from bug #1175 comment #2)
> (In reply to Jacob Lifshay from bug #1175 comment #0)
> > since I want to use [mcrxrx].
>
> ahh what for? can you put something about it on the mailing list
> and as usual crossref to bugreport(s)

I'm using mcrxrx to check if an unsigned bigint subtraction has a
result that's less than zero (aka. it underflowed) .. the sv.subfe
leaves XER.CA set if the result is >= 0, otherwise it's cleared.
mcrxrx moves CA to a CR field, so I can branch on it. basically, it's
a combined subtraction-comparison all-in-one.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l112

see the corresponding lines in the python algorithm:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l138
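a Python model of what that CA-based check computes (a sketch of the
subfe carry chain, not the powmod.py code): subfe computes RT = ~RA +
RB + CA per limb, so a chain of them computes rb - ra, and the final
carry-out is 1 exactly when no borrow occurred -- the condition mcrxrx
then exposes to a CR field.

```python
MASK64 = (1 << 64) - 1

def subfe_chain(ra, rb):
    """Limbwise rb - ra via RT = ~RA + RB + CA, like sv.subfe over
    little-endian 64-bit limb lists. Returns (result_limbs, ca):
    ca == 1 iff rb >= ra, i.e. the subtraction did not underflow."""
    ca = 1  # CA starts set, as for a two's-complement subtract
    out = []
    for a, b in zip(ra, rb):
        t = ((~a) & MASK64) + b + ca
        out.append(t & MASK64)
        ca = t >> 64  # carry-out feeds the next subfe in the chain
    return out, ca
```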
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=22e9003 jacob what i'm doing here is a mini-code-morphed-demo investigating the viability of using bigmul REMAP. i did the exact same thing with test_caller_svp64_chacha20.py, am helping sadoon on bug #1157 / bug #1159 and ed25519 will need the same thing: looking for the patterns and condensing them down so that bigmul REMAP (preferably without needing Vertical-First) can just go *BLAM*, one instruction, the whole lot. i don't think it's going to be possible to do without Vertical-First but one "trick" of Vertical-First is that there is "end of inner loop" notification when using "svstep." by checking the *correct bit* of CR0 you can do branch-conditional code that, say, performs that "extra add-with-carry" this will be quite fascinating to see if that trick works.
(In reply to Jacob Lifshay from comment #15) > basically, it's a combined subtraction-comparison all-in-one. love it. this is the key work: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=85abafd4 this is what we are aiming for - first with svindex and then using that to drive the development of bigmul REMAP and then come back and replace Indexed REMAP. loop-unrolling in SVP64 assembler is a "no-no". if it's "needed" then it means something's seriously wrong and we need to activate "redesign SVP64" procedures to make sure that the loop-unrolling is definitely, definitely *not* needed.
got rid of the addc - use adde but set CA to zero going in.
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d612cd562f

inside the loop is now:

    li t[4], 0
    sv.maddedu
    XER.CA=0
    sv.adde

if done as vertical-first it is possible to detect when the inner loop
completes, and do branch-conditional CA=0 / sv.adde
(In reply to Luke Kenneth Casson Leighton from comment #17) > this is the key work: > https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=85abafd4 > > this is what we are aiming for - first with svindex and then using > that to drive the development of bigmul REMAP and then come back and > replace Indexed REMAP. that's all part of bug #1155 which has a separate budget that you should be using instead of this one...for this bug we need bare-minimum working mul, all improvements toward the goal of mul REMAP should be under bug #1155
(In reply to Jacob Lifshay from comment #19) > that's all part of bug #1155 which has a separate budget that you should be > using instead of this one...for this bug we need bare-minimum working mul, > all improvements toward the goal of mul REMAP should be under bug #1155 nvm, bug #1155's budget is too small for that.
(In reply to Jacob Lifshay from comment #19) > using instead of this one...for this bug we need bare-minimum working mul, with Indexed. loop-unrolling is prohibited.
I got divmod working!

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=78421a7d10dc21670c4f8359a4f005649470305f

commit 78421a7d10dc21670c4f8359a4f005649470305f
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:51:35 2023 -0700

    fix divmod

commit 33db8d3430b5608fc56473668a09746d04d941f9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:50:47 2023 -0700

    in divmod algorithm log regexes that match against expected
    register values

commit 018562039235cc223464c6e6754e0b85b2ab5e7f
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:48:50 2023 -0700

    test python_divmod_algorithm

commit 6a0680af7318c73b9da2d7cc4995ad1792d1f3ed
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:45:34 2023 -0700

    format code

commit 571acee1aa6859b708b9568538327e8e16b5f891
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:25:57 2023 -0700

    log asmop to LogKind.InstrInOuts too, since only printing
    `.long 0xFOOBAR` isn't very useful
(In reply to Jacob Lifshay from comment #22) > I got divmod working! oleeee! :)
I wrote the powmod assembly, and got a python equivalent working.
however, because divmod is really slow, running one powmod invocation
is a few hundred thousand simulated instructions (powmod calls divmod
256-512 times, divmod loops 256 times, each inner loop is ~20 ops).

Therefore, I think I should implement Knuth's algorithm D instead of
the shift & subtract divmod algorithm we currently have, since I
expect it to run 20-50x faster.

I pushed to the programmerjake/powmod branch.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=f8467f43e8e04f84280ac5586f92222916df0676

commit f8467f43e8e04f84280ac5586f92222916df0676
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Oct 5 19:57:29 2023 -0700

    add WIP powmod_256 -- asm test is currently disabled since divmod
    is too slow

commit f56b16a029c1e41b323b99f9514e04a42bba9c3a
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Oct 5 19:53:45 2023 -0700

    fix assemble to properly look for whole symbols to replace

    previously if there were labels foo and foobar, it would partially
    replace foobar giving 0x<addr>bar, which is wrong.

    also optimized to use dict instead of linear search for label names
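to make those call counts concrete: the powmod structure is plain
left-to-right square-and-multiply, one bigint mul plus one divmod
reduction per step, so a 256-bit exponent costs 256 squarings plus one
extra multiply per set exponent bit, i.e. 256-512 divmod calls. a
sketch (my own, not the assembly version) with an explicit counter:

```python
def powmod_sketch(base, exp, mod, exp_bits=256):
    """Left-to-right square-and-multiply; counts divmod reductions."""
    result = 1
    divmod_calls = 0
    for i in reversed(range(exp_bits)):
        result = (result * result) % mod    # one divmod per bit
        divmod_calls += 1
        if (exp >> i) & 1:
            result = (result * base) % mod  # one more per set bit
            divmod_calls += 1
    return result, divmod_calls
```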
(In reply to Jacob Lifshay from comment #24) > I wrote the powmod assembly, and got a python equivalent working, however, > because divmod is really slow, running one powmod invocation is a few > hundred thousand simulated instructions (powmod calls divmod 256-512 times, > divmod loops 256 times, each inner loop is ~20 ops). actually, picking the average number of divmod calls of 384, it comes out to ~2 million operations! 384 * 256 * 20 = 1966080
(In reply to Jacob Lifshay from comment #25) > actually, picking the average number of divmod calls of 384, it comes out to > ~2 million operations! > 384 * 256 * 20 = 1966080 for a sense of scale, i'd guess that if we were to use all the optimization techniques and fastest algorithms (montgomery multiplication, maybe karatsuba), it could be done in under ~20k operations.
(In reply to Jacob Lifshay from comment #24)
> I wrote the powmod assembly, and got a python equivalent working

hooraay! that's really good. can estimate the number of assembler
instructions now.

> however,
> because divmod is really slow, running one powmod invocation is a few
> hundred thousand simulated instructions (powmod calls divmod 256-512 times,
> divmod loops 256 times, each inner loop is ~20 ops).

but there is at least a first version, unit tests can be done.

> Therefore, I think I should implement Knuth's algorithm D instead of the
> shift & subtract divmod algorithm we currently have, since I expect it to
> run 20-50x faster.

yes i concur. nice thing is, that is a fixed (small) task.

can i ask you a favour: make it "adaptable", i.e. parameteriseable as
to what input, output and temp regs are used, as well as their
dimensions (defaulting to the ones you are using)? or, if that is a
lot of work, just specify the size (bitlength) of X and Y?

the reason i ask is that it becomes possible to crank down the sizes
to run in sane times. we do *not* want people running tests to find
that it is 300 minutes to completion. when/if we put this in front of
a customer for a demo (which is actually going to happen) it is ok to
say "this is a 128-bit powmod, it completes quick but you get the
idea".

192 bits is.... 3 64-bit GPRs, which is more than enough to show
carry-rollover. am tempted to suggest 128 but it is only 2 GPRs; i
feel happier with 3 (like in the bigint shift/mul/div tests i wrote).

> I pushed to the programmerjake/powmod branch.

irony: this is leaf-node work so a branch is unnecessary, unless you
prefer it.
btw don't chuck away DIVMOD_512x256_TO_256x256_ASM because it is useful for illustrative purposes. i will be doing a customer demo 11th november.
(In reply to Luke Kenneth Casson Leighton from comment #28) > btw don't chuck away DIVMOD_512x256_TO_256x256_ASM > because it is useful for illustrative purposes. > i will be doing a customer demo 11th november. ok, i'll just rename it then, probably to DIVMOD_SHIFT_SUB_512x256_TO_256x256_ASM
(In reply to Luke Kenneth Casson Leighton from comment #27) > > however, > > because divmod is really slow, running one powmod invocation is a few > > hundred thousand simulated instructions (powmod calls divmod 256-512 times, > > divmod loops 256 times, each inner loop is ~20 ops). > > but there is at least a first version, unit tests can be done. not really, extrapolating from what i recall of divmod's timing, that's like 30hr for one call of the powmod algorithm on one of the fastest modern cpus.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> > Therefore, I think I should implement Knuth's algorithm D instead of the
> > shift & subtract divmod algorithm we currently have, since I expect it to
> > run 20-50x faster.
>
> yes i concur. nice thing is, that is a fixed (small) task.

sounds good!

> can i ask you a favour: make it "adaptable" i.e. parameteriseable
> as to what input output and temp regs are used, as well as their
> dimensions? (default to the ones you are using)?

afaict that needs a register allocator, which is the whole rabbit hole
i went down for several months.

> or, if that is a lot of work, just specify the size (bitlength)
> of X and Y?
>
> the reason i ask is that it becomes possible to crank down the
> sizes to run in sane times.

the crazy times are because shift-sub-based divmod is O(num_bits) and
powmod is O(divmod_time * num_bits), which is 256^2 == 65536 -- huge.
knuth's algo-D is O(num_words^2), which is 4^2 == 16; combined with
powmod that gives 256 * 4^2 == 4096, which is doable.

> we do *not* want people running tests to find that it is 300 minutes
> to completion,

with knuth's algo-D, i expect each 256-bit divmod to be <100
operations at runtime, so powmod should be testable with a few
minutes' runtime.

> when/if we put this in front of a customer for a demo
> (which is actually going to happen) it is ok to say "this is a 128 bit
> powmod, it completes quick but you get the idea".
>
> 192 bits is.... 3 64-bit GPRs which is more than enough to show
> carry-rollover. am tempted to suggest 128 but it is only 2 GPRs,
> i feel happier with 3 (like in the bigint shift/mul/div tests i wrote)

we really need > 128-bit, since algo-D has some special cases that
aren't ever hit with just two words, and maybe not with three.
256-bit isn't very big.
started implementing Knuth's Algorithm D:
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=3094512b39f8e9e0e798b483a15f21fc152eb3de

commit 3094512b39f8e9e0e798b483a15f21fc152eb3de
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Oct 6 17:08:35 2023 -0700

    add WIP knuth algorithm D python implementation

commit 4dbfd28bf6d4779f6e7582693f71ad276de12251
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Oct 6 15:21:07 2023 -0700

    rename divmod algorithm -> divmod_shift_sub in prep for adding
    divmod based on Knuth's Algorithm D
(In reply to Jacob Lifshay from comment #32)

> add WIP knuth algorithm D python implementation

nice. clear.

    362   if qhat * vn[n - 2] > (rhat << word_size) + un[j + n - 2]:

can use 1<<r3 on that, where r3=j, temporarily, rather than Indexed
REMAP. but really, Indexed REMAP is a necessity here.

    380   # subfe
    381   t += ~product[i] + un[j + i]

just sv.subfe! but you need Indexed REMAP on un[j+i], and can replace
it later with Bigmul (rhombic) REMAP.

it is quite ridiculous how small this algorithm will end up being,
especially as the overflow condition here is exactly the value that
divmod2du was pre-designed to give:

    352   if un[j + n] >= vn[n - 1]:
    353       # division overflows word

ultimately though this *is* going to need Vertical-First, with the
"svstep." notification, as the loops j and i are nested.
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=728b09c7f2785d0405a0cf5bc9c91460cd268f3b

commit 728b09c7f2785d0405a0cf5bc9c91460cd268f3b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:46:57 2023 -0700

    finish writing python_divmod_knuth_algorithm_d

commit 30915c1feeb515be74e62cf8b6f93ab7f5fb43aa
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:46:14 2023 -0700

    fix generating invalid divmod tests

commit f2fb0dd37712411459b540edd97005423e3f7162
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:43:00 2023 -0700

    speed up divmod shift-sub tests by removing most test cases
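for orientation, a compact Python sketch of Algorithm D (TAOCP vol. 2,
section 4.3.1) -- this is my own illustration, not the repository
implementation: it uses 32-bit limbs and takes a shortcut of doing the
D1 normalisation step with Python ints, keeping the qhat-estimation
and multiply-subtract loops at limb level since those are the parts
that map onto divmod2du / maddedu / subfe.

```python
B = 1 << 32  # limb base (the real code uses 2^64 limbs)

def divmod_algo_d(u, v, m, n):
    """Divide an (m+n)-limb u by an n-limb v (n >= 2, top limb of v
    nonzero), returning (quotient, remainder) as Python ints."""
    assert n >= 2 and (v >> (32 * (n - 1))) != 0
    assert u < (1 << (32 * (m + n)))
    s = 0                                   # D1: normalize so the top
    while not ((v << s) >> (32 * n - 1)) & 1:  # bit of v is set
        s += 1
    un = [((u << s) >> (32 * i)) & (B - 1) for i in range(m + n + 1)]
    vn = [((v << s) >> (32 * i)) & (B - 1) for i in range(n)]
    q = [0] * (m + 1)
    for j in range(m, -1, -1):
        # D3: estimate qhat from the top two remainder limbs
        num = (un[j + n] << 32) | un[j + n - 1]
        qhat, rhat = divmod(num, vn[n - 1])
        while qhat >= B or qhat * vn[n - 2] > ((rhat << 32) | un[j + n - 2]):
            qhat -= 1
            rhat += vn[n - 1]
            if rhat >= B:
                break
        # D4: multiply and subtract qhat*v from the remainder window
        carry = borrow = 0
        for i in range(n):
            p = qhat * vn[i] + carry
            carry = p >> 32
            t = un[i + j] - (p & (B - 1)) - borrow
            un[i + j] = t & (B - 1)
            borrow = int(t < 0)
        t = un[j + n] - carry - borrow
        un[j + n] = t & (B - 1)
        if t < 0:  # D6: qhat was one too big (rare); add v back
            qhat -= 1
            carry = 0
            for i in range(n):
                t2 = un[i + j] + vn[i] + carry
                un[i + j] = t2 & (B - 1)
                carry = t2 >> 32
            un[j + n] = (un[j + n] + carry) & (B - 1)
        q[j] = qhat
    quot = sum(l << (32 * i) for i, l in enumerate(q))
    rem = sum(l << (32 * i) for i, l in enumerate(un[:n])) >> s
    return quot, rem
```

the D6 add-back path is the "necessary for correctness but almost
never executed" section mentioned earlier in the thread.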
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #32)
>
> > add WIP knuth algorithm D python implementation
>
> nice. clear.
>
> 362 if qhat * vn[n - 2] > (rhat << word_size) + un[j + n - 2]:
>
> can use 1<<r3 on that, where r3=j, temporarily, rather than Indexed
> REMAP. but really, Indexed REMAP is a necessity here.

actually, I think using svoffset is a much better idea. (we need a
dynamic instruction to set svoffset for dynamic bigint shifts anyway,
bug #973)

> 380 # subfe
> 381 t += ~product[i] + un[j + i]
>
> just sv.subfe!

yup!

> but you need Indexed REMAP on un[j+i]
> and can replace it later with Bigmul (rhombic) REMAP.
>
> ultimately though this *is* going to need Vertical-First,
> with the "svstep." notification, as the loops j and i are
> nested.

I disagree, this algorithm isn't very suitable for vertical-first,
because it has operations at both one and two loop-nesting levels,
e.g.:

    for j in ...:
        qhat = ...  # scalar here
        for i in ...:  # vector here
            mul/add/sub
        for i in ...:  # more vector here with different offsets
            mul/add/sub

this makes it not really work with rhombic REMAP.

another reason vertical-first isn't very suitable: several sections
are necessary for correctness but extremely rarely run (for random
inputs, on the order of 1 in 2^64), such as the qhat overflow and
add-back sections.

Therefore I'm planning on writing it to use horizontal-first with
svoffset.
(In reply to Jacob Lifshay from comment #35)
> I disagree,

that's because you're not familiar with SVP64.
https://bugs.libre-soc.org/show_bug.cgi?id=973#c5

> this algorithm isn't very suitable for vertical-first, because
> it has operations at both one and two loop-nesting levels

> this makes it not really work with rhombic REMAP.

one trick is to save SVSTATE in a temporary SPR then put it back and
carry on.

> Therefore I'm planning on writing it to use horizontal-first with svoffset.

don't do that please. it will just need a rewrite to use
vertical-first after you have spent the time doing HF.

the purpose of the exercise here is not "get it done". the purpose of
the entire grant is to be *efficient* in the ISA, to prove that it is
efficient, and if it is not then *modify* the ISA to *make* it
efficient.

if you have not listened to the primary designer of the ISA, who knows
all of the tricks that make algorithms efficient (thus meeting the
criteria of the grant), and have instead done something that is *not*
efficient, and therefore does not meet the criteria of the grant, then
how can i justify paying out the RFP?

this is as much a learning exercise for you, jacob. you are still very
unfamiliar with using REMAP. hell, everyone is, as it is literally a
new computer science concept.
ok a correction there. i *believe* that 1<<r3 and/or REMAP on a Scalar with VF mode would give us the level of efficiency. as this is your task it is down to you to attempt to use them. if they *don't work* then we can carry out a re-assessment, and design some new (future) SVP64 features. even if we run out of time to actually complete the algorithm the conditions of the grant have still been met *because we did the assessment*. but just going straight to something inefficient (such as the loop-unrolled mul256 algorithm you wrote, although it gets us one incremental step ahead), this is *not* satisfying the conditions of the grant.
(In reply to Luke Kenneth Casson Leighton from comment #36)
> the purpose of the exercise here is not "get it done".
> the purpose of the entire grant is to be *efficient* in the
> ISA, to prove that it is efficient, and if it is not then
> *modify* the ISA to *make* it efficient.

well, vertical-first is probably not beneficial because the algorithm
needs to run a bunch of operations at different rates (1-8 mul/sub for
every div, because the mul/sub are nested inside another loop and the
div is not). svp64 works well when all of the operations can be done
at the same rate, and works less well when the rates wildly differ,
because you need to jump through predication hoops to get it to work.
scalar powerisa instructions are quite efficient for the operations
that aren't uniform and aren't at the highest rate.

if you try to cram it all into a vertical-first loop, it may be
possible but won't be pretty, and probably won't even be faster than
scalar ops and horizontal-first loops, partially because of all the
complex setup code, and because standard OoO cpus are already able to
handle those scalar and hf-svp64 data flow patterns quite efficiently.

> if you have not listened to the primary designer of the ISA
> who knows all of the tricks that make algorithms efficient,

convoluted and squished into one vertical-first loop with lots of
remap isn't the same thing as efficiency; being vertical-first may
make it *less* efficient, even if we add new remap modes.

How about this: we write *both* a horizontal-first version (since I
find that straightforward and relatively obvious) and later attempt a
vertical-first version? we can then compare them and see whether all
the extra complexity is beneficial.
(In reply to Luke Kenneth Casson Leighton from comment #37) > but just going straight to something inefficient (such as the > loop-unrolled mul256 algorithm you wrote, although it gets us > one incremental step ahead), this is *not* satisfying the conditions > of the grant. which definition of efficiency are you using?
(In reply to Jacob Lifshay from comment #38) > How about this: > we write *both* a horizontal-first (since I find that straightforward and > relatively obvious) and later attempt a vertical-first version? we can then > compare them and see wether all the extra complexity is beneficial. another reason to start out with horizontal-first is that powmod is currently sitting effectively untestable until we have *any* fast enough divmod algorithm. I am quite confident I can write a horizontal-first divmod algorithm in a relatively short amount of time, I am not very confident I can write a vertical-first divmod at all and would probably need to go by way of writing a horizontal-first algorithm and then transforming it into a vertical-first algorithm anyway.
also bear in mind, if you have

    s[i+1] = s[i]

then as long as you have 64-bit regs (no elwidth) you can do

    sv.whatever *r1, *r0, ...

that works with 1<<r3 as well, so

    sv.whatever/dm=1<<r3 *r1, *r0

and it works with REMAP just as well.
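a tiny Python model of the semantics being relied on here (my own
illustration, not project code): element loops execute in strictly
ascending order, so with a one-register overlap each element's write
is visible to the next element's read. note the consequence: s[i+1] =
s[i] run sequentially propagates element 0 through the whole vector (a
splat), not a parallel shift-copy, and that ordering guarantee is what
makes the overlapping-operand trick well-defined.

```python
# model of "sv.addi *r1, *r0, 0" with a one-register overlap:
# each iteration reads the value written by the previous one.
regs = [5, 0, 0, 0]
VL = 3
for i in range(VL):          # ascending element order, as in SVP64
    regs[i + 1] = regs[i]
assert regs == [5, 5, 5, 5]  # element 0 splatted through the vector
```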
(In reply to Jacob Lifshay from comment #39)
> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > but just going straight to something inefficient (such as the
> > loop-unrolled mul256 algorithm you wrote, although it gets us
> > one incremental step ahead), this is *not* satisfying the conditions
> > of the grant.
>
> which definition of efficiency are you using?

the one that meets customer requirements, which i repeated many times:
top priority on code size, number of regs second.

it is down to the hardware to merge VF and HF elements into "issue
batches". everyone here, including you, repeatedly keeps assuming VF
is incapable of doing that, "therefore it must be inefficient
performance-wise". this assumption is 100% false.

(In reply to Jacob Lifshay from comment #40)
> algorithm in a relatively short amount of time, I am not very confident I
> can write a vertical-first divmod at all and would probably need to go by
> way of writing a horizontal-first algorithm and then transforming it into a
> vertical-first algorithm anyway.

i don't have a problem with that at all. good call, go for it.
btw just be aware, on the 3-in 2-out instructions with implicit RS,
the pairs (hi lo) are put into separate blocks. RT (lo half) is
consecutive, RS (hi half) starts at RT+MAXVL.

can i suggest not worrying about how many regs are used, first.
there are... enough :)
(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jacob Lifshay from comment #39)
> > (In reply to Luke Kenneth Casson Leighton from comment #37)
> > > but just going straight to something inefficient (such as the
> > > loop-unrolled mul256 algorithm you wrote, although it gets us
> > > one incremental step ahead), this is *not* satisfying the conditions
> > > of the grant.
> >
> > which definition of efficiency are you using?
>
> the one that meets customer requirements which i repeated many times:
> top priority on code size. number of regs second.

ok.

> it is down to the hardware to merge VF and HF elements into
> "issue batches". everyone here, including you, repeatedly
> keeps assuming VF is incapable of doing that, "therefore it
> must be inefficient performance-wise".

I was basing my efficiency claims on both:

* the complexity I expect will be required to get a vertical-first
  divmod to work at all. I fully expect it to take *more* (and more
  complex) instructions than the horizontal-first version, because
  afaict it doesn't cleanly map to VF mode. this is bad for both code
  size and power, and probably performance.
* it will most likely require lots of dynamic predicates (more than
  just 1<<r3) with *large* numbers of bits that are zeros. this is
  inherently rather inefficient from a performance perspective,
  because I'm assuming either:
  * the predicate has to be handed to the decode pipe before the
    predicated operations can be issued. this is bad for performance
    because you're forced to stall the entire fetch/decode pipe for
    several cycles while waiting for the predicate to be computed.
  * or the predicate is not known at decode/issue time, so the full
    set of element operations is issued, potentially blocking issue
    queues, only to later find out that most of them were wasted.
    this is bad for both power and performance.
the predicate not being known at issue time also means that propagating results to registers and/or any following instructions is also blocked for any instructions that use twin-predication, since the cpu needs to wait until it knows which registers to write to.
(In reply to Luke Kenneth Casson Leighton from comment #42)
> i don't have a problem with that at all. good call, go for it.

Thanks! I did add a system so the user can adjust which registers and vector sizes are chosen, as you requested in comment #27 -- it doesn't need a register allocator, since the *user* is the register allocator :) all it does is some cursory checks to ensure the passed-in registers aren't completely borked (e.g. where an assignment is omitted by having both the source and dest be the same register, I assert that they actually are the same register). I added an early WIP assembly version, as well as added register assignments to the python version for logging.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=d7c8086452d26a17a6643cb7858fa1cfa51416b9

commit d7c8086452d26a17a6643cb7858fa1cfa51416b9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 21:05:17 2023 -0700

    add WIP Knuth's algorithm D assembly

commit f1f40a302cd40dfd953cd71e438ed9f9321562a5
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:38:30 2023 -0700

    divmod: assign registers to variables

commit 49ab8a331bce33def6cd21c637845adfe6c38b95
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:35:40 2023 -0700

    adapt divmod algorithm for putting variables in registers

commit b134f188788c1dcb5614244cb4f6fdceb645c543
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:30:43 2023 -0700

    add more features for _DivModRegsRegexLogger

commit b7be514750e59adbaf3fe543d2977d81f16fd322
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:25:43 2023 -0700

    finish moving Knuth algorithm D into a class

commit 592e2bfeb196ed06a92f476e936e6322ef91a945
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:21:40 2023 -0700

    merely indent function [skip ci]

commit 015c71a871760955a9e9588720ef7ad8b3cd258d
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:19:25 2023 -0700

    start adding DivModKnuthAlgorithmD class

commit 136e6f26ae6296d781e2a5d4ca4342a14cff2158
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:18:28 2023 -0700

    format code
(In reply to Jacob Lifshay from comment #44)
> (In reply to Luke Kenneth Casson Leighton from comment #42)
> > (In reply to Jacob Lifshay from comment #39)
> > > (In reply to Luke Kenneth Casson Leighton from comment #37)
> > > > but just going straight to something inefficient (such as the
> > > > loop-unrolled mul256 algorithm you wrote, although it gets us
> > > > one incremental step ahead), this is *not* satisfying the conditions
> > > > of the grant.
> > >
> > > which definition of efficiency are you using?
> >
> > the one that meets customer requirements which i repeated many times:
> > top priority on code size. number of regs second.
>
> ok.
>
> > it is down to the hardware to merge VF and HF elements into
> > "issue batches". which is here repeatedly everyone including
> > you keeps assuming VF is incapable of doing that "therefore it
> > must be inefficient performance wise".
>
> I was basing my efficiency claims on both:
> * the complexity I expect will be required to get a vertical-first divmod to
> work at all. I fully expect it to take *more* (and more complex)
> instructions than the horizontal-first version, because afaict it doesn't
> cleanly map to VF mode. this is bad for both code size and power and
> probably performance.
> * it will most likely require lots of dynamic predicates (more than just
> 1<<r3) with *large* amounts of bits that are zeros,

again: please please please, patiently i repeat: please *stop* making assumptions about what the hardware is capable of. the assumption that you are making is one that a *SIMD* architecture has because the ISA directly connects to the back-end hardware. this is *not true* with Simple-V.

> this inherently is
> rather inefficient from a performance perspective, because I'm assuming
> either:
> * the predicate will have to be handed to the decode pipe
> before the predicated operations can be issued. this is
> bad for performance

which is *not the focus of this research*

> because you're forced to stall the
> entire fetch/decode pipe for several cycles while waiting
> for the predicate to be computed.

*only on naive implementations*

> * the predicate is not known at decode/issue time, so the
> full set of element operations are issued, potentially
> blocking issue queues,

*only on naive implementations*

> the predicate not being known at issue time also means
> that propagating results to registers and/or any following
> instructions is also blocked for any instructions that
> use twin-predication, since the cpu needs to wait until
> it knows which registers to write to.

only on underresourced implementations. it basically is not your problem "as an assembler writer" to worry about what hardware does, because there is no direct connection (unlike in a SIMD ISA). this is something it seems that literally everyone who has ever used SIMD has to unlearn, including you. please completely and utterly forget, stop talking about, stop 2nd guessing and stop trying to assess, any kind of hardware performance *completely*, and just trust the hardware, and focus exclusively on getting program size down as compact as possible ok?
(In reply to Luke Kenneth Casson Leighton from comment #46)
> the assumption that you are making is one that a *SIMD* architecture
> has because the ISA directly connects to the back-end hardware.

no, i was thinking of vertical-first mode, not SIMD. I do know how it works, having read your explanations for years and having read about how Mitch Alsup's cpu design works.

> it basically is not your problem "as an assembler writer" to
> worry about what hardware does, because there is no direct connection
> (unlike in a SIMD ISA).

it *is* my problem, because:
* the code I write directly determines code size, which is what we're optimizing for here.
* the data-flow graph of the code I write directly determines minimum latency even if the cpu has effectively infinite resources. you can't escape Amdahl's law. not even if you have a monster cpu with all the fanciest SV features.
I'm dropping the conversation about performance, it's a distraction and less arguing means less stress for luke. For code size, we can wait and see.
(In reply to Jacob Lifshay from comment #47)
> (In reply to Luke Kenneth Casson Leighton from comment #46)
> > the assumption that you are making is one that a *SIMD* architecture
> > has because the ISA directly connects to the back-end hardware.
>
> no, i was thinking of vertical-first mode, not SIMD. I do know how it works,
> having read your explanations for years and having read about how Mitch
> Alsup's cpu design works.

... but you then go on to make assumptions about the hardware implementation, which is completely inappropriate to do given that there will be *multiple* implementations, of varying capability. in amongst those, the *only* thing that is 100% guaranteed is that "smaller code size results in lower power consumption". therefore you *must* stop limiting your thinking based on the *assumption* that *we* will implement *only one* piece of hardware, and that that one piece of hardware will have "some limitation or other on performance if feature X of SVP64 is utilised". the entire purpose of every single one of the grants is to prove the *ISA*, not the implementations *of* the ISA.

> * the data-flow graph of the code I write directly determines minimum
> latency even if the cpu has effectively infinite resources. you can't escape
> Amdahl's law. not even if you have a monster cpu with all the fanciest SV
> features.

again: *not your damn problem*!!! :) that is the *hardware implementor's* problem to deal with, not yours, right now. i asked you not to second-guess the hardware implementation(s) and you did exactly that only 5 minutes later!

(In reply to Jacob Lifshay from comment #48)
> I'm dropping the conversation about performance, it's a distraction

yes it is. 100%. we are proving - and demonstrating - the *ISA* here, not implementations *of* the ISA, ok? we simply cannot guess what *future* implementors will come up with that solve every single one of every single "objection related to performance". therefore we *must* drop it, stone-cold, as "100% not our problem". exactly like the Khronos Group does not specify performance in the Certification of a given implementation. only "do the tests pass, yes, then you got your Compliance Certificate".
As a hardware guy :-) I am very conscious that just as in the software world there are "ways of doing things" that a hardware person won't understand or appreciate, there are hardware tricks that will run rings round the software, and cannot be guessed at. Beyond a cursory look at what the hardware approach might be it is not efficient to second-guess the hardware implementation until RED has a hardware team which can interact on these ideas. Don't lose anything clever but as I have said before save them for future thinking.
(In reply to djac from comment #50)
> Beyond a cursory look at what the hardware approach might be it is not
> efficient to second guess the hardware implementation until RED has a
> hardware team which can interact on these ideas.

not just inefficient but actually damaging, because they will not have any examples to demonstrate why *their* misunderstandings of the full features are wrong (yes, the hardware team will have to learn SVP64 "on-the-hoof"). we *need* examples that push the hardware implementors out of their SIMD-based assumptions.

> Don't lose anything clever but as I have said before save them for
> future thinking.

yes, this is part of an ISA Designer's responsibility: to think through *all* possible implementations (exactly like how the Khronos Group Members must do). as the person directly responsible for SVP64 and having been on it 100% full-time i had to do this (partly in my head, partly written). it will be years if not decades before all "hardware tricks" emerge. a SIMD ISA on the other hand, the only real "tricks" that they can apply are:

1. multi-issue (both ARM NEON and IBM POWER9/10 do this)
2. Cray-conceptualised vector chaining *on elements in the SIMD registers* (the only public recent company to reveal they did this was AMD with their AVX512 implementation.)

we have way more options open than just those two, thanks to the clean "abstracted" API, but i say "we" when i mean *all* future implementors of SVP64 (not just RED, in the distant future).
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=5f0468b5f432804b8e6d201f2e38bdd52fdce612

commit 5f0468b5f432804b8e6d201f2e38bdd52fdce612
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:07:17 2023 -0700

    WIP divmod: implemented division by single word

commit b43960a84efcb0386c07281961d439167ae8a3e7
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:05:49 2023 -0700

    support ignoring integer registers for ExpectedState

commit 147e028182d01f9e3f698cdffef7d1442c8302ff
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:03:38 2023 -0700

    convert assigned values to SVSHAPE when writing to SVSHAPE[0-3] SPRs

    this makes mtspr SVSHAPE0, reg properly maintain ISACaller invariants
one annoying thing I encountered while implementing DivModKnuthAlgorithmD is setting up a REMAP to reverse element order -- to implement the simple python loop where denom_size = len(v):

    n = denom_size  # get non-zero length of divisor
    while n > 0 and v[n - 1] == 0:
        n -= 1

I had to do all of:

    setvl 0, 0, {denom_size}, 0, 1, 1
    addis 0, 0, {svshape_high}
    ori 0, 0, {svshape_low}
    mtspr {SVSHAPE0}, 0  # mtspr SVSHAPE0, 0
    svremap 0o01, 0, 0, 0, 0, 0, 0  # enable SVSHAPE0 for RA
    sv.cmpi/ff=ne *0, 1, *{v}, 0
    setvl {n_scalar}, 0, 1, 0, 0, 0  # getvl {n_scalar}
    subfic {n_scalar}, {n_scalar}, {denom_size}

if we had reverse gear for data-dependent fail-first, it would be 3 instructions:

    setvl 0, 0, {denom_size}, 0, 1, 1
    sv.cmpi/ff=ne/mrr *0, 1, *{v}, 0
    setvl {n_scalar}, 0, 1, 0, 0, 0  # getvl {n_scalar}

(I also had to fix the simulator for mtspr SVSHAPE*, which was not keeping the SVSHAPE* SPRs as SVSHAPE instances; I changed SPR.__setitem__ to convert to SVSHAPE)
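(as a sanity check, here is a tiny pure-python model of the end result the proposed 3-instruction /mrr form would compute -- the exact fail-first/VLI truncation semantics are glossed over, this only models the outcome:)

```python
def significant_length(v, denom_size):
    """model of reverse-gear ff=ne: returns the non-zero length of v"""
    # /mrr: walk the elements in reverse (from the top); ff=ne truncates
    # VL at the first non-zero word, so VL counts high-order zero words
    VL = 0
    for el in reversed(v[:denom_size]):
        if el != 0:
            break
        VL += 1
    # the final subfic in the 8-instruction version: n = denom_size - VL
    return denom_size - VL

# matches the python while-loop above
assert significant_length([1, 2, 0, 0], 4) == 2
assert significant_length([0, 0, 0], 3) == 0
```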
(In reply to Jacob Lifshay from comment #53) > one annoying thing I encountered while implementing DivModKnuthAlgorithmD is > setting up a REMAP to reverse element order use /mrr.
(In reply to Jacob Lifshay from comment #53)
> subfic {n_scalar}, {n_scalar}, {denom_size}
>
> if we had reverse gear for data-dependent fail-first, it would be 3
> instructions:
> setvl 0, 0, {denom_size}, 0, 1, 1
> sv.cmpi/ff=ne/mrr *0, 1, *{v}, 0
> setvl {n_scalar}, 0, 1, 0, 0, 0 # getvl {n_scalar}

sorry, too early in the morning. still blinking. let me check the spec. this is exactly the kind of thing that these research grants are here to find.

> (I also had to fix the simulator for mtspr SVSHAPE*, which was not keeping
> the SVSHAPE* SPRs as SVSHAPE instances, I changed SPR.__setitem__ to convert
> to SVSHAPE)

awesome.
https://libre-soc.org/openpower/sv/cr_ops/

## Format

SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:

|6 | 7 |19:20|21 | 22:23 | description |
|--|---|-----|---|---------|------------------|
|/ | / |0 0 |RG | dz sz | simple mode |
|/ | / |1 0 |RG | dz sz | scalar reduce mode (mapreduce) |
|zz|SNZ|VLI 1|inv| CR-bit | Ffirst 3-bit mode |
|/ |SNZ|VLI 1|inv| dz sz | Ffirst 5-bit mode (implies CR-bit from result) |

nope, you're out of luck. not enough bits. that's annoying. in *theory* it could be done... oh hang on! look, bit 6 is free! it won't be perfect, ddffirst on CR-Field ops (3) would not be possible but on CR-*bit* ops (5) such as cmpi not a problem. okaay bugreport time, done as bug #1183. it'll need a TODO list involving:

* spec
* powerdecoder
* insndb
* ISACaller unit tests
* binutils
* binutils unit tests

a lot of work nowadays.
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=aff7be153e28891f6873b89802d9af8fa5e6907c commit aff7be153e28891f6873b89802d9af8fa5e6907c Author: Jacob Lifshay <programmerjake@gmail.com> Date: Wed Oct 11 20:29:51 2023 -0700 WIP getting asm version of knuth's algorithm d working
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=8d6113697dfce5ca482dc7cdd83d631992332a6b commit 8d6113697dfce5ca482dc7cdd83d631992332a6b Author: Jacob Lifshay <programmerjake@gmail.com> Date: Fri Oct 13 15:13:15 2023 -0700 WIP divmod: finished writing out asm knuth's algorithm d, still buggy
I ran into an issue: I'm trying to use:

    svremap 0o10, 0, 0, 0, 0, 0, 0  # enable SVSHAPE0 for RT
    # q[j] = qhat
    sv.or 4, 12, 12  # note all of RT, RA, and RB are scalar

I expected the REMAP in SVSHAPE0 to apply to RT, but it seems like it doesn't. I couldn't find documentation one way or the other... (I saw documentation that says REMAP doesn't apply to scalar instructions, but I think that just means instructions without a SV-prefix.)

I think REMAP should apply to scalar operands when there's a sv-prefix (but not sv-single-prefix). This seems useful to me so you can e.g. extract the 2nd element in a byte vector by using SVSHAPE.offset or other times you'd want to use a REMAP-ped index.

For now, I'll use:

    sv.or/m=1<<r3 *4, 12, 12

but that workaround forces me to put the index in r3 which means I can't keep anything else there for that instruction.
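(for reference, the workaround's effect can be modelled in a few lines of python -- the register file is just a list here, and the names are made up for illustration:)

```python
def write_one_element(regs, vec_base, r3, value):
    # models sv.or/m=1<<r3 *vec_base, src, src: the predicate mask
    # 1<<r3 enables exactly one element, so only regs[vec_base + r3]
    # is written; every other register is left untouched
    regs[vec_base + r3] = value

regs = [0] * 8
write_one_element(regs, vec_base=4, r3=2, value=0x1234)
# regs[6] now holds the value; all other entries are still 0
```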
I got the assembly version of Knuth's algorithm D to work!

Next, I plan on getting the powmod to work, and then optimizing the divmod somewhat, since it probably has a few redundant instructions. After that, I think I'll work on the multiply remap, and then, if we still have enough time, attempt the vertical-first Knuth's Algorithm D (I don't expect to have enough time to finish it due to lack of budget).

running one divmod using the simulator takes 4.81s (I'd randomly guess 30% is just for startup e.g. insndb parsing) on my computer when calculating:

    n = 0x5b_16ef_a3ca_8698_fa0f_0509_d80d_6a81_0ea7_baf2_fe8d_f34a_de28_f881_e13c_b9d3_7007_2241_a954_4276_d9c1
    d = 0x7_c54b_a969_8413_d194_f6a7_154d_2d7d_fe3a_6783_4859_c671_0374
    q, r = divmod(n, d)
    q == 0xb_b8e2_9e3f_8a67_25bd_c45b_f2a4_2f7a_15b4
    r == 0x5_8e2f_6500_9eb0_8011_4645_c75a_e598_dd10_9ee6_3d1f_846f_e831

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=4dcec3a630e4e082447230445ef31e24245cb6aa

commit 4dcec3a630e4e082447230445ef31e24245cb6aa
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 21:17:09 2023 -0700

    add comments telling people to keep the asm and python versions in sync

commit 88e79ac8d5758f225703750b6c2a3bd6130b6673
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 21:06:02 2023 -0700

    divmod: asm version of Knuth's algorithm D works!

commit 2f2a215a11d9bbb84671253d3340dcf2654382e2
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 20:59:36 2023 -0700

    add labels to DIVMOD REGEX log

commit b3dc024833e16218399b958360243bbedfe4a51e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 20:54:12 2023 -0700

    put DIVMOD REGEX under new LogKind: LogKind.OutputMatching
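(the quoted q and r can be cross-checked in plain python, assuming the hex constants above are copied faithfully from the passing unit test:)

```python
# values as quoted in the comment above (assumed copied verbatim)
n = 0x5b_16ef_a3ca_8698_fa0f_0509_d80d_6a81_0ea7_baf2_fe8d_f34a_de28_f881_e13c_b9d3_7007_2241_a954_4276_d9c1
d = 0x7_c54b_a969_8413_d194_f6a7_154d_2d7d_fe3a_6783_4859_c671_0374
q, r = divmod(n, d)
# the defining invariant of divmod must hold
assert q * d + r == n and 0 <= r < d
```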
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=685380aa5d008dbd64d22a79332f282e9a849dca

commit 685380aa5d008dbd64d22a79332f282e9a849dca
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 22:22:50 2023 -0700

    powmod asm tests pass!

commit c4783587012052e5a80a29f5efc4a7a27c051943
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 16:09:07 2023 -0700

    switch powmod to using new divmod implementation -- test not enabled yet

commit 34eb127cf2889c6f2c7713e782a69cd644824b95
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 16:08:44 2023 -0700

    call log, not print

commit 030ba8e67010e9f814edbc67a15190ca913e5ff0
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 15:39:17 2023 -0700

    make parser work for pypy

    pypy's ast classes' constructors seem to only work if all arguments
    or no arguments are passed, not only some of them.
(In reply to Jacob Lifshay from comment #59) > I ran into an issue: I'm trying to use: > svremap 0o10, 0, 0, 0, 0, 0, 0 # enable SVSHAPE0 for RT > # q[j] = qhat > sv.or 4, 12, 12 # note all of RT, RA, and RB are scalar > > I expected the REMAP in SVSHAPE0 to apply to RT, but it seems like it > doesn't. no, it doesn't. this is deliberate. > I think REMAP should apply to scalar operands when there's a sv-prefix (but > not sv-single-prefix). no. > This seems useful to me so you can e.g. extract the > 2nd element in a byte vector by using SVSHAPE.offset or other times you'd > want to use a REMAP-ped index. > > For now, I'll use sv.or/m=1<<r3 *4, 12, 12 yes that's how to index a single element. > but that workaround forces me to put the index in r3 which means I can't > keep anything else there for that instruction. indeed.
(In reply to Jacob Lifshay from comment #61) > https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog; > h=685380aa5d008dbd64d22a79332f282e9a849dca > > commit 685380aa5d008dbd64d22a79332f282e9a849dca > Author: Jacob Lifshay <programmerjake@gmail.com> > Date: Mon Oct 16 22:22:50 2023 -0700 > > powmod asm tests pass! so, now that this is all working, I think we can push all improvements of the bigint mul algorithm into the bigint mul remap bug #1155 and declare this bug complete and merge the branch. Also, I think we should assign some money to luke for the review and feedback (and also because he needs money). Luke, what do you think?
(In reply to Jacob Lifshay from comment #63) > (In reply to Jacob Lifshay from comment #61) > > powmod asm tests pass! > > so, now that this is all working, I think we can push all improvements of > the bigint mul algorithm into the bigint mul remap bug #1155 ok. > and declare > this bug complete and merge the branch. ok > Also, I think we should assign some money to luke for the review and > feedback (and also because he needs money). > > Luke, what do you think? good idea. the bigmul remap is quite straightforward, there is a step-by-step procedure to follow. let's get RFPs in and move to bug #1155, and examine the assembler here for patterns. ed25519 also has patterns but they are carry-save ones (modulo wrap on the index sums, (x+y)%N rather than just x+y)
(In reply to Luke Kenneth Casson Leighton from comment #64) > > and declare > > this bug complete and merge the branch. > > ok don't rush ahead and declare everything complete before I merge...currently running tests on rebased branch.
NLnet asked last week to link to CI test logs for review. do you have them? not the full log, just the unit test. not "upload 100 MB of crap as attachment overwhelming the VM", just "where is the CI test".
(In reply to Jacob Lifshay from comment #65)

ok

> don't rush ahead and declare everything complete before I merge...currently
> running tests on rebased branch.

ack, leave it to you. can you give the link to the CI log, just the one that has the powmod test; if it's not separate it may need a line number.
(In reply to Luke Kenneth Casson Leighton from comment #66)
> NLnet asked last week to link to CI test logs for review.
> do you have them?
> not the full log just the unit test.

I don't have logs for just one unit test, CI runs everything and doesn't split stuff out like that (though it does spit out error messages that are delimited in the log).

> not "upload 100 MB of crap as attachment overwhelming the VM"
> just, "where is the CI test"

I can link to debian salsa which is nicer to look at but the logs will go away eventually. Alternatively, I can link to the commit in the git repo of logs I'm keeping (aren't you glad I did that...) on my build server, but it doesn't have handy html pages, it's just a git repo you can clone.
(In reply to Jacob Lifshay from comment #65) > don't rush ahead and declare everything complete before I merge...currently > running tests on rebased branch. well, git majorly messed up the rebase, so I tried git merge which is also messed up...investigating...
(In reply to Jacob Lifshay from comment #69)
> (In reply to Jacob Lifshay from comment #65)
> > don't rush ahead and declare everything complete before I merge...currently
> > running tests on rebased branch.
>
> well, git majorly messed up the rebase, so I tried git merge which is also
> messed up...investigating...

ok, fixed it afaict, it was the LogKind -> LogType rename getting in there to mess stuff up.

running tests is taking forever (8min and only 1/4 of the full powmod test completes), so I pushed the newly-rebased branch to programmerjake/powmod and will wait for CI while I do other stuff. I also manually pushed to the mirror so I don't have to wait 24hr(!).

https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704
(In reply to Jacob Lifshay from comment #70)
> (In reply to Jacob Lifshay from comment #69)
> > (In reply to Jacob Lifshay from comment #65)
> > > don't rush ahead and declare everything complete before I merge...currently
> > > running tests on rebased branch.
> >
> > well, git majorly messed up the rebase, so I tried git merge which is also
> > messed up...investigating...
>
> ok, fixed it afaict, it was the LogKind -> LogType rename getting in there
> to mess stuff up.

(i always regularly rebase from master to avoid that)

> running tests is taking forever (8min and only 1/4 of the full powmod test
> completes),

it'll need moving to "long" tests.

> so I pushed the newly-rebased branch to programmerjake/powmod and
> will wait for CI while I do other stuff. I also manually pushed to the
> mirror so I don't have to wait 24hr(!).
>
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704

hmm hmm hmm we will need something more focussed in future, running just the one unit test not several, so that the log is targeted at a given bugreport. otherwise we have to provide a reproducible build script *for every bug for every RFP* (!!)

ok i am seeing "passed" on 116 searches for "powmod" which in my (not-having-been-writing-the-tests) mind seems excessive but i know you do very comprehensive and thorough tests so am perfectly happy with them.
(edit: trim context)

(In reply to Luke Kenneth Casson Leighton from comment #71)
> it'll need moving to "long" tests.

yeah, I remember it taking much less time before I rebased...

> ok i am seeing "passed" on 116 searches for "powmod" which in my
> (not-having-been-writing-the-tests) mind seems excessive but i know
> you do very comprehensive and thorough tests so am perfectly
> happy with them.

well, it mentions each test twice, once when it starts it and once when it finishes it. also, most of those tests are actually testing sub-pieces of powmod, e.g. I have a pile of tests for divmod to ensure I actually hit all the corner cases (otherwise how would I know if it's correct? it's too complex for me to confidently say it's bug-free without them). the few tests testing the full powmod algorithm are the ones that take 30min.
(we also cannot send them a link where when they open it, it literally overwhelms their machine, taking MINUTES to load due to MASSIVE javascript processing time in addition to a long download time). some... refining to do, here :) (one more task for a future grant, sigh)
(In reply to Jacob Lifshay from comment #72) > well, it mentions each test twice, once when it starts it and once when it > finishes it. also, most of those tests are actually testing sub-pieces of > powmod, aaawwesoome, this is *exactly* the technique i'd have followed, but also what i want shriya and sadoon to learn. otherwise it is "overwhelm" because assembler is just that hard.
(In reply to Jacob Lifshay from comment #70)
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704

it still didn't finish after running 6hr! (mostly due to pytest's terrible scheduling decisions, it likes to run a lot of tests serially on 1 core towards the end, powmod just happens to be at the end with some very long tests)

I made some changes that should speed up CI a bit, unfortunately some profiling shows about 70% of runtime is taken up by nmigen simulation, which is relatively slow. I think the biggest change that probably will speed up CI is renaming the test file so pytest schedules it earlier (allowing more parallelism), I'll try that next.

commit 22c2fd82ddb9f4383eaf5bc867779d6b6845fcd4
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:55:08 2023 -0800

    pytest: try to improve scheduling

commit 795cd9cba2882b0dd80c4fb93c984501e9f30cf4
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:54:05 2023 -0800

    test_caller_svp64_powmod: rename to test_aaa_caller_svp64_powmod
    so pytest tries to run it earlier

commit 0280a1ead801545c10b72829092dd1dc65aaf054
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:21:36 2023 -0800

    speed up CI by disabling logging and VCD generation

commit 489c777a2f0abf9fee2899cc1ceb59aec0a97d22
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:20:54 2023 -0800

    test/runner: allow disabling VCD using SIM_NO_VCD=1 env var
tests currently running on: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4986961 looks like pytest decided to schedule all the long powmod tests in parallel this time! (TestPowMod43-45 iirc) I'll merge once they pass.
(In reply to Jacob Lifshay from comment #77)
> tests currently running on:
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4986961
>
> looks like pytest decided to schedule all the long powmod tests in parallel
> this time! (TestPowMod43-45 iirc)
>
> I'll merge once they pass.

they passed, merged to master. the 7 test failures are unrelated and expected. to see them:

    git clone "https://build.libre-soc.programmerjake.tk/build-archive.git" build-archive
    cd build-archive
    git checkout c1ecc0f936a992ca86497e0cb40c41689b609a61
    less -R pipelines/608110/job-4986961-log.txt

type /svp64_powmod and press enter, it will show you all powmod tests; press n to go to the next occurrence.
rats forgot about bug #1183, the locations where the assembler is sub-optimal (5 instructions instead of 1) needs to be cross-referenced, bugreport raised if needed. the important thing here is to document what was discovered. implementation can come later (different bug, or even different grant), but documenting and capturing the advances is *real* important.
(In reply to Luke Kenneth Casson Leighton from comment #79)
> rats forgot about bug #1183, the locations where the assembler is
> sub-optimal (5 instructions instead of 1)

ok, I can't recall all the locations, but I think the motivating examples are the spots where Knuth's Algorithm D needs inputs with the most-significant word being non-zero, so we have to find the number of leading zero words in our fixed-length inputs so we know how many words to tell Algorithm D are in the dynamically-sized inputs (basically, we just make Algorithm D ignore all the high-order zero words and act as if the inputs are really shorter than they are). so, count-leading-zeros, except we're counting zero words, not bits.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=7fc794685bebb1f3c2451c64da041a0e81143e29;hb=d3c7a0ddc169877e8daf38a347091c366e85624b#l717

and

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=7fc794685bebb1f3c2451c64da041a0e81143e29;hb=d3c7a0ddc169877e8daf38a347091c366e85624b#l734

I'll link bug #1183 to this comment, so I think that's all we need.
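(the word-level count-leading-zeros step described above is tiny in python -- a sketch only; the real code does this on the register file via the fail-first sequence linked above:)

```python
def used_word_count(words):
    """number of words Algorithm D should look at, trimming high zeros"""
    # words is little-endian (least-significant word first); step n back
    # past every high-order zero word so the top word is non-zero
    n = len(words)
    while n > 0 and words[n - 1] == 0:
        n -= 1
    return n

assert used_word_count([0xdead, 0, 0, 0]) == 1
assert used_word_count([0, 1, 0, 0]) == 2
assert used_word_count([0, 0]) == 0
```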
(In reply to Jacob Lifshay from comment #80) > ok, icr all the locations, but I think the motivating examples are the spots > where Knuth's Algorithm D needs inputs with the most-significant word being > non-zero, so we have to find "have to"... remember what i said about not doing performance optimisations and keeping the assembler brutally short? especially if it makes a mess, but also even to find out that this was a shortcoming *after the bug is closed* is quite a serious infraction of the milestone's purpose and could be flagged by an Auditor! can you please immediately document ALL such findings as an absolute top priority as it is literally the entire purpose of this entire grant (2021-02A-051) which you should full well know as i have repeated it to you at least eight times in three years! sorry to have to tell you off on that but i am a little taken aback
(In reply to Luke Kenneth Casson Leighton from comment #81) > (In reply to Jacob Lifshay from comment #80) > > > ok, icr all the locations, but I think the motivating examples are the spots > > where Knuth's Algorithm D needs inputs with the most-significant word being > > non-zero, so we have to find > > "have to"... remember what i said about not doing performance > optimisations and keeping the assembler brutally short? it isn't a performance optimization, it is required for correctness of Knuth's Algorithm D.
(In reply to Jacob Lifshay from comment #82) > it isn't a performance optimization, it is required for correctness of > Knuth's Algorithm D. ahh okaay. been a while i *think* i know where you mean. should work now. this unit test confirmed works https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dd_ffirst.py;h=c371830cf#l70 still documenting these (inspired) changes needed to SVP64 is critical, sorry for not making it really really explicitly clear as a task on the TODO list.
(In reply to Luke Kenneth Casson Leighton from comment #83) > still documenting these (inspired) changes needed to SVP64 is > critical, sorry for not making it really really explicitly clear as a > task on the TODO list. the change is documented in the comment I linked to from https://bugs.libre-soc.org/show_bug.cgi?id=1183#c21 I also opened bug #1237 to track the needed change to the divmod assembly (not part of this bug). that's everything that i recall, so I'm closing this.
(In reply to Jacob Lifshay from comment #84) > the change is documented in the comment I linked to from > https://bugs.libre-soc.org/show_bug.cgi?id=1183#c21 i meant: it goes on a *wiki* page, explicitly in an explicit summary report. i *think* i have it implemented, but seeing it being used, and the implementation linking back to the feature, is incredibly important. > I also opened bug #1237 to track the needed change to the divmod assembly > (not part of this bug). > > that's everything that i recall, so I'm closing this. good stuff.