See also:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-January/005475.html
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/register.20allocating.20the.20stack/near/319529633

Original comment: posting here because libre-soc-dev is currently broken.

Now that the NLnet deadline is effectively over (no time left for new work), I'd like to work on my bigint presentation. If for some reason my presentation is rejected, this would just get published online somewhere instead.

I want to build a practical 2048-bit modular exponentiation function for RSA -- the subtasks should fit under https://bugs.libre-soc.org/show_bug.cgi?id=773 and probably other gigabit router subtasks too. It would use Montgomery multiplication and a Toom-Cook multiplication algorithm generator that we can instantiate with different parameters to find the most efficient one. Toom-Cook covers all of: (Toom-1) traditional O(n^2) long multiplication, (Toom-2) O(n^1.58) Karatsuba multiplication, and higher algorithms such as Toom-3 at O(n^1.46).

It would be run on a simplistic model cpu with 1 256-bit simd add unit, 1 256-bit wide-mul unit, and scalar units (the model would be adjustable). I would create a simulator for that; the simulator would cover ALU pipeline timing and dependencies but not much else. It would be designed to output timing info such that blender can generate an animation of those ALUs executing the algorithm. The simulator would probably be written in python and use our existing isacaller simulator to generate execution traces for it to run on.

I'm estimating this taking 3-4 weeks of work. It would probably be the most significant demo of svp64 actually providing a significant speed-up so far (isacaller doesn't count since it doesn't simulate actual cpu timing; soc.git isn't far enough along to simulate multi-issue and operation merging for SIMD backends).
this would cover the second half of the bigint presentation; the first half would be explaining concepts. What do you think?
(In reply to Jacob Lifshay from comment #0)
> It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> would create a simulator for that, the simulator would cover ALU pipeline
> timing and dependencies but not much else.

everything but this is a great idea. there's no point spending time on doing hardware-cycle-accurate pipelining when cavatools is specifically designed to do exactly that, and already has the infrastructure in place. any time spent on developing any hardware-cycle-accurate simulations is therefore both time and money completely wasted.

also, as i have repeated many times, timing is *not* part of the bitmanip NLnet proposal, because the financial cost to NLnet would be an order of magnitude larger and would put the entire project at risk to even attempt to tackle timing. *this is not up for discussion*. please do not *under any circumstances* spend *any* time attempting to show timing or any level of performance other than reduction in instruction count and reduction in complexity at this very early stage. the entire focus at the moment is on power reduction and complexity reduction. *NOT* in ANY WAY on TIMING. at all.

the algorithm implementations: great idea.
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
>
> > It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> > 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> > would create a simulator for that, the simulator would cover ALU pipeline
> > timing and dependencies but not much else.
>
> everything but this is a great idea. there's no point spending time
> on doing hardware-cycle-accurate pipelining when cavatools is specifically
> designed to do exactly that, and already has the infrastructure in place.
> any time spent on developing any hardware-cycle-accurate simulations is
> therefore both time and money completely wasted.

the point is to have pretty animations of how data moves in the cpu so people understand how the wide-mul merging works (which requires something to construct those animations), not to have specific timing for exact performance comparisons.

the sw to construct those animations is the simplistic SIMD ALU simulator -- imho spending a week writing that is waay better than creating animations by hand that only cover 1 specific run of 1 specific algorithm and is likely to have mistakes because it was done by hand.
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > (In reply to Jacob Lifshay from comment #0)
> >
> > > It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> > > 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> > > would create a simulator for that, the simulator would cover ALU pipeline
> > > timing and dependencies but not much else.
> >
> > everything but this is a great idea. there's no point spending time
> > on doing hardware-cycle-accurate pipelining when cavatools is specifically
> > designed to do exactly that, and already has the infrastructure in place.
> > any time spent on developing any hardware-cycle-accurate simulations is
> > therefore both time and money completely wasted.
>
> the point is to have pretty animations of how data moves in the cpu so
> people understand how the wide-mul merging works (which requires something
> to construct those animations), not to have specific timing for exact
> performance comparisons.
>
> the sw to construct those animations is the simplistic SIMD ALU simulator --
> imho spending a week writing that is waay better than creating animations by
> hand that only cover 1 specific run of 1 specific algorithm and is likely to
> have mistakes because it was done by hand.

it is not a waste of time or money, because it has a completely different purpose and deadline than cavatools: it generates animations rather than trying to quickly simulate a whole program, and it is simplistic, ignoring cache latency, cache hits, branch mispredictions, scalar op latency, etc. cavatools is much more detailed and complete, won't be ready any time soon, and won't produce blender animations, so it is unsuitable as the demo for the bigint presentation.
(In reply to Jacob Lifshay from comment #2)
> the point is to have pretty animations of how data moves in the cpu so
> people understand how the wide-mul merging works (which requires something
> to construct those animations), not to have specific timing for exact
> performance comparisons.

ah ok then that's different - you might find this approach useful rather than attempting to use (or create a dependency on) a massive software suite named "blender":

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/dct_butterfly_svg.py;hb=HEAD
Luke, can you create a bigint-presentation-code.git repo for the code for this? it would include the animation code and the toom-cook multiplication algorithm generator since that needs matrix inversion
(In reply to Jacob Lifshay from comment #5)
> Luke, can you create a bigint-presentation-code.git repo for the code for
> this? it would include the animation code and the toom-cook multiplication
> algorithm generator since that needs matrix inversion

yep done.

matrix inversion, do you mean mirror/transpose? xy inversion options on sv.remap allow for dimension "mirroring", but unfortunately there are not enough bits spare to enable them in actual instructions, unless we go 64-bit (EXT001, NOT svp64), i.e. a "psvshape" instruction. temporarily you can blat manual values into the SVSHAPE SPRs, or you can use fmv/mrr with svshape2 enabling the inversion of the other dimension. mrr will run in reverse, but the yx field will invert the Y direction. have to try it and see.
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > Luke, can you create a bigint-presentation-code.git repo for the code for
> > this? it would include the animation code and the toom-cook multiplication
> > algorithm generator since that needs matrix inversion
>
> yep done.

thx

> matrix inversion, do you mean mirror/transpose?

no, i mean the matrix equivalent of 1 / x:
https://en.wikipedia.org/wiki/Invertible_matrix

it happens as part of generating the assembly program for toom-cook, so using svp64 isn't helpful there.
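(For the 2x2 case the "matrix 1/x" fits in a few lines, which may help make the distinction from transpose/mirroring concrete. This is a minimal sketch via the adjugate formula; the helper name is my own, and entries must already be `Fraction`s so the division stays exact:)

```python
from fractions import Fraction as F

def inv2x2(m):
    """Inverse of a 2x2 rational matrix: the matrix equivalent of 1/x,
    i.e. m @ inv2x2(m) is the identity. Entries must be Fractions so
    the divisions below are exact rather than float."""
    (a, b), (c, d) = m
    det = a * d - b * c
    assert det != 0, "singular matrix has no inverse"
    # adjugate divided by the determinant
    return [[ d / det, -b / det],
            [-c / det,  a / det]]
```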
(In reply to Jacob Lifshay from comment #7)
> no, i mean like the matrix equivalent of 1 / x.
> https://en.wikipedia.org/wiki/Invertible_matrix
> it happens as part of generating the assembly program for toom-cook so using
> svp64 isn't helpful there.

hey i remember doing those, urr 35 years ago, so uhn, not so helpful :) i have vague recollections about calculating them.

https://stackoverflow.com/questions/32114054/matrix-inversion-without-numpy#51214047

that looks pretty reasonable, still using matrix remap schedules, including the triple loop:

    for k in range(2 * n_Columns):
        Matrix[j][k] = Matrix[j][k] - ratio * Matrix[i][k]

with matrix remap you *don't* have to set the same target regs as is set in the example: you can use svremap to arbitrarily target any operation and any regs *in* that operation. the use of predication would be hair-raising though.
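(That elimination loop extends to a full Gauss-Jordan inverse in a dozen lines of plain Python, which is all the generator needs at compile time. A sketch using exact `fractions.Fraction` arithmetic -- the function name and the singular-matrix handling are my own choices, not code from the linked answer:)

```python
from fractions import Fraction

def matrix_inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination on an
    augmented [M | I] matrix, in exact rational arithmetic."""
    n = len(m)
    # augment each row with the corresponding identity row: [M | I]
    aug = [[Fraction(v) for v in row] +
           [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for i in range(n):
        # find a row with a non-zero pivot in column i and swap it up
        # (raises StopIteration if the matrix is singular)
        pivot = next(r for r in range(i, n) if aug[r][i] != 0)
        aug[i], aug[pivot] = aug[pivot], aug[i]
        # scale the pivot row so the pivot becomes 1
        div = aug[i][i]
        aug[i] = [v / div for v in aug[i]]
        # eliminate column i from every other row (the quoted loop)
        for j in range(n):
            if j != i:
                ratio = aug[j][i]
                for k in range(2 * n):
                    aug[j][k] = aug[j][k] - ratio * aug[i][k]
    # the right half of the augmented matrix is now M^-1
    return [row[n:] for row in aug]
```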
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #7)
>
> > no, i mean like the matrix equivalent of 1 / x.
> > https://en.wikipedia.org/wiki/Invertible_matrix
> > it happens as part of generating the assembly program for toom-cook so using
> > svp64 isn't helpful there.
>
> hey i remember doing those, urr 35 years ago so uhn not so helpful :)
> i have vague recollections about calculating them
>
> https://stackoverflow.com/questions/32114054/matrix-inversion-without-numpy#51214047
>
> that looks pretty reasonable, still using matrix remap schedules,

except that the matrix ops will be executed while creating the assembly program, not as part of the assembly program. this is essentially compile-time computation, so svp64 matrix ops aren't needed; a plain python function will do best.
(In reply to Jacob Lifshay from comment #9)
> except that the matrix ops will be executed while creating the assembly
> program, not as part of the assembly program. this is essentially
> compile-time computation, so svp64 matrix ops aren't needed, a plain python
> function will do best.

aw doh :)
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > Luke, can you create a bigint-presentation-code.git repo for the code for
> > this? it would include the animation code and the toom-cook multiplication
> > algorithm generator since that needs matrix inversion
>
> yep done.

Can you set it as public? the gitlab mirroring scripts (needed for gitlab ci) try to clone via https, but it failed:

    git clone https://git.libre-soc.org/git/bigint-presentation-code.git
    Cloning into 'bigint-presentation-code'...
    fatal: repository 'https://git.libre-soc.org/git/bigint-presentation-code.git/' not found
commit 05c03eb9551619d9a61874cd7639020e49a1999f (HEAD -> master, origin/master)
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Oct 5 22:03:33 2022 -0700

    add Matrix class

I added a Matrix class for matrices of rational numbers; it includes matrix multiply, inverse, and most of the other common ops. Unfortunately no one can see the code because the repo isn't yet public, as mentioned in the previous comment.

Later, I'll get around to splitting this bug up into sub-bugs that can be assigned funding.
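(A side note on why rationals rather than floats: the Toom-Cook interpolation step divides by small constants, such as 2 and 3 for Toom-3, and `fractions.Fraction` keeps those divisions exact where floating point would silently round, corrupting the generated coefficients. A trivial illustration of the difference:)

```python
from fractions import Fraction

# exact rational arithmetic: dividing by 3 and multiplying back
# recovers the original value with no rounding
third = Fraction(1, 3)
assert third * 3 == 1

# the float equivalents accumulate rounding error
assert (1.0 / 3.0) * 3.0 != 1.0 or True  # may or may not round-trip
assert 0.1 + 0.2 != 0.3                  # the classic example
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
```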