Implement poly1305 --- * TODO: copy style of chacha20 unit test https://bugs.libre-soc.org/show_bug.cgi?id=965#c0 this involves putting a python-based version of poly1305 in the unit test and comparing output * TODO: replace Index REMSP with bigmul REMAP when ready (see bug #1155) * TODO: (if time/budget) drop the algorithm into plain assembler (see bug #1007)

sadoon if you copy the style of chacha20 using indices (see bug #965) you will have something that is very straightforward to drop the bigmul REMAP in its place.

https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:37:56

(In reply to Sadoon Albader from comment #2) > https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:37:56 ok that's not a bad implementation, look for regular patterns. btw this will do as the unit test just drop that into the openpower-isa repo, make sure to preserve copyright, where it came from, and the license https://github.com/ph4r05/py-chacha20poly1305/blob/master/chacha20poly1305/poly1305.py hubert's is LGPLv2 https://github.com/ph4r05/py-chacha20poly1305/blob/master/LICENSE the regular patterns you can source the operands from alternate regs, using REMAP Index and Vertical-First just like in chacha20 so, line 176 https://github.com/floodyberry/poly1305-donna/blob/e6ad6e091d30d7f4ec2d4f978be1fcfcbce72781/poly1305-donna-64.h#L176 c = 0 # makes the pattern regular h1 += c c = (h1 >> 44); h1 &= 0xfffffffffff; h2 += c; c = (h2 >> 42); h2 &= 0x3ffffffffff; h0 += c * 5; c = (h0 >> 44); h0 &= 0xfffffffffff; h1 += c; c = (h1 >> 44); h1 &= 0xfffffffffff; h2 += c; c = (h2 >> 42); h2 &= 0x3ffffffffff; h0 += c * 5; c = (h0 >> 44); h0 &= 0xfffffffffff; h1 += c; # do this manually (outside the loop) what you can do is: * 2 regs with consts 0xffff and 0x3fffff then REMAP Indices 0 1 0 0 1 0 * another 2 regs with shift amounts 44 and 42 then REMAP Indices 0 1 0 0 1 0 oh look those are the same, use the exact same SVSHAPE as above * 2 regs with a constant 1 and 5 for a mul c*1 or c*5, here you will need a 2nd REMAP Index 0 0 1 0 0 1 * 3 regs h0 h1 h2 use another REMAP Index 1 2 0 1 2 0 if you are very lucky you can target the same reg for the shift as the AND, which means you don't need to do svremap inside the loop,you can do it outside. whatever you do, *go slowly*. even just the above is enough to get on with, do it *as its own unit test*. it will be: * 5 instructions to set up Index SVSHAPEs and loop counter, and setvl in vertical-first mode * a mul, an add, a shift, an AND * svstep to decrement the counter and bc to do the VF loop * an add for h1+=c 12 instructions, absolute max. (mul by 5 or 1 avoids pissing about with branches, which would damage parallelism). strongly suggest *only* doing this, as a first incremental step.

https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:44:00 konstantinos was only able to do this *after* i had created the test_caller_svp64_chacha20.py unit test. in ISACaller you can pre-load *and extract* all regs in python which is far far easier. go *slow*, maximum of 5-8 lines of assembler at a time, and bootstrap your way up. then once the snippets are ready in test_caller_xxxx.py you are ready to move to pypowersim. you will need to do a *full* function as a ".s" file just like how konstantinos did it, starting from: * a POWER v3.0 c function compiled to assembler with gcc -S * dropping that into the test binary *in place* of the c file * converting line by line to svp64 assembler, taking the fragments from the *python* unit test and dropping them in. all of this is slow patient and above all *incremental* work that you can in no way shortcut or attempt to just go "i will do everything from cold".

"This form is the most parallel possible. We could compute every product in parallel, then sum everything, map-reduce style. There’s only one problem: we need to compute r squared, cubed, and so on. To avoid this, we can take a hybrid approach, with two applications of Horner’s rule instead of just one:" https://loup-vaillant.fr/tutorials/poly1305-design? sadoon we have Parallel Prefix REMAP for doing these cascading multiplies but i am not sure it s worth it here.

sadoon always always always cross-reference. do it immediately at the time that you have the discussion. https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:26:58

https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:39:01 what is happening here is explained in this page https://libre-soc.org/openpower/sv/biginteger/analysis/ adde uses XER.CA as carry-in and creates XER.CA as carry-out. therefore if you do adde adde adde adde with successive register numbers you have created a big-int add by chaining the carrys together on each adde. *therefore* you can *replace the adde sequence with **ONE** sv.adde instruction. real simple. the maddedu instruction works similarly by treating one of the reg pair as a *64-bit* carry-in carry-out. btw one absolutely critical thing to understand about Simple-V is that the Vector Length is in **ELEMENTS NOT BITS**. this makes it natural for humans to work with.

sadoon i'm going to do a quick python port of the excellent algorithm you found (written in c). this will make it easier for you to understand the porting process and also keep to pure python.

added hubert's LGPLv2.1 pure-python implementation of poly1305 https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=8d18377c0

https://albader.co/poly1305.sv https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:49:42

(In reply to Luke Kenneth Casson Leighton from comment #8) > sadoon i'm going to do a quick python port of the > excellent algorithm you found (written in c). this > will make it easier for you to understand the porting > process and also keep to pure python. Thanks for that, was planning to post my findings and chat logs ASAP but got busy with something urgent over the weekend.

(In reply to Sadoon Albader from comment #11) > Thanks for that, was planning to post my findings and chat logs ASAP but got > busy with something urgent over the weekend. no problem. i am currently going "oink" over how MUL macro works in poly1305-donna-64.h which is... behaving oddly

(In reply to Luke Kenneth Casson Leighton from comment #12) > (In reply to Sadoon Albader from comment #11) > > > Thanks for that, was planning to post my findings and chat logs ASAP but got > > busy with something urgent over the weekend. > > no problem. i am currently going "oink" over how MUL macro works > in poly1305-donna-64.h which is... behaving oddly It's quite simple really, I hope this answers your concern: - unsigned long long is 64 bits on 64-bit systems. - d0,d1,d2, and d are uint128_t variables. - all this MUL macro is doing is typecasting the multiplier to 128-bit unsigned to ensure that the result stays in 128-bits, because in all cases in the poly1305_blocks function the multiplication is between two 64-bit unsigned ints which results in a maximum of 128-bit length ints. FWIW if you see any odd behavior, my poly1305.sv implementation you linked to earlier uses pure hardware multiplication and can replicate results from poly1305-donna as long as they are in multiples of 16 bytes (I had not implemented "leftover" processing due to lack of time back then). This was verified both in testbench and on two FPGAs (CPLD & FPGA if I need to be pedantic) Here's the testbench file (also from over two years ago, sorry for the messy code) https://albader.co/poly1305_tb.sv I have only verified this working with ModelSim back then, and have not verified it against the final poly1305.sv (64-bits vs 8-bits) file but it should just work.

successful first port: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ecda34dbc as there are carry-rollovers it is *not* guaranteed that these are the same, because the carry-over may occur intermittently. this will need some random input (or just some sequential data) to check that poly1305.py and poly1305-donna.py are equivalent, at which point it becomes possible to start "morphing" the algorithm in the same way as was done for chacha20: extract "schedules" around "repeated stuff"

(In reply to Luke Kenneth Casson Leighton from comment #3) > so, line 176 > https://github.com/floodyberry/poly1305-donna/blob/ > e6ad6e091d30d7f4ec2d4f978be1fcfcbce72781/poly1305-donna-64.h#L176 > > c = 0 # makes the pattern regular > h1 += c c = (h1 >> 44); h1 &= 0xfffffffffff; okaaay nooow we have poly1305-donna-64bit.h converted to python, *now* it is possible/convenient to morph towards a version that "illustrates" how REMAP/Indexed would be dropped in, easily https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d8ae5d7fa this is *very deliberately* slow-and-steady work, sadoon: you have to keep the unit test working *at all times*, never allowing not even one single micro-morph-step to produce a wrong result nor deviate along the "i think i will just start writing some massive amounts of untested code from scratch with no way to test it except after about 2 months daunting work" path. in that way, you know: * where you are * what the goal is * you can work from both ends to meet in the middle * and you always have the unit test to tell you if your changes still work with that one illustration there, you should be able to do the exact same conversion trick (look for the common pattern, "break out" the indices/consts into lists) and once all done it will be near-ridiculously-trivial to convert each section to SVP64 assembler, and once all fragments are converted, it becomes just as ridiculously-trivial to stitch them all together into a single algorithm.

(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Luke Kenneth Casson Leighton from comment #3) > okaaay nooow we have poly1305-donna-64bit.h converted to python, Awesome! Will take a look at it tomorrow, thanks! I always like to say that writing unit tests before the code makes the process more fun and effective.

Alright just git pulled openpower-isa to test things out, I wanted to make sure that we can handle both messages larger than 16 bytes *and* messages with leftover bytes (not multiple of 16). I grabbed the known MAC from poly1305-donna with its key and message, all seem to be working flawlessly. Brilliant Luke! Time to get to porting to SVP64, one baby step at a time.

(In reply to Sadoon Albader from comment #17) > Alright just git pulled openpower-isa to test things out, I wanted to make > sure that we can handle both messages larger than 16 bytes *and* messages > with leftover bytes (not multiple of 16). I grabbed the known MAC from > poly1305-donna with its key and message, all seem to be working flawlessly. > Brilliant Luke! awesome. did you remember to add those verifications as unit tests? i had a bug where c was multiplied by *four* not five and so the possibility i described in comment #14 is a very real one. i look forward to seeing you follow the standard project development procedures (which i also look forward to not being placed into a position of having to remind you of again) to put the commit git diff URL into a bugreport alongside reports such as the above, with those unit tests. i suggest just doing a test_poly1305.py which imports from both poly1305.py and poly1305_donna.py and chucks a whole stack of random-length random-data at them both and compares the two. > Time to get to porting to SVP64, one baby step at a time. i just tried adding a python-based implementation of dsrd which *should* be possible to use https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=49d5222a you can see places where it could potentially be used: h0 += (( t0 ) & 0xfffffffffff); h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; that would be *TWO* uses of dsrd, not 3 - because h2 ends up with the remainder of t1 from the *second* use of dsrd. the dsrd calls can be split out: dd0 = (( t0 ) & 0xfffffffffff); dd1 = (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); dd2 = (((t1 >> 24) ) & 0x3ffffffffff) h0 += dd0; h1 += dd1; h2 += dd2 | hibit; where (guessing here): dd0,dd1 = dsrd(t0, 0, 44) dd0,dd1 = dsrd(t0, t1, 44) you see how that would work? it's not quite there, but close.

I'm trying to put as much of my focus on the multiplication right now, here goes: The multiplication is literally just long multiplication with a nice trick (read https://loup-vaillant.fr/tutorials/poly1305-design to understand better.) With normal multiplication, if we had hundreds times hundreds (because we are using 3 64-bit integers of course), it'd go like this: h2 h1 h0 r2 r1 r0 * -------------------------------------------------- h2r0 h1r0 h0r0 | h2r1 h1r1 h0r1 | h2r2 h1r2 h0r2 | -------------------------------------------------- + and the result would be: h2r2 h2r1+h1r2 h2r0+h1r1+h0r2 h1r0+h0r1 h0r0 Note that h2r2 can also be 2 digits, which makes the final number a maximum of 6 digits. So we limit the 64-bit variables of both h[x] and r[x] to 44 bits for elements 0 and 1 and 42 bits for element 2, to ensure that they don't overflow in addition over 2^130, and don't exceed 128-bits in 64*64-bit multiplication. Modular multiplication does things a little differently, to get the remainder, we don't divide by (2^130)-5, we instead multiply by 5 * 2^(130-128), and lo and behold, we get our remainder *if* the number is less than p (2^130-5, our prime). To do this, we divide the two leftmost numbers by 2^128 (essentially moving them to the right as you will see below) and then multiply them by 5 * 2 ^ 2 Sounds familiar? It's literally sll and srl, nothing special. Now let's look at what happens: h2 h1 h0 r2 r1 r0 * ------------------------------------------------------- h2r0 h1r0 h0r0 | h1r1 h0r1 h2r1*20 | h0r2 h2r2*20 h1r2*20 | ------------------------------------------------------- + Now if you read the C/Python implementation, it makes perfect sense why: s1 = r1 * (5 << 2) s2 = r2 * (5 << 2) d0 = h0r0 + h1s2 + h2s1 d1 = h0r1 + h1r0 + h2s2 d2 = h0r2 + h1r1 + h2r0 Stitch the d's together and do carry propagation: h0 = (d0 ) & 0xfffffffffff (44 bits) h1 = (d1 + d0 >> 44) & 0xfffffffffff h2 = (d2 + d1 >> 44) & 0x3ffffffffff h0 = (h0 + (d2 >> 42)) * 5 h1 = h1 + (h0 >> 44) h0 = h0 & 0xfffffffffff This way, we can have an array of 3 64-bit integers h[x] where h[0-2] = h0,h1,h2 And that's our hash for one 16-byte block. Thoughts re: implementing in SVP64: I still don't fully understand how SVP64 works but I'll make assumptions to brainstorm and hopefully be pushed in the right direction: We can use a maximum of 64-bit registers *unless* we use bigint. However, 64-bit registers are what is needed for implementing a 32-bit version of poly1305, not 64-bits, because multiplying 44*44 (or 42*42 minimum) exceeds 64-bits but is well below 128-bits. Given the parallel nature of SVP64, we might as well implement a 32-bit version which limits itself to 26-bits in all variables (130 / 26 = 5) and we would need a total of 10 registers for r[] and d[]. Since bigint basically stitches numbers together and uses 64-bit multiplication anyways (please do correct me if I'm wrong), there's no benefit to using 64-bits as opposed to 32-bits, could be quite the opposite with the bigint overhead (again correct me if I'm wrong) Now with that in mind, let's assume this (it'll be more numbers but the pattern repeats) d0 = h0r0 + h1s2 + h2s1 d1 = h0r1 + h1r0 + h2s2 d2 = h0r2 + h1r1 + h2r0 If we load h[] and r[] to our SVSHAPEs, we can use the clear pattern of 0,0 1,2 2,1 0,1 1,0 2,2 0,2 1,1 2,0 The leftmost number is n where n = 0..2, the rightmost number is n where n = 0..2 in the first case, n = 2,0,1 in the second case (overflowing 2 to 0) and n = 1,2,0 (again starting from 1, overflowing from 2 to 0) We multiply all the numbers first in parallel, then add the first 3 additions in parallel, then the second addition. Seems doable in 3 instructions after register preparation at least for this part? Long post I know but I wanted to get things right :)

Relevant discussion: (the matrix bridge crashed at the beginning, I copy-pasted the comments before it starts) Sadoon: https://bugs.libre-soc.org/show_bug.cgi?id=1157#c19 Sadoon: I know you and Luke will probably enjoy this Sadoon: I disected the algorithm to find patterns that can be useful for SVP64 Sadoon: I also found that we should probably do 32-bit ints instead of 64-bit Continues on the irclog below: https://libre-soc.org/irclog/%23libre-soc.2023-09-18.log.html#t2023-09-18T20:28:04

(In reply to Sadoon Albader from comment #20) > Relevant discussion: (the matrix bridge crashed at the beginning, I > copy-pasted the comments before it starts) > > Sadoon: https://bugs.libre-soc.org/show_bug.cgi?id=1157#c19 > Sadoon: I know you and Luke will probably enjoy this > Sadoon: I disected the algorithm to find patterns that can be useful for > SVP64 excellent. now read the pseudocode for svfixedarith (below) ad the bigint test cases. see comment #18 commitdiff. > Sadoon: I also found that we should probably do 32-bit ints instead of 64-bit no, because of dsrd and maddedu, both of which are designed as 3-in 2-out instructions. https://libre-soc.org/openpower/isa/svfixedarith/ this is why i converted poly135-donna-64.h not poly1305-donna-32.h

(In reply to Sadoon Albader from comment #19) > If we load h[] and r[] to our SVSHAPEs, we can use the clear pattern of > > 0,0 1,2 2,1 > 0,1 1,0 2,2 > 0,2 1,1 2,0 ta-daaa. that is *exactly* the pattern that jacob is organising for the bigint REMAP. > The leftmost number is n where n = 0..2, the rightmost number is n where n = > 0..2 in the first case, n = 2,0,1 in the second case (overflowing 2 to 0) > and n > = 1,2,0 (again starting from 1, overflowing from 2 to 0) x, (x+y)%MAXVL > We multiply all the numbers first in parallel, then add the first 3 additions > in parallel, then the second addition. Seems doable in 3 instructions after > register preparation at least for this part? absolutely. utterly ridiculously easy, isn't it? now write it out in python - in poly1305-donna.py - to make absolutely sure you got it right. but first *write that unit test i said twice already is essential*. please do not make me repeat things more than once. comment #14 comment #18

(In reply to Sadoon Albader from comment #19) > We can use a maximum of 64-bit registers *unless* we use bigint. scalar instructions DESIGNED to accelerate bigint operations. see svfixedarith. these are ***SCALAR** instructions. they are NOT repeat NOT vector instructions. the whole purpose of this grant is to design SCALAR instructions *AND* to design a vector-looping-system that can take advantage of those SCALAR instructions. we found that maddedu was really damn good when auto-looped by SVP64. but this task "actually shows and proves it". if it *doesn't* do that then we work out *why*, go back to the drawing board and *design sclar instructions that do*. this is how it works.

(In reply to Luke Kenneth Casson Leighton from comment #22) > (In reply to Sadoon Albader from comment #19) > > x, (x+y)%MAXVL Ooooh that's the one, perfect. > but first *write that unit test i said twice already is essential*. > please do not make me repeat things more than once. Yup yup of course, don't worry, on it as I type this :)

ok the interception is added of all math primitives https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a79fe44cf but reporting is only done on ADD and ADDLO, for now. there's no filtering, and no "where the heck did that add-operation come from again??" which i will add later.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0857d123 this adds some *real* basic filtering, to show only those ADD/ADDLO operations that have a 2nd arg between 0 and 7. there's actually only one non-zero add i have seen so far, add 5. this is going to need a *lot* more data run through it but also some capturing of more information such as "what part of the algorithm was using ADD/ADDLO". for tomorrow...

This is a copy-pasta of poly1305-donna.py BUT now has a small function that runs ISACaller in such syntax: simulator(lst, RS, RA, RB) https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=6a798c6eb31f17bc2e7e4e2f2b25e51c3de39044 Obviously doesn't work with immediate-based instructions but the most important part is: I have added ASM inlining using ISACaller in a very simple way, which means I can finally move to the SVP64 implementation while still using the same tests. For now, only the first h0+= ... is implemented in ISACaller, I explicitly did this for testing whether this works, and luckily it only needs one instruction :D I'll need to reconsider later whether I will use this function (likely not when things grow bigger but it's useful nonetheless) but it was a good exercise to a- revise my python skills and b- make sure I know how to call ISACaller

(In reply to Sadoon Albader from comment #27) > This is a copy-pasta of poly1305-donna.py as a general rule that is not done. it results in a hell of a maintenance headache if a bug is found in *one* of the *copies* of the duplicated code. also i cannot review the diff because it isn't a diff! i.e. you made *two* changes - two actions - and merged them into one commit. 1. you copied the file AND 2. you made changes to the file. (the clue being the word "AND", this is not allowed under the HDL_workflow development practices) please the simulator() function to the original file. (and remove this one. two separate commits please) also you missed adding yourself as a Copyright holder.

(In reply to Luke Kenneth Casson Leighton from comment #28) > please the simulator() function to the original file. > (and remove this one. two separate commits please) > > also you missed adding yourself as a Copyright holder. My bad, I got ahead of myself from the excitement xD Thanks for the input. I guess I'll do this then: Add this and all the future simulations to the original and always assert (simulation result == original code result) for each simulation run to guarantee that it is always correct. This means not just the program output but the output of chunks of code of course. Will fix this on Friday as it is almost midnight here.

(In reply to Luke Kenneth Casson Leighton from comment #28) > also you missed adding yourself as a Copyright holder. https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=38cc7d7633cd96130d984faa6df68187ae333d1f;hp=b84df17c929d1f28d0e2c458e59bea26d512134f Committed to a new branch to be safe, the original python code was written by yourself but the simulator/SVP64 stuff is being written by me, do I just add myself to the copyright in a separate line? Again this is only for me to familiarize myself with ISACaller and PPC asm. Next step is to do all the h[x] += ... block in a single SVP64 run which is a perfect opportunity to start with SVP64 as you suggested in https://bugs.libre-soc.org/show_bug.cgi?id=1157#c3 :) If you run this against the rand.csv generated by running the C code > rand.csv it should run perfectly with no errors, but I think there might be a tiny bug due to the way I did the asm. Not a big deal because I'm only using the current asm stuff to make porting *that* to SVP64 more streamlined and will delete the original asm once I'm done with the SVP64 one.

My current brainstorming for the initial h[x]+= block: 1- store 0xfffff... (twice) and 0x3ffff... in three registers (p0,p1,p2) 2- store h0, h1, and h2 in 3 registers 3- split t0 and t1 into 3 registers: t0_s = t0 t1_s = t0 >> 44 | t1 << 20 t2_s = t1 >> 24 | hibit (this might be wrong, need to check) 4- setvl 3? 9? 5- the final run should be something like this: h_s[x] += t[x]_s & p[x] Which is perfectly doable in two or a few more SVP64 lines if I'm now understanding things correctly.

(In reply to Sadoon Albader from comment #31) > My current brainstorming for the initial h[x]+= block: > > 1- store 0xfffff... (twice) and 0x3ffff... in three registers (p0,p1,p2) yep > 2- store h0, h1, and h2 in 3 registers > 3- split t0 and t1 into 3 registers: > t0_s = t0 > t1_s = t0 >> 44 | t1 << 20 > t2_s = t1 >> 24 | hibit (this might be wrong, need to check) hibit is "1<<40 if final else 0" you can use sv.dsrd here with some "morphing". notice how 44+20 == 64 and also 24+40 (from the hibit shift) is also == 64? therefore if you do: * t2 = 1 * set an array of [44,24] in RB * point RA at t0 * point RT at t1_s you have created t1_s with one vector instruction. now, of course, it looks pointless because you have one instruction to setvl=2 then sv.dsrd is another, you might as well have done two dsrd instructions! so try that first ok? theeen.... ha ha you'll like this: try with *three* dsrd instructions, but use an array [0,44,24] in RB! > 4- setvl 3? 9? 2, starting at t1_s and going to t2_s. but if you do the trick above of RB=[0,44,24] and start at t0_s then you get the "t0_s = t0 >> 0 | something" , arrange something to contain zero and voila, no need to do a manual copy (no addi t0_s,t0,0) > 5- the final run should be something like this: > > h_s[x] += t[x]_s & p[x] t_s[x] but yes > Which is perfectly doable in two or a few more SVP64 lines if I'm now > understanding things correctly. yes it is. if you stick to Horizontal-First then annoyingly it has to be done as: # first HF instruction, sv.and for x in range(VL) tmpregs[x] = t_s[x] & p[x] # 2nd HF instruction, sv.add for x in range(VL) h_s[x] += tmpregs[x] which given the wasted extra vector of tmpregs is precisely why Vertical First was invented, as that becomes: for x in range(VL) tmpreg = t_s[x] & p[x] h_s[x] += tmpreg note tmpREG scalar rather than tmpREGS vector, but you need a loop and an svstep. instruction which is more instructions, so you only use VF if the regfile is under pressure or you just want to show off :)

(In reply to Luke Kenneth Casson Leighton from comment #3) > what you can do is: > > * 2 regs with consts 0xffff and 0x3fffff then REMAP Indices > 0 1 0 0 1 0 > * another 2 regs with shift amounts 44 and 42 then REMAP > Indices 0 1 0 0 1 0 oh look those are the same, use the > exact same SVSHAPE as above > * 2 regs with a constant 1 and 5 for a mul c*1 or c*5, > here you will need a 2nd REMAP Index 0 0 1 0 0 1 > * 3 regs h0 h1 h2 use another REMAP Index 1 2 0 1 2 0 I've been trying to understand this for 6 hours so.. here goes: x = [0] * 16 x[0] = 0xffff x[1] = 0x3fffff x[2] = 44 x[3] = 42 x[4] = 1 x[5] = 5 x[6] = h1 x[7] = h2 x[8] = h0 setvl 0, 0, 2(?), 1, 1, 1 # 2 because we have two registers? svindex 0/2, 0, 1, 0(?), 0, 1, 0 # 0 because they are 64-bit? svindex 2/2, 1, 1, 0, 0, 1, 0 svindex 4/2, 2, 1, 0, 0, 1, 0 setvl 0, 0, 3(?), 1, 1, 1 # 3 because we have 3 registers? svindex 6/2, 3, 1, 0, 0, 1, 0 I would start with the remaps but I highly doubt this setvl and svindex instructions are correct. The ChaCha20 pages and code are elaborate but I need a few much simpler examples to wrap my head around this.

(In reply to Sadoon Albader from comment #33) > (In reply to Luke Kenneth Casson Leighton from comment #3) > > what you can do is: > > > > * 2 regs with consts 0xffff and 0x3fffff then REMAP Indices > > 0 1 0 0 1 0 let's focus on this first > I've been trying to understand this for 6 hours so.. (edit: you should have written 5h50m sooner, then!) > here goes: > > x = [0] * 16 > x[0] = 0xffff > x[1] = 0x3fffff start the indices for these at say r20. use 8-bit so put the indices 0 1 0 0 1 0 in as 0x00_01_00_00_01_00 for a value of 0x000100000100 into r20. if you set 64-bit indices then you need * 0x000000000000000 in r20 * 0x000000000000001 in r21 * .... * .... r25 which is as you might guess a total waste of 5 registers. > x[2] = 44 > x[3] = 42 > x[4] = 1 > x[5] = 5 > x[6] = h1 > x[7] = h2 > x[8] = h0 those are the *values* not the indices (offsets) *to* those values. > setvl 0, 0, 2(?), 1, 1, 1 # 2 because we have two registers? > svindex 0/2, 0, 1, 0(?), 0, 1, 0 20/2 because you point at the INDICES not the VALUES. > # 0 because they are 64-bit? 3 because the INDICES are 8-bit otherwise you waste 5 registers > I would start with the remaps but I highly doubt this setvl and svindex > instructions are correct. The ChaCha20 pages and code are elaborate but I > need a few much simpler examples to wrap my head around this. not going to happen :) or... it is but you need to get the REMAP concept first. read the spec, pay attention to the pseudocode, start of the page rhen on section 3.4 https://libre-soc.org/openpower/sv/remap/ basically ask me and i will guide you through it. please *do not* waste your time thinking "i MUST do this on my own because of some stubborn sense of feeling i must do it independently", this is completely new Computer Science that nobody has ever seen before ok?

sadoon read this unit test. https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92

(In reply to Luke Kenneth Casson Leighton from comment #35) > sadoon read this unit test. > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/ > decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92 The first thing that caught my eye was this line: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l110 Says there that vl=10, clearly it is set to 6, is that just a typo? EDIT: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l46 Yes indeed, looks like it was copy-pasted :P Nevermind. I'm at that stage where I can't see the forest for the trees.

(In reply to Sadoon Albader from comment #36) > Says there that vl=10, clearly it is set to 6, is that just a typo? ignore or fix it, i don't mind which. > Nevermind. I'm at that stage where I can't see the forest for the trees. no forest, no trees. it is very simple. go through it step by step. overview. SV Loops https://libre-soc.org/openpower/sv/overview/ for i = 0 to VL-1: GPR(RT+i) = GPR(RA+i) + GPR(RB+i) remap. dead simple. https://libre-soc.org/openpower/sv/remap/ for i in range(VL): GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)]; indexed. also dead simple. def index_remap(i): return GPR((SVSHAPE.SVGPR<<1)+i) for i in 0..VL-1: GPR(RT + indexed_remap(i)) = ... *THAT IS ALL THERE IS TO IT*. full stop. you can see it right there in the unit test expected results. 118 for i in range(6): 119 RA = initial_regs[0+ ----===>>>idxs[i]<<<====----] 120 RB = initial_regs[0+i] 121 expected_regs[i+8] = RA+RB RA: Index-REMAPped RB: *NOT* REMAPped. you are expecting the VALUES to "magically" be REMAPped but have not supplied any INDEXes. you must SUPPLY the Indices to use for REMAP and i recommended they start at r20. this is NOT THE SAME AS THE VALUES which start at r0. how else can the values be REMAPped if you supply no indices? to repeat: THE INDICES HAVE NOTHING TO DO WITH THE VALUES. conceptually the code that you wrote is attempting to do this: 118 for i in range(6): 119 RA = idxs[i] # values conflated as indices

(In reply to Luke Kenneth Casson Leighton from comment #35) > sadoon read this unit test. > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/ > decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92 Looking at this unit test again after toying with it and changing values here and there and I now understand much better. I had expected the values to be filled with ASM as well, but here we're (understandably) using python to override the registers as it's much simpler that way. The obligatory question then: Would it be reasonable to just prepare the registers with the data we need and just do the SVP64 parts in ASM? Also by extension, setting svstate.x to the needed values might help me finish up the implementation faster. Even if we needed the actual SVP64 ASM in the end, I'd at least get to a known working state that I can translate to ASM. I hope I'm making sense :)

(In reply to Luke Kenneth Casson Leighton from comment #37) > you are expecting the VALUES to "magically" be REMAPped but have not > supplied any INDEXes. you must SUPPLY the Indices to use for REMAP > and i recommended they start at r20. this is NOT THE SAME AS THE VALUES > which start at r0. how else can the values be REMAPped if you supply no > indices? Ahh now I see my mistake here! Still toying with the same unit test, it's clearing things up a lot.

(In reply to Sadoon Albader from comment #38) > (In reply to Luke Kenneth Casson Leighton from comment #35) > > sadoon read this unit test. > > > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/ > > decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92 > > Looking at this unit test again after toying with it and changing values > here and there and I now understand much better. > > I had expected the values to be filled with ASM as well, but here we're > (understandably) using python to override the registers as it's much simpler > that way. > > The obligatory question then: > > Would it be reasonable to just prepare the registers with the data we need > and just do the SVP64 parts in ASM? as incremental *partial* unit tests... where you do say the h0/1/2 calculation and only have say 3 lines of assembler... absolutely yes! if you look at comment #3 that's *exactly what i suggest*! > Also by extension, setting svstate.x to > the needed values might help me finish up the implementation faster. yes, as we just discussed here about how svindex unit test does precisely that... yes. but again, see comment #3, you should do *sub* tests only that way, and *SLOWLY* build up to a larger (full, complete) functio that doesn't do that trick https://libre-soc.org/meetings/sync_up/sync_up_2023-11-21/ > Even if > we needed the actual SVP64 ASM in the end, I'd at least get to a known > working state that I can translate to ASM. eexactly. > I hope I'm making sense :) indeed - you can see in comment #3 and from other tests that this is precisely what i expected you to do as it has worked well for me for years :)

Alright, finished up the first part of blocks. I've decided not to create a test_svp64* file just yet, as that requires rewriting the python implementation for poly1305 (class should inherit from FHDLTestCase etc and that gave me many issues). I've decided to focus more on the SVP64 part to get it working because once that works the rest can be done easily. A simple function to call ISACaller with the given lst and initial_regs: def simulation_svp(lst,initial_regs): with Program(lst, bigendian=False) as program: sim = run_tst(program, initial_regs) return sim.gpr And we can then use the gpr in python, basically like inline asm. The SVP64 implementation for the first part of blocks looks like this: 'or 11, 9, 9', # move t0 to r11 'rldicl 20, 9, %d, 44' %(64-44), # equivelant to srdi 'rldicr 21, 10, 20, %d' %(63-20), # equivelant to sldi 'or 12, 20, 21', # move result to r12, 20&21 are temps 'rldicl 13, 10, %d, 24' %(64-24), 'or 13, 13, 30', # hibit or'ing, might be wrong but tests pass 'setvl 0, 0, 3, 0, 1, 1', 'sv.and *11, *11, *0', 'sv.add *3, *3, *11', Which is this in python: h0 += t0 & 0xfffffffffff h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; Then we can assert to check that the values are correct assert (h* == final_regs[*].value), "h* and h* simulation are unequal!" I've added a comment that the first part could use optimization but I think I'll focus on the actual multiplication and addition first. https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=c89f618a57b299169e1bcbe4a342278247ad2ce5

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c89f618a57b299169e1bcbe4a342278247ad2ce5 looks good sadoon. first sv instructions! + 'or 11, 9, 9', # move t0 to r11 + 'rldicl 20, 9, %d, 44' %(64-44), # equivelant to srdi + 'rldicr 21, 10, 20, %d' %(63-20), # equivelant to sldi + 'or 12, 20, 21', # move result to r12, 20&21 are temps + 'rldicl 13, 10, %d, 24' %(64-24), + 'or 13, 13, 30', # to accommodate hibit convert those to *scalar* dsrd. comment #18. do *not* attempt to go straight to svp64. incremental. replace with *scalar* first. then go, "huh. these two dsrd instructions are using sequentially- incrementing registers. hey i know, that means i can put "sv." in front and use only one instruction!" notice how i did *not* say "go straight to something complex that you have never used before and do everything at once"? 8ncremental steps, incremental steps, incremental steps.

(In reply to Sadoon Albader from comment #41) > Which is this in python: > > h0 += t0 & 0xfffffffffff > h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); > h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; see comment #18. it is something like this in python: r1, t0 = dsrd(t0, 0, 44) r2, t1 = dsrd(t1, t0, 44) h1 += r1 h2 += r2 or close to it.

Found a document for poly1305. Good for understanding the algorithm. https://cryptography.io/en/latest/hazmat/primitives/mac/poly1305/#

(In reply to shriya.sharma from comment #44) > Found a document for poly1305. Good for understanding the algorithm. > > https://cryptography.io/en/latest/hazmat/primitives/mac/poly1305/# niice. i'll link on the documentation bugreport, bug #1158

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=951586077f4bc63339a3e5e9b686dae715b87d73 Implemented first half of blocks in dsrd. Made use of setvl as well except for dsrd but will use it for dsrd as well.

(In reply to Sadoon Albader from comment #46) > https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff; > h=951586077f4bc63339a3e5e9b686dae715b87d73 > > Implemented first half of blocks in dsrd. > ha! delighted to see! > Made use of setvl as well except > for dsrd but will use it for dsrd as well. + 'dsrd 11, 1, 3, 12', + 'dsrd 6, 0, 4, 13', + 'setvl 0, 0, 2, 0, 1, 1', yes by moving r0 r1 and r6 and r11 you should be able to do + 'dsrd 6, 0, 3, 12', + 'dsrd 7, 1, 4, 13', which becomes + 'setvl 0, 0, 2, 0, 1, 1', + 'sv.dsrd *6, *0, *3, *12', interested to know what the rldiclr is about? what does it do? btw you *may* have noticed this? h0 += t0 & 0xfffffffffff; h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; can become: h0 += (((t-1>> 0 ) | (t0 << 0 )) & 0xfffffffffff); h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; (where t-1 is deliberately set to zero) therefore you can have *three* dsrds, set vl to *three* not two, this should reduce by one instruction at least as you can set vl=3 just the once? also.. + 'sv.or *11, *11, *5', should that just be + 'or 13, 13, 5' assuming h0 is r11 h1 is r12 h2 is r13 ?

p.s. good progress :)

(In reply to Luke Kenneth Casson Leighton from comment #48) > p.s. good progress :) Thanks :) I'm actually focusing on the add/mul part since it's the bulk of the work, I can deal with sv.dsrd later. I've edited the python code a bit to show the full picture: d0=MUL(h0,r0);d=MUL(h1,s2);d0=ADD(d0,d);d=MUL(h2,s1);d0=ADD(d0,d); d1=MUL(h0,r1);d=MUL(h1,r0);d1=ADD(d1,d);d=MUL(h2,s2);d1=ADD(d1,d); d2=MUL(h0,r2);d=MUL(h1,r1);d2=ADD(d2,d);d=MUL(h2,r0);d2=ADD(d2,d); The dependencies here only apply to add, meaning we can do 9 multiplications first in one go, and then add twice (assuming d here is a separate GPR for each multiplication). Now, I was about to do this like I assumed in #19, but let's take a second look at this: We have h0,h1,h2; r0,r1,r2; s1;s2 The multiplication occurs as follows: d0 = h0r0 + h1s2 + h2s1 d1 = h0r1 + h1r0 + h2s2 d2 = h0r2 + h1r1 + h2r0 If we arrange the registers like this: [r2,r1,r0,s2,s1] 1 2 3 4 5 We can do this: setvl to 3 sv.mul *RT, *h, *3 sv.mul *RT, *h, *2 sv.mul *RT, *h, *1 Perfectly reasonable, but I don't like this. These muls can be one sv.mul if we arrange things correctly. I think we need svindex but am still trying to learn svindex and svremap.

discussed during meeting: https://libre-soc.org/meetings/sync_up/sync_up_2024-01-16/

(In reply to Sadoon Albader from comment #49) > (In reply to Luke Kenneth Casson Leighton from comment #48) > > p.s. good progress :) > > Thanks :) > > I'm actually focusing on the add/mul part since it's the bulk of the work, I > can deal with sv.dsrd later. > > I've edited the python code a bit to show the full picture: > > > d0=MUL(h0,r0);d=MUL(h1,s2);d0=ADD(d0,d);d=MUL(h2,s1);d0=ADD(d0,d); > > d1=MUL(h0,r1);d=MUL(h1,r0);d1=ADD(d1,d);d=MUL(h2,s2);d1=ADD(d1,d); > > d2=MUL(h0,r2);d=MUL(h1,r1);d2=ADD(d2,d);d=MUL(h2,r0);d2=ADD(d2,d); > > The dependencies here only apply to add, meaning we can do 9 multiplications > first in one go, and then add twice (assuming d here is a separate GPR for > each multiplication). > > Now, I was about to do this like I assumed in #19, but let's take a second > look at this: > > We have h0,h1,h2; r0,r1,r2; s1;s2 > > The multiplication occurs as follows: > > d0 = h0r0 + h1s2 + h2s1 > d1 = h0r1 + h1r0 + h2s2 > d2 = h0r2 + h1r1 + h2r0 > > If we arrange the registers like this: > [r2,r1,r0,s2,s1] > 1 2 3 4 5 > We can do this: > > setvl to 3 > sv.mul *RT, *h, *3 > sv.mul *RT, *h, *2 > sv.mul *RT, *h, *1 > > Perfectly reasonable, but I don't like this. yep they are supposed to be covered by bigmul REMAP but we really need the indices via svindex first, and then can replicate them in bigmul-REMAP. i *think* i have it, i did add modulo and i did add a "x+y" mode so in theory it should be possible to produce 0 2 1 1 0 2 2 1 0 > These muls can be one sv.mul if > we arrange things correctly. I think we need svindex but am still trying to > learn svindex and svremap. remember you earlier made the mistake of confusing indices with values, but then resolved that. looks like you need a refresher/reminder https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;hb=HEAD pay attention to lines 167, 169 and 181. run the unit test, use print statements if you don't follow immediately ok? the mapping system is a little awkward you have to remember that SVSHAPEs 0-3 are *NOT* automagically connected up to registers. that relationship is set by *svremap* which sets up the 5 possible registers RA RB RC RT RS with the indices 0-3.

(In reply to Luke Kenneth Casson Leighton from comment #51) > 0 2 1 > 1 0 2 > 2 1 0 actually this is x-y modulo 3, isn't it? that's also possible, i added the option to +/- on both x *and* y. jacob how are you getting on with svshape3? bug #1155 comment #0

(In reply to Luke Kenneth Casson Leighton from comment #52) > (In reply to Luke Kenneth Casson Leighton from comment #51) > > > 0 2 1 > > 1 0 2 > > 2 1 0 > > actually this is x-y modulo 3, isn't it? Yup! We discussed this during yesterday's meeting: https://libre-soc.org/meetings/sync_up/sync_up_2024-01-16/ Jacob helped out a lot, Andrey did too. So this is basically the direction. svindex + svremap.