bug #784 -- EUR 3000: GF(2) polynomial ops: * clmul, clmulh, clmulr * cldiv, clrem (maybe) -- decide what divide by zero does * clmadd (maybe) * cltmadd twin for fft (maybe) cladd is xor, so not needed bug #785: GF(2^n) ops: * gfbredpoly # sets reducing polynomial SPR `GFBREDPOLY` unless this is an immediate op, mtspr is completely sufficient. * MSR bit to be reserved to indicate Poly SPR in use (avoids need to save Poly SPR on contextswitch) * gfbmul RT, RA, RB * gfbmadd RT, RA, RB, RC * gfbtmadd twin for fft * gfbinv rt, ra input 0 gives result 0 even though that is division by zero, that's what's needed for AES. bug #786: GF(p) ops: probably should be different SPR than for GF(2^n) * gfpadd, gfpsub * gfpmul * gfpinv (decide what div by 0 does) * gfpmadd, gfpmsub, gfpmsubr (sub reversed) * gfpmaddsubr (for fft) https://libre-soc.org/openpower/sv/gf2.py/

I could be wrong, but from what I know of how Galois fields work, the degree parameter is entirely derived from the reducing polynomial: if the reducing polynomial is the AES one: x^8 + x^4 + x^3 + x + 1 or, in binary: 100011011 in hex: 0x11B So, here, the degree=8 since that's the largest exponent in the polynomial. In my opinion, we should just have the reducing polynomial as an input or immediate, and derive the degree from it, rather than having degree being a separate input/immediate and wasting encoding space. Alternatively, we could have the degree fixed to XLEN, and just have the reducing polynomial specify the bits below the MSB, with the MSB assumed to always be one. so, the AES polynomial would be specified as 0x1B with the top bit assumed to be a 1 since it would only be used with XLEN=8. In either case, having both degree and reducing polynomials as inputs as we do currently seems quite suboptimal. If this sounds good, I'll make the changes needed to have the reducing polynomial as an input, and degree derived from it.

(In reply to Jacob Lifshay from comment #1) > I could be wrong, but from what I know of how Galois fields work, the degree > parameter is entirely derived from the reducing polynomial: iirc some implementations have an implicit top bit removed similar to how > If this sounds good, I'll make the changes needed to have the reducing > polynomial as an input, and degree derived from it. doesn't feel right. i definitely need the full parameter space similar to FMAC, a full 4-operand MAC, for the SVP64 NTT to work. (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier Transform just using GF arithmetic) https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform note in that page that: The modulus in a Number Theoretic Transfer doesn't necessarily have to be prime. this tends to point towards using SPRs.

(In reply to Luke Kenneth Casson Leighton from comment #2) > (In reply to Jacob Lifshay from comment #1) > > I could be wrong, but from what I know of how Galois fields work, the degree > > parameter is entirely derived from the reducing polynomial: > > iirc some implementations have an implicit top bit removed similar > to how > > > If this sounds good, I'll make the changes needed to have the reducing > > polynomial as an input, and degree derived from it. > > doesn't feel right. > > i definitely need the full parameter space similar to FMAC, a full > 4-operand MAC, for the SVP64 NTT to work. then we need a gfmadd instruction. we probably also want a gfdiv instruction, but that may be too complex. also, gfadd is just xor iirc...so maybe we don't need that one. > > (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier > Transform just using GF arithmetic) > > https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform > > note in that page that: > > The modulus in a Number Theoretic Transfer doesn't necessarily have to > be prime. yup, here, I'm assuming the modulus (which ends up being the number of values in the field (not to be confused with the reducing polynomial, which is a different kind of modulus)) is 2^deg(reducing-polynomial). also, iirc it's Number Theoretic *Transform* (like Fourier Transform). > this tends to point towards using SPRs. ok, sounds good! what do you think of just using CTR? though, otoh if we had a dedicated SPR we could precalculate stuff (degree, maybe some implementations want to precalculate some tables) when the SPR was written. In order to support GF(2^XLEN) as well as smaller fields, I think we should pick an encoding for the reducing polynomial and degree: def decode(v, XLEN): """returns the decoded coefficient list in LSB to MSB order, len(retval) == degree + 1""" v &= ((1 << XLEN) - 1) # mask to XLEN bits if v == 0 or v == 2: # GF(2) return [0, 1] # degree = 1, poly = x if v & 1: degree = floor_log2(v) else: # all reducing polynomials of degree > 1 must have the LSB set, # because they must be irreducible polynomials (meaning they # can't be factored), if the LSB was clear, then they would # have x as a factor. Therefore, we can reuse the LSB clear # to instead mean the polynomial has degree XLEN. degree = XLEN v |= 1 << XLEN v |= 1 # LSB must be set return [(v >> i) & 1 for i in range(1 + degree)]

also, for naming, we'd probably want to name these instructions gfbmadd, gfbmul, gfbinv, etc. (Galois Field (binary) ...) because they're specifically for the binary fields GF(2^n). We may later want to add support for odd-prime fields GF(p), since they're also used by some forms of elliptic curve cryptography: * gfpadd -- Galois Field (prime) addition * gfpsub -- Galois Field (prime) subtraction * gfpmul * gfpmadd * gfpmsub we probably won't want division or inverse cuz iirc those are crazy complex.

(In reply to Jacob Lifshay from comment #4) > we probably won't want division or inverse cuz iirc those are crazy complex. modular inverse algorithm: https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers

(In reply to Jacob Lifshay from comment #3) > then we need a gfmadd instruction. and a twin version. > we probably also want a gfdiv instruction, but that may be too complex. i knoow, aren't they fun? best way is inverse then mul. galileo something inverse. https://libre-soc.org/openpower/isa/svfparith/ twin butterfly. > also, gfadd is just xor iirc...so maybe we don't need that one. yes because on exceeding polynomial you need a cleanup. as long as both inputs are below polynomial the cleanup is i believe a simple check. > > > > (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier > > Transform just using GF arithmetic) > > > > https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform > > > > note in that page that: > > > > The modulus in a Number Theoretic Transfer doesn't necessarily have to > > be prime. > > yup, here, I'm assuming the modulus (which ends up being the number of > values in the field (not to be confused with the reducing polynomial, which > is a different kind of modulus)) is 2^deg(reducing-polynomial). do they mean e.g. Gf 2^5 > also, iirc it's Number Theoretic *Transform* (like Fourier Transform). doh. how did i end up typing transfer?? > > this tends to point towards using SPRs. > > ok, sounds good! > > what do you think of just using CTR? see sv.bc for how much functionality would be lost. sv.bc is stunningly powerful, a "simple" bc gets something mad like 128 additional options half of which rely on CTR. > # all reducing polynomials of degree > 1 must have the LSB set, > # because they must be irreducible polynomials (meaning they > # can't be factored), take a look again at the wikipedia NTT talk page apparently nonprime polynomials are actually useful, and LSB=0 is nonprime. oh right GF 2^5 i.e.the degree. can you doublexheck to see if there are any circumstances where a polyynomial with LSB=0 is useful or not?

(In reply to Jacob Lifshay from comment #4) > we probably won't want division or inverse cuz iirc those are crazy complex. isn't it beautifully completely nuts? :) i tracked down a simple version based on long division, see bitmanip page section "gf invert", it looks half decent, reminds me of the arithmetic div pipeline and of CORDIC. the ultimate goal here, the big test, is to do NTT in around 11 instructions similar to DCT. the entire main NTT triple loop with SV REMAP, on top of the appropriate twin butterfly ops. (In reply to Jacob Lifshay from comment #5) > modular inverse algorithm: > https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers euclidean not galileo :) see gf invert section you will likely find it is that one. div can then be invert followed by mul. no accuracy is lost because both are ring fields. when prime polynomial.

Please don't edit old comments...I didn't notice till now that you had changed what you wrote... (In reply to Luke Kenneth Casson Leighton from comment #6) > (In reply to Jacob Lifshay from comment #3) > > > then we need a gfmadd instruction. > > and a twin version. If you mean something that can use twin predication, I'm not sure how we'd do that, since basically all these ops have multiple inputs, iirc thereby using up the space in the SVP64 encoding that could have been used for twin predication. > > > we probably also want a gfdiv instruction, but that may be too complex. > > i knoow, aren't they fun? best way is inverse then mul. > galileo something inverse. > > https://libre-soc.org/openpower/isa/svfparith/ > > twin butterfly. > > > also, gfadd is just xor iirc...so maybe we don't need that one. > > yes because on exceeding polynomial you need a cleanup. > as long as both inputs are below polynomial the cleanup > is i believe a simple check. I'd just expect the inputs to nearly always be in-range (100% of the time in >99% of programs), meaning that xor is completely sufficient. I don't think this justifies a separate instruction. If we need an instruction to change some input to fit in range (cleanup), we can just do gfmadd with both multiplicands be constant zeros (RA|0), which the decoder can easily decode specially to skip the multiplication if we really need the extra speed. > > > > > > (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier > > > Transform just using GF arithmetic) > > > > > > https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform > > > > > > note in that page that: > > > > > > The modulus in a Number Theoretic Transfer doesn't necessarily have to > > > be prime. > > > > yup, here, I'm assuming the modulus (which ends up being the number of > > values in the field (not to be confused with the reducing polynomial, which > > is a different kind of modulus)) is 2^deg(reducing-polynomial). > > do they mean e.g. Gf 2^5 yes, when you have any degree 5 reducing polynomial. If you have a polynomial that is some other degree, such as 6, then the field described is some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc. > > what do you think of just using CTR? > > see sv.bc for how much functionality would be lost. > sv.bc is stunningly powerful, a "simple" bc gets something > mad like 128 additional options half of which rely on CTR. yeah, I wasn't thinking that through. CTR is too important for looping. > > > > # all reducing polynomials of degree > 1 must have the LSB set, > > # because they must be irreducible polynomials (meaning they > > # can't be factored), > > > take a look again at the wikipedia NTT talk page apparently nonprime > polynomials are actually useful, and LSB=0 is nonprime. as you already probably figured out, the non-prime part is the size of the field (so for GF(27), the field has size 27), not the reducing polynomial. AFAIK the reducing polynomial always > > oh right GF 2^5 i.e.the degree. > > can you doublexheck to see if there are any circumstances where > a polyynomial with LSB=0 is useful or not? there's one, when you're using GF(p) for some prime p. in this case, the only applicable one is GF(2), which I already have the special case for. The polynomial must always be irreducible (the polynomial equivalent of being a prime number), otherwise you don't get a Galois Field (you get a Ring instead), which means that the LSB must be non-zero for degree > 2 polynomials, otherwise it would be divisible by the polynomial x^1+0, which would mean that your polynomial wasn't irreducible after all, making your nice Galois Field no longer a Field, essentially breaking the math. Think of it as like dividing by zero.

(In reply to Luke Kenneth Casson Leighton from comment #7) > (In reply to Jacob Lifshay from comment #4) > > > we probably won't want division or inverse cuz iirc those are crazy complex. For GF(p) for prime p (we can assume that p can get quite large, close to 2^64), a modular inversion essentially a loop over integer divmod and multiplication, divmod itself is already kinda a loop. Then we get the outer loops from SV. We're going to get a like 5 or 6-deep loop all running from 1 instruction, which imho is kinda overboard. That sounds like CISC, but even more so -- super CISC. Next we'll be adding a javascript parser as 1 instruction...:P > > isn't it beautifully completely nuts? :) i tracked down a simple version > based on long division, see bitmanip page section "gf invert", it looks > half decent, reminds me of the arithmetic div pipeline and of CORDIC. yeah, that's for binary GF. prime GF is a whole other beast. > the ultimate goal here, the big test, is to do NTT in around 11 instructions > similar to DCT. the entire main NTT triple loop with SV REMAP, on top of > the appropriate twin butterfly ops. well, iirc NTT is only gfmadd, no gfdiv for prime GF needed there. > > (In reply to Jacob Lifshay from comment #5) > > > modular inverse algorithm: > > https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers > > euclidean not galileo :) > > see gf invert section you will likely find it is that one. > > div can then be invert followed by mul. no accuracy is lost because > both are ring fields. when prime polynomial. yeah, for binary GF, gfinv or gfdiv is totally feasible, if a bit slow.

(In reply to Jacob Lifshay from comment #8) > Please don't edit old comments...I didn't notice till now that you had > changed what you wrote... > > (In reply to Luke Kenneth Casson Leighton from comment #6) > > (In reply to Jacob Lifshay from comment #3) > > > > > then we need a gfmadd instruction. > > > > and a twin version. > > If you mean something that can use twin predication, nono, not predication, two ops in one instruction. look at the pseudocode https://libre-soc.org/openpower/isa/svfparith/ the reason why i was able to do inplace FFT is the twin butterfly ops. otherwise you need a temp reg and two instructions and obviously that doesnt fit with Horizontal SVP64. > > yes because on exceeding polynomial you need a cleanup. > > as long as both inputs are below polynomial the cleanup > > is i believe a simple check. > > I'd just expect the inputs to nearly always be in-range (100% of the time in > >99% of programs), meaning that xor is completely sufficient. I don't think > this justifies a separate instruction. mrhhm mumble mumble we need to be 100% certain. > > do they mean e.g. Gf 2^5 > > yes, when you have any degree 5 reducing polynomial. If you have a > polynomial that is some other degree, such as 6, then the field described is > some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc. ok excellent. i knew that, but sometimes reading wikipedia talk they help you lose track of the facts rather than give you insights :) > The polynomial must always be irreducible (the polynomial equivalent of > being a prime number), otherwise you don't get a Galois Field (you get a > Ring instead), which means that the LSB must be non-zero for degree > 2 > polynomials, otherwise it would be divisible by the polynomial x^1+0, which > would mean that your polynomial wasn't irreducible after all, making your > nice Galois Field no longer a Field, essentially breaking the math. Think of > it as like dividing by zero. eep. bad. and also fascinating. ok then i like the encoding. (In reply to Jacob Lifshay from comment #9) > > isn't it beautifully completely nuts? :) i tracked down a simple version > > based on long division, see bitmanip page section "gf invert", it looks > > half decent, reminds me of the arithmetic div pipeline and of CORDIC. > > yeah, that's for binary GF. prime GF is a whole other beast. ok i get the message i need to look this up. if you know of some simple easy to understand implementation do put it in the wiki. give me a day or a few to read up on it, any papers/academic tutorials you know of do let me know.

correct me if wrong i believe it quite simple to show XOR insufficient polynomial 0b110001 x 0b110000 y 0b001111 XOR 0b111111 clearly this is greater than the polynomial which modulo you are supposed to XOR and instead get the correct answer: polynomial 0b110001 x 0b110000 y 0b001111 XOR 0b111111 this greater than polynomial subtract p 0b110001 use XOR to subtract answer 0b001110 so i think you'll find that we can get away with: * assumption that both RA and RB are less than polynomial * do XOR * compare (arithmetic cmp) with polynomial * if greater, XOR with polynomial no huge iteration needed, just one extra XOR *as long as* both RA and RB are 0 <= RA/RB < polynomial

(In reply to Luke Kenneth Casson Leighton from comment #10) > (In reply to Jacob Lifshay from comment #8) > > Please don't edit old comments...I didn't notice till now that you had > > changed what you wrote... > > > > (In reply to Luke Kenneth Casson Leighton from comment #6) > > > (In reply to Jacob Lifshay from comment #3) > > > > > > > then we need a gfmadd instruction. > > > > > > and a twin version. > > > > If you mean something that can use twin predication, > > nono, not predication, two ops in one instruction. look at the pseudocode > https://libre-soc.org/openpower/isa/svfparith/ will look in the morning...I thought that was just for complex numbers tho. GF and NTT both operate on modular integers/polynomials rather than complex numbers, so shouldn't need a separate element for real/imag components. > > the reason why i was able to do inplace FFT is the twin butterfly ops. > otherwise you need a temp reg and two instructions and obviously that > doesnt fit with Horizontal SVP64. > > > > yes because on exceeding polynomial you need a cleanup. > > > as long as both inputs are below polynomial the cleanup > > > is i believe a simple check. > > > > I'd just expect the inputs to nearly always be in-range (100% of the time in > > >99% of programs), meaning that xor is completely sufficient. I don't think > > this justifies a separate instruction. > > mrhhm mumble mumble we need to be 100% certain. > > > > do they mean e.g. Gf 2^5 > > > > yes, when you have any degree 5 reducing polynomial. If you have a > > polynomial that is some other degree, such as 6, then the field described is > > some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc. > > ok excellent. i knew that, but sometimes reading wikipedia talk they help > you lose track of the facts rather than give you insights :) yay :) > > The polynomial must always be irreducible (the polynomial equivalent of > > being a prime number), otherwise you don't get a Galois Field (you get a > > Ring instead), which means that the LSB must be non-zero for degree > 2 > > polynomials, otherwise it would be divisible by the polynomial x^1+0, which > > would mean that your polynomial wasn't irreducible after all, making your > > nice Galois Field no longer a Field, essentially breaking the math. Think of > > it as like dividing by zero. > > eep. bad. > > and also fascinating. > > ok then i like the encoding. :) > > (In reply to Jacob Lifshay from comment #9) > > > > isn't it beautifully completely nuts? :) i tracked down a simple version > > > based on long division, see bitmanip page section "gf invert", it looks > > > half decent, reminds me of the arithmetic div pipeline and of CORDIC. > > > > yeah, that's for binary GF. prime GF is a whole other beast. > > ok i get the message i need to look this up. if you know of some simple > easy to understand implementation do put it in the wiki. > > give me a day or a few to read up on it, any papers/academic tutorials > you know of do let me know. well, it took me like 2-3mo to finally let it sink in...that was when I was trying to create an algebraic number library for use in the FP math reference implementation a few years ago. Turns out polynomials of modular integers show up a bunch when you want to efficiently factor polynomials with integer coefficients. Unfortunately, all I have at the moment is Wikipedia. Check out the elliptic curve page too, that's where I got the terms prime/binary GF.

(In reply to Luke Kenneth Casson Leighton from comment #11) > correct me if wrong i believe it quite simple to show XOR insufficient > > polynomial 0b110001 > x 0b110000 > y 0b001111 > XOR 0b111111 > clearly this is greater than the polynomial which modulo you are > supposed to XOR and instead get the correct answer: well, that's because you started with something bigger you should have...remember, these are polynomials, so any bit set means you divide it out. > polynomial 0b110001 > x 0b110000 here x would have to be reduced to something without the top bit set, only after reducing it is it a value you'd actually expect to have as an input. I'm not working out what it'd reduce to at the moment, it's 4:30am here.

https://stackoverflow.com/questions/25305909/inverse-within-a-finite-field https://en.m.wikipedia.org/wiki/Modular_multiplicative_inverse#Computation ah HA! https://en.m.wikibooks.org/wiki/Algorithm_Implementation/Mathematics/Extended_Euclidean_algorithm

(In reply to Jacob Lifshay from comment #13) > here x would have to be reduced to something without the top bit set, only > after reducing it is it a value you'd actually expect to have as an input. > I'm not working out what it'd reduce to at the moment, it's 4:30am here. brainmelt :) when not 4:30am, is there any reason to support binary? i mean if prime is just XOR then pffh.

(In reply to Jacob Lifshay from comment #12) > (In reply to Luke Kenneth Casson Leighton from comment #10) > > nono, not predication, two ops in one instruction. look at the pseudocode > > https://libre-soc.org/openpower/isa/svfparith/ > > will look in the morning...I thought that was just for complex numbers tho. > GF and NTT both operate on modular integers/polynomials rather than complex > numbers, so shouldn't need a separate element for real/imag components. ahh, now I get what you mean! NTT inner loop: https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/sympy/discrete/transforms.py#L173 h = 2 while h <= n: hf, ut = h // 2, n // h for i in range(0, n, h): for j in range(hf): u, v = a[i + j], a[i + j + hf]*w[ut * j] a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p h *= 2

(In reply to Jacob Lifshay from comment #16) > NTT inner loop: > https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/ > sympy/discrete/transforms.py#L173 > > h = 2 > while h <= n: > hf, ut = h // 2, n // h > for i in range(0, n, h): > for j in range(hf): > u, v = a[i + j], a[i + j + hf]*w[ut * j] > a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p > h *= 2 looks like gfpmuladdsub

(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Jacob Lifshay from comment #13) > > > here x would have to be reduced to something without the top bit set, only > > after reducing it is it a value you'd actually expect to have as an input. > > I'm not working out what it'd reduce to at the moment, it's 4:30am here. > > brainmelt :) > > when not 4:30am, is there any reason to support binary? i mean if prime is > just XOR then pffh. you got them mixed up. binary gf add is xor. prime gf add is integer addition followed by subtracting the prime if the sum is too big.

(In reply to Jacob Lifshay from comment #17) > (In reply to Jacob Lifshay from comment #16) > > NTT inner loop: > > https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/ > > sympy/discrete/transforms.py#L173 > > > > h = 2 > > while h <= n: > > hf, ut = h // 2, n // h > > for i in range(0, n, h): > > for j in range(hf): > > u, v = a[i + j], a[i + j + hf]*w[ut * j] > > a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p > > h *= 2 nice find. i wish i had been able to track that down as quickly, it took me 2-3 weeks to derive that same algorithm from project nayuki recursive version, sigh https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD so, the twin thingy means no need to have a temp var, because it is a butterfly swap on a[i+j] and a[i+j+hf] what i did in SV.REMAP, was, create 3 separate "schedules": * one for i+j * one for i+j+hf * one for... where is it... ah: ut+j > looks like gfpmuladdsub yes, basically you set up gfpmuladdsub(RT, RA, RB) and apply *two* SV RM destinations to RT (one for the add, one for the sub), then hit: * RTv1 with an i+j schedule * RTv2 with an i+j+hf schedule * RB which is the multiply coefficient array w[] with ut*j have to look again at the fft svp64 example because i am missing something there but basically that entire triple for-loop, the while plus i plus j, all "disappear" into one single application of *THREE* SV.REMAPs on the one single sv.gfpmuladdsub instruction. kinda stunningly ridiculously elegant and overkill at the same time :) but, joking aside, it means you can do the entire in-place NTT (or DFT) triple loop in something mad like the equivalent space of 5 32 bit instructions. remap_fft_yield.py shows the principle.

(In reply to Jacob Lifshay from comment #18) > you got them mixed up. binary gf add is xor. prime gf add is integer > addition followed by subtracting the prime if the sum is too big. i'm glad you're on this, now you see why it takes me so long, i constantly get things inverted.

found it https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;h=5ea7fcc89c8102b7a0ca34403fefb9e15f40eb2c;hb=d1b415e4383366cf445fd4ff2db828a612f88099#l140 twin butterfly, of Cooley-Tukey, we want a GF version. you can use the DFT example as pretty much cut/paste for everything, almost. it is *3* input, *2* output, see ffmadds, where the SV RM entry has 2 RT destinations hence you get 5 actual operands where there is only space for 4 in the opcode. if really really really pushed for space it can be made a 3 operand "overwrite" (with 5 offsets) so actually: sv.gfpmuladdsub RS, RA, RB but: * RS gets a "src1" offset (this is the add) * RA gets a "src2" offset (this is the sub) * RB gets a "src3" offset (this is the w coefficient) * RS and RA *ALSO* get *dest* offsets where again if pushed src1 offset can be the same as dst1 and src2 can be the same as dest2 offset. do not worry about the "efficiency" of 3in 2out operations, it is more important to have RS RA RB all as "in-flight", using Reservation Stations to avoid the use of a temp register.

(In reply to Luke Kenneth Casson Leighton from comment #19) > (In reply to Jacob Lifshay from comment #17) > > (In reply to Jacob Lifshay from comment #16) > > > NTT inner loop: > > > https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/sympy/discrete/transforms.py#L173 > > > > > > h = 2 > > > while h <= n: > > > hf, ut = h // 2, n // h > > > for i in range(0, n, h): > > > for j in range(hf): > > > u, v = a[i + j], a[i + j + hf]*w[ut * j] > > > a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p > > > h *= 2 > > nice find. i wish i had been able to track that down as quickly, > it took me 2-3 weeks to derive that same algorithm from project > nayuki recursive version, sigh it was one of the first few things i found on google... > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD > > so, the twin thingy means no need to have a temp var, > because it is a butterfly swap on a[i+j] and a[i+j+hf] > > what i did in SV.REMAP, was, create 3 separate "schedules": > > * one for i+j > * one for i+j+hf > * one for... where is it... ah: ut+j > > > looks like gfpmuladdsub > > yes, basically you set up gfpmuladdsub(RT, RA, RB) > and apply *two* SV RM destinations to RT (one for the add, one > for the sub), then hit: hmm, I think we should instead defined the scalar gfpmuladdsub to write to two separate registers, like how ldu works, rather than having to use svp64 to disambiguate the two destinations (which requires using svp64 for the instruction to be useful since the scalar instruction writes twice to the same register). Also, naming wise, that subtraction is term - product, which is the reverse of normal muladd, so i think we should call it gfpmuladdsubr rather than gfpmuladdsub. I propose: gfpmuladdsubr RT, RA, RB, RC Pseudocode: factor1 = (RA) factor2 = (RB) term = (RC) prime = GFPRIME # the gfp* prime modulus SPR (RA) = (term - factor1 * factor2) % prime (RT) = (term + factor1 * factor2) % prime This allows SVP64 REMAP to have: * `a[i + j]` be RT and RC * `a[i + j + hf]` be RA * `w[ut * j]` be RB Inner loop body for reference: > u, v = a[i + j], a[i + j + hf]*w[ut * j] > a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p I picked gfpmuladdsubr overwriting one of the factors (RA) because that makes it more flexible since there's another identical input (RB) if you need your factor to not be overwritten, whereas you can't do that with the term input (RC) cuz there's only one of them, so I had that result go to RT instead of overwriting the term input.

(In reply to Jacob Lifshay from comment #22) > hmm, I think we should instead defined the scalar gfpmuladdsub to write to > two separate registers, already ahead of you, ffmaddsub scalar targets RT and RT+1 the vector non-REMAP version targets RT+srcoffs,RT+VL+srcoffs (i am dragging much of this out of longterm memory) when i said this needs to fit exactly with what has already been done i really meant it. everything is there, it was 8+ intensive weeks developing REMAP FFT and we need modifying it like a hole in the head. there is more, in terms of Horizontal First Mode which shaped some of the register name conventions, i just can't recall them immediately but it allowed 5 regs to fit 3 SHAPES and use scalar temps and 2 vector ops all without changing the REMAPs inside the critical loop. avoiding changing that by arbitrary register allocation is a high priority. do spend some time looking closely at the ffmaddsub and fft remap unit tests and run them.

(In reply to Jacob Lifshay from comment #22) > I picked gfpmuladdsubr overwriting one of the factors (RA) because that > makes it more flexible since there's another identical input (RB) if you > need your factor to not be overwritten, whereas you can't do that with the > term input (RC) cuz there's only one of them, so I had that result go to RT > instead of overwriting the term input. to reiterate: you really, *really* do not want to modify the FFT REMAP algorithm. as in, really. don't. do it. that means working out which reg was the coefficient, which the add, which the sub, from the ffmaddsub instruction and copying it verbatim but moving to GF. yes it really was that much of a deep dive. there is literally no difference in the schedules between FFT and NTT so there is absolutely no technical reason to modify the FFT REMAP system, either.

https://www.doc.ic.ac.uk/~mrh/330tutor/ch04s04.html jacob when you said "binary polynomials" were you referring to GF(2^m)? if so, can we try to keep the discussion to GF(2^m) because otherwise my brain will melt.

(In reply to Jacob Lifshay from comment #3) > def decode(v, XLEN): > """returns the decoded coefficient list in LSB to MSB order, > len(retval) == degree + 1""" so just to be clear: v is the modulus? and the degree is implicitly calculated from it? if so one instruction can set the modulus and degree into a (new) SPR and that in turn frees up a boatload of operand space, certainly enough to make a 4-operand gfmuladd and also gftwinmuladdsub and basically have exactly the same operand types and number of operands as standard arithmetic.

https://libre-soc.org/openpower/isa/svfparith/ Floating Multiply-Add FFT [Single] A-Form ffmadds FRT,FRA,FRC,FRB (Rc=0) ffmadds. FRT,FRA,FRC,FRB (Rc=1) Pseudo-code: FRT <- FPMULADD32(FRA, FRC, FRB, 1, 1) FRS <- FPMULADD32(FRA, FRC, FRB, -1, 1) note **FRS** there are *five* operands only *four* listed in the mnemonic. FRS is implicitly calculated * scalar: FRS=FRT+1 * vector: FRS=FRT+svstep+VL * v/REMAP: FRS=FRT+REMAP(svstep) the gfmuladdsub needs to be the same form. anything else requires such extensive interference with SV REMAP it would be prohibitively disruptive for absolutely no good reason whatsoever.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/gf2.py;hb=HEAD i dropped the 2 functions gf2MUL and gf2divmod into a file to play with and they work well. gf_invert however did not, it needs investigating after i made a couple of changes. with some though gf2MUL can be seen to be standard long multiplication in binary (XOR instead of the normal add) with standard modulo as well, even that is analogous.

here is gcd: 59 def xgcd(a, b): 60 """return (g, x, y) such that a*x + b*y = g = gcd(a, b)""" 61 x0, x1, y0, y1 = 0, 1, 1, 0 62 while a != 0: 63 (q, a), b = divmodGF2(b, a), a 64 y0, y1 = y1, y0 ^ multGF2(q , y1) 65 x0, x1 = x1, x0 ^ multGF2(q , x1) 66 return b, x0, y0 it works, but the complexity is high (3 higher functions in a loop, and 3 return results) so is perfect for doing in assembler, with divmodgf and multgf as opcodes. not recommending putting this one in as instruction, nor gf_invert which uses gcd

since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the multiplicative identity) for all GF(2^n) values `v`. the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc are all integer div, mul, etc, not GF(n) operations. afaict you confused it again.

(In reply to Jacob Lifshay from comment #30) > since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the > multiplicative identity) for all GF(2^n) values `v`. > > the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc > are all integer div, mul, etc, not GF(n) operations. afaict you confused it > again. ahh, gf(2^n) gfdiv can be implemented a gfmul combined with the gcd algorithm where the div and mul and add ops in the gcd algorithm are *not* GF(2^n) ops, they are instead operating on polynomials of GF(2). you can also use gfmul to compute gfdiv(a, b) in gf(2^n) is gfmul(a, gfpow(b, 2^n - 1)), where gfpow can be implemented using the standard squaring algorithm, giving runtime O(n * gfmul_time).

also, one important point worth considering: for elliptic curve cryptography, it may use GF(2^n) where n is much larger than what fits in a single register -- several hundred. we are likely to have the same issue with GF(p) where p is much larger than 2^64.

I found a paper that references a VLSI-optimized version of gfinv: https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf (Algorithm 3, from [7]) GF(2^m) inversion. G(x) is the reducing polynomial. m is the degree of G(x). A(x) is the input polynomial. r[m] is the mth coefficient of R(x). s[m] is the mth coefficient of S(x). Algorithm 3: S(x) := G(x); V(x) := 0; R(x) := A(x); U(x) := 1; δ := 0; for i = 1 to 2*m do if r[m] = 0 then R(x) := x × R(x); # shift left by 1 U(x) := x × U(x); # shift left by 1 δ = δ + 1; # integer increment else if s[m] = 1 then S(x) := S(x) − R(x); # polynomial subtraction aka. xor V(x) := V(x) − U(x); # polynomial subtraction aka. xor end if S(x) := x × S(x); # shift left by 1 if δ = 0 then {R(x) := S(x), S(x) := R(x)}; # swap S(x) and R(x) # shift V(x) left by one, then swap V(x) and U(x) {U(x) := x × V(x), V(x) := U(x)}; δ := 1; else U(x) := U/x; # shift right 1 δ := δ − 1; end if end if end for output U(x) as the result. (U(x) = A^−1(x))

(In reply to Jacob Lifshay from comment #31) > (In reply to Jacob Lifshay from comment #30) > > since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the > > multiplicative identity) for all GF(2^n) values `v`. > > > > the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc > > are all integer div, mul, etc, not GF(n) operations. afaict you confused it > > again. almost certainly :) much of what i'm doing is "hunt-the-algorithm" (guess-work), then running unit tests to compensate. run that loop iteratively, fast enough, and it's indistinguishable from knowing what the hell you're doing in the first place :) > you can also use gfmul to compute gfdiv(a, b) in gf(2^n) is gfmul(a, > gfpow(b, 2^n - 1)), where gfpow can be implemented using the standard > squaring algorithm, giving runtime O(n * gfmul_time). i remember this. it's one of the really important features of GF, that if you multiply enough times you get back to the original. (In reply to Jacob Lifshay from comment #33) > I found a paper that references a VLSI-optimized version of gfinv: > https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/ > http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf > (Algorithm 3, from [7]) nice. i'm glad you can read it! i dropped a copy here https://ftp.libre-soc.org/ARITH18_Kobayashi.pdf > S(x) := S(x) − R(x); # polynomial subtraction aka. xor > V(x) := V(x) − U(x); # polynomial subtraction aka. xor v. helpful, the comments. > (U(x) = A^−1(x)) what's this one? A to the power of "-1(x)" ? oh wait, that's just a declaration, "U *is* the inverse of A". got it. (In reply to Jacob Lifshay from comment #32) > also, one important point worth considering: for elliptic curve > cryptography, it may use GF(2^n) where n is much larger than what fits in a > single register -- several hundred. we are likely to have the same issue > with GF(p) where p is much larger than 2^64. yes, i'm not sure what's the best option there. oh wow, apparently doing an FFT is one option! holy cow https://github.com/popcornell/pyGF2/blob/master/pyGF2/gf2_mul.py https://github.com/popcornell/pyGF2/blob/master/pyGF2/gf2_inv.py *splutter* hey guess what? we have hardware-assisted FFT.... :)

(In reply to Luke Kenneth Casson Leighton from comment #34) > (In reply to Jacob Lifshay from comment #33) > > I found a paper that references a VLSI-optimized version of gfinv: > > https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/ > > http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf > > (Algorithm 3, from [7]) > > nice. i'm glad you can read it! > i dropped a copy here https://ftp.libre-soc.org/ARITH18_Kobayashi.pdf Ok. since it's archive.org, i expect the url to work well into the future. that's why I picked the archive.org url over some other options. > > > S(x) := S(x) − R(x); # polynomial subtraction aka. xor > > V(x) := V(x) − U(x); # polynomial subtraction aka. xor > > v. helpful, the comments. yeah, i added the comments, i figured it would help with your algorithm dyslexia :)

[edit: trailing not leading] (In reply to Jacob Lifshay from comment #35) > yeah, i added the comments, i figured it would help with your algorithm > dyslexia :) yeah very much, thank you, was really straightforward. amazingly this seems to get the right answer: https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0a9f45f2615f35902c4783bcdd07ab6151db841d any change to "x" as long as it is less than 0x11b seems to produce the correct value for y1 (bottom of the test), but i had to hunt through a lot of random values of y to find one that produced the right answer (again, independent of changes to x) no idea why, but hey i guessed that "reducing polynomial" means the global "polyred", and that from the for-loop "1 to 2*m" that this meant that m is the degree, which then allowed me to use "if s & mask1" which is testing the MSB looking closely at the algorithm, now, i think we can use "count-trailing-1s" to skip ahead a number of loops: i = 1 while i < 2*degree+1: jump = count_trailing_1s(r) if jump == 0: ... ... else: r <<= jump # jump left u <<= jump # jump left j += jump that would save on iterations, similar to how you can jump ahead in long-division by detecting runs of zeros.

this stuff was already addressed on irc, but I'm restating here to make this bug's comments more useful. (In reply to Luke Kenneth Casson Leighton from comment #36) > i guessed that "reducing polynomial" means the global "polyred", no, it means polyred but with the 2^degree bit re-added. lkcl already fixed this. > and that from the for-loop "1 to 2*m" that this meant that m > is the degree, which then allowed me to use "if s & mask1" > which is testing the MSB yes, that's correct, though i probably would have spelled out the arithmetic rather than relying on globals. > looking closely at the algorithm, now, i think we can use > "count-trailing-1s" to skip ahead a number of loops: > ... > that would save on iterations, similar to how you can jump ahead > in long-division by detecting runs of zeros. we could...but i'm inclined to instead have a pipeline... 1 stage per iteration. then we could merge stages together since the mux & xor should be fast enough. We could probably get 5-6 stages per merged stage, giving us 3-4 cycles for GF(2^8), 6-7 cycles for GF(2^16), 11-13 cycles for GF(2^32), and 22-26 cycles for GF(2^64)

(In reply to Jacob Lifshay from comment #37) > yes, that's correct, though i probably would have spelled out the arithmetic > rather than relying on globals. that came from the programming style used by the gfMUL function which i originally found online, but if you think about it, it kinda reflects the concept we had in mind to have an SPR for the modulo. i've often found that converting python code to classes, so as to avoid considering the *module* as the encapsulation of the concept, to result in unnecessary complexity and loss of clarity. given that the SPR for the modulo would be fully global in nature, polyred as a global funnily enough more accurately represents the real-world scenario as to how the assembly instructions would be used in the ISA, giving us a "feel" for how things would go.

after thinking about a bit, I think gfmul's algorithm can be partially merged with the gfinv algorithm I posted in comment 33, giving us a gfdiv algorithm that takes less time than gfinv followed by gfmul. Let's assume we're calculating gfdiv(N(x), A(x)) (A in the denominator in order to match the gfinv algorithm). This is because the U(x) and V(x) polynomials are never used as inputs in the gfinv algorithm, and the algorithm has some nice additional properties, so we can change them to clmul(N(x), U(x)) and clmul(N(x), V(x)) throughout the algorithm, that way the algorithm will return clmul(N(x), gfinv(A(x))) which will be denoted by T(x). T(x), when reduced by G(x) (the reducing polynomial) produces the result of gfdiv(N(x), A(x)).

(In reply to Jacob Lifshay from comment #39) > after thinking about a bit, I think gfmul's algorithm can be partially > merged with the gfinv algorithm I posted in comment 33, giving us a gfdiv > algorithm that takes less time than gfinv followed by gfmul. neat. feel free, but keep the originals around (in the same file, gf2.py) so that the evidence-trail (for review etc.) is around, and when it comes to unit-testing, the versions of the functions that have attribution / explanations can be used.

though, do we really need gfdiv? it's a lot of extra complexity and i can't think of any algorithms that need it...if there are algorithms, they can easily run gfmul on the result of gfinv, giving the equivalent of gfdiv. In any case, here's the list of binary gf ops I think we should implement: * clmul, clmulh (for gf operations that are > 64-bit, and other stuff) * gfbmul RT, RA, RB * gfbmuli RT, RA, imm # discard if we don't have enough encoding space * gfbmadd RT, RA, RB, RC * gfbinv rt, ra input 0 gives result 0 even though that is division by zero, that's what's needed for AES. gfbadd is covered by the existing xor instructions. gfbmaddsubr is not needed, since sub is identical to add, this is just two gfbmadd instructions. If that all sounds good, i can create a new tracking bug for binary gf instructions and start working on implementing them.

(In reply to Jacob Lifshay from comment #41) > though, do we really need gfdiv? i'm sure i've seen modulo (remainder) used somewhere. if producing that, might as well have div. > it's a lot of extra complexity how? https://libre-soc.org/openpower/sv/gf2.py/ that's what... 7 lines? > In any case, here's the list of binary gf ops I think we should implement: > > * clmul, clmulh (for gf operations that are > 64-bit, and other stuff) see wiki, clmulr is missing. > * gfbmul RT, RA, RB ack. is it worth doing a gfbmulh? looking at multGF2 it looks like it could return (res >> deg) & mask2 just as easily to give the top half. > * gfbmuli RT, RA, imm # discard if we don't have enough encoding space no chance. snowball, hell etc. typically imm takes an entire major opcode. mulli addi they have dedicated major opcodes. makes no sense except for less than degree 16, with gf jumping all over the place. if constructing consts then addi and addis can be used. on presenting to OPF ISA WG they really want a gfbmuli i see no reason why not, but it does take an entire major op. > * gfbmadd RT, RA, RB, RC ack. remember it has to match madd and fmadd for REMAP purposes. this is IMPORTANT. > * gfbinv rt, ra > input 0 gives result 0 even though that is division by zero, > that's what's needed for AES. ok great. > gfbadd is covered by the existing xor instructions. ack. > gfbmaddsubr is not needed, since sub is identical to add, this is just two > gfbmadd instructions. ack. gftwinmuladd is missing. btw see wiki page. please do read and refer to it, i get the impression you're not reading it regularly. > If that all sounds good, i can create a new tracking bug for binary gf > instructions and start working on implementing them. the basics (lowlevel modules) and bugreport, yes. btw do not consider pipeline designs to be "the peak epitomy of absolute performance", it is misleading. multiple simpler FSMs with multiple RSes can be more efficient and take advantage of early-out and jump-ahead opportunities that pipelines cannot. this makes them *less* efficient rather than more, particularly for long-running ops. [yes a FSM can be done which processes multiple chained combinatorial stages] RSes need to farm-in and fan-out to pipelines (which have to be BIG muxes), where FSMs can wire directly. each RS has its own FSM, each FSM stores its result in the RS, no massive fan-in/out. bottom line you need to think of the context in which the pipeline is used.

(In reply to Luke Kenneth Casson Leighton from comment #42) > (In reply to Jacob Lifshay from comment #41) > > though, do we really need gfdiv? > > i'm sure i've seen modulo (remainder) used somewhere. yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial remainder is used. that is *not* a gfbrem. if we were to have a gfbrem instruction, it would be kinda useless, always returning 0, since gfb is a field, meaning that, like complex numbers, every non-zero value has a multiplicative inverse. therefore, solving the equation for a remainder: n = q * d + r q = n / d # always produces exact results in any field n = (n / d) * d + r n = n + r 0 = r > if producing that, might as well have div. > > > it's a lot of extra complexity well, if we're using a pipeline for gfbinv (which we totally want to if we want AES to be usably fast without needing 7 zillion gfbinv FSMs), it's much harder to tack another several stages on the end of a pipeline (unless we want to not share it with gfmul), unless we want to do microcoding. > how? https://libre-soc.org/openpower/sv/gf2.py/ > that's what... 7 lines? > > > In any case, here's the list of binary gf ops I think we should implement: > > > > * clmul, clmulh (for gf operations that are > 64-bit, and other stuff) > > see wiki, clmulr is missing. ah, ok. yeah, we can add clmulr. > > > * gfbmul RT, RA, RB > > ack. is it worth doing a gfbmulh? looking at > multGF2 it looks like it could return (res >> deg) & mask2 > just as easily to give the top half. there is no top half. gfbmul returns the entire answer. if we tried to make a gfbmulh, it would always return 0 unless the reducing polynomial had a degree of more than 64 (64 is the max in the current design). > > > * gfbmuli RT, RA, imm # discard if we don't have enough encoding space > > no chance. snowball, hell etc. typically imm > takes an entire major opcode. mulli addi they > have dedicated major opcodes. makes no sense > except for less than degree 16, with gf jumping > all over the place. > > if constructing consts then addi and addis can be used. > on presenting to OPF ISA WG they really want a gfbmuli > i see no reason why not, but it does take an entire > major op. > > > * gfbmadd RT, RA, RB, RC > > ack. remember it has to match madd and fmadd for REMAP > purposes. this is IMPORTANT. > > > * gfbinv rt, ra > > input 0 gives result 0 even though that is division by zero, > > that's what's needed for AES. > > ok great. > > > gfbadd is covered by the existing xor instructions. > > ack. > > > gfbmaddsubr is not needed, since sub is identical to add, this is just two > > gfbmadd instructions. > > ack. > > gftwinmuladd is missing. > > btw see wiki page. please do read and > refer to it, i get the impression you're not reading it > regularly. yeah, because i have no easy way of figuring out if it has changed (automatic notifications would be nice!), and i don't like rereading stuff i've already read 5 times in the hope that you might have changed it meanwhile... git log is nearly useless since nearly everything has no useful commit message. > > > If that all sounds good, i can create a new tracking bug for binary gf > > instructions and start working on implementing them. > > the basics (lowlevel modules) and bugreport, yes. > > btw do not consider pipeline designs to be "the peak epitomy > of absolute performance", it is misleading. > > multiple simpler FSMs with multiple RSes can be more efficient > and take advantage of early-out and jump-ahead opportunities > that pipelines cannot. this makes them *less* efficient rather > than more, particularly for long-running ops. that depends...if the pipeline allows us to specialize stages to specific parts of the algorithm it can be faster (e.g. fpmul with one stage for unpacking/denormal-shifting, one stage for partial product computation and first part of carry-save adders, another stage of carry-save adders, final stage of carry-save adders and final carry-propagating add, stage rounding&packing&denormal handling) since each stage only needs to implement the parts it handles, making the stages run faster. if it were instead several totally separate FSMs for the same throughput, each FSM would need the logic for all stages of the algorithm, making the data path much more complex and possibly taking more time due to the complexity, reducing max clock speed. > > [yes a FSM can be done which processes multiple chained combinatorial > stages] > > RSes need to farm-in and fan-out to pipelines (which have > to be BIG muxes), where FSMs can wire directly. each RS has > its own FSM, each FSM stores its result in the RS, no > massive fan-in/out. that's kinda deceptive, since the muxes with the massive fan-out/in are now in the scheduler logic and the busses for carrying intermediate register values around (i temporarily forgot the name) rather than in the pipeline fan-in fan-out. for example, lets say the algorithm takes 10 cycles, and that every pipe/FSM needs 2 extra FUs in order to stay 100% utilized, and that there are 20 other FUs in the whole cpu: pipeline: 12 FUs -> pipeline 12 -> 1 mux on input 1 -> 12 mux on output (getting big) something -> 32 mux to fill all cpu FUs from reg reads/operand forwarding (pretty big) 10 FSMs (for same throughput): 3 FUs -> 10-step FSM 3 FUs -> 10-step FSM 3 FUs -> 10-step FSM ... 3 FUs -> 10-step FSM 30 FUs 10x 3 -> 1 muxes on input 10x 1 -> 3 muxes on output (meh, tiny, who cares) something bigger -> 50 mux to fill all cpu FUs from reg reads/operand forwarding (woah, huge!! aah, run!) > > bottom line you need to think of the context in which the > pipeline is used.

(In reply to Jacob Lifshay from comment #43) > (In reply to Luke Kenneth Casson Leighton from comment #42) > > (In reply to Jacob Lifshay from comment #41) > > > though, do we really need gfdiv? > > > > i'm sure i've seen modulo (remainder) used somewhere. > > yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial > remainder is used. that is *not* a gfbrem. bear in mind, i don't understand the difference, so i am trusting you to ask and answer this question: why then have several people published divmodGF2 algorithms in their GF2 libraries? they must be used for something. > well, if we're using a pipeline for gfbinv (which we totally want to if we > want AES to be usably fast without needing 7 zillion gfbinv FSMs), you've fundamentally misunderstood something about OoO design here which i will cover in a mailing list post. > there is no top half. gfbmul returns the entire answer. if we tried to make > a gfbmulh, it would always return 0 unless the reducing polynomial had a > degree of more than 64 (64 is the max in the current design). by that logic clmulh should not exist. there is something missing here which needs to be explained and clarified. > yeah, because i have no easy way of figuring out if it has changed > (automatic notifications would be nice!), and i don't like rereading stuff > i've already read 5 times in the hope that you might have changed it > meanwhile... git log is nearly useless since nearly everything has no useful > commit message. i re-read the same page sometimes 20 to 30 times a day. this is just good practice because each re-read results in different insights. i never consider it "beneath" me to re-read something that many times. > if it were instead > several totally separate FSMs for the same throughput, each FSM would need > the logic for all stages of the algorithm, making the data path much more > complex and possibly taking more time due to the complexity, reducing max > clock speed. it is perfectly possible for a FSM design to itself farm-in and fan-out to shared pipelines. > > > > [yes a FSM can be done which processes multiple chained combinatorial > > stages] > > > > RSes need to farm-in and fan-out to pipelines (which have > > to be BIG muxes), where FSMs can wire directly. each RS has > > its own FSM, each FSM stores its result in the RS, no > > massive fan-in/out. > > that's kinda deceptive, since the muxes with the massive fan-out/in are now > in the scheduler logic and the busses for carrying intermediate register > values around (i temporarily forgot the name) rather than in the pipeline > fan-in fan-out. i'll answer this on-list. the absolute inviolate rule is that you absolutely cannot have an untracked in-flight computation. every register involved in every computation has to be tracked, and every RS with an outstanding computation has to be frozen until the result hits a regfile.

as promised, a write-up on RSes and the balance between FSMs and pipelines https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-March/004528.html https://libre-soc.org/3d_gpu/pipeline_vs_fsms.jpg frankly i'm astounded that i can hold this in my head and follow it easily, yet i get confused on even simple algorithms and can't distinguish GF2 from GF(2^m) or follow why clmulh exists but multGF2 would always return zero for the HI bits :)

(In reply to Luke Kenneth Casson Leighton from comment #44) > (In reply to Jacob Lifshay from comment #43) > > (In reply to Luke Kenneth Casson Leighton from comment #42) > > > (In reply to Jacob Lifshay from comment #41) > > > > though, do we really need gfdiv? > > > > > > i'm sure i've seen modulo (remainder) used somewhere. > > > > yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial > > remainder is used. that is *not* a gfbrem. > > bear in mind, i don't understand the difference, so i am trusting you to > ask and answer this question: why then have several people published > divmodGF2 algorithms in their GF2 libraries? think of it like this: class GFp: """Galois Field GF(p) with the number of elements that is a prime `p`. these are isomorphic to the integers mod `p`, though it can be defined differently (e.g. by using `GFPrimeToTheK` with `k == 1`). """ def __init__(self, p, v=None): if v is None: v = 0 self.v = v % p self.p = p ... # no % or divmod here, since % would always return 0 class Polynomial(Generic[T]): """A polynomial with coefficients of type T. `coefficients[i]` is the coefficient for `x ** i` """ def __init__(self, coefficients=()): self.coefficients = list(coefficients) ... def __divmod__(self, rhs): """Polynomial divmod makes sense for all polynomials, even if it doesn't make sense to divmod a coefficient. """ ... @property def degree(self): retval = 0 for i, c in enumerate(self.coefficients): if c != 0: retval = i return retval class GFpPolynomial(Polynomial[GFp]): """Polynomial with coefficients that are GFp with matching p""" def __init__(self, coefficients, p): super().__init__(coefficients) self.p = p for i in self.coefficients: assert isinstance(i, GFp) assert i.p == p class GFPrimeToTheK: """aka. GF(q) where q = p ** k. Galois Field with number of elements that is a prime `p` raised to the power `k` where `k` is a positive integer. This isn't a polynomial (though mathematicians kinda mix their abstraction layers together whenever convenient), but is usually defined as the polynomial remainders you get from dividing `GFpPolynomial`s by a specific reducing `GFpPolynomial` `reducing_polynomial` of degree k. `p`, `k`, and the `reducing_polynomial` define a specific Galois Field, with the addition of the remainder polynomial, you get a specific Galois Field element. Note that the general shape of a Galois Field doesn't depend on what specific reducing polynomial you've selected (they are all isomorphic to each other for a specific `p` and `k`), so when people don't care, they just leave the reducing polynomial unstated and up to the reader to pick. """ def __init__(self, p, k, reducing_polynomial, remainder): self.p = p self.k = k self.reducing_polynomial = reducing_polynomial assert isinstance(reducing_polynomial, GFpPolynomial) assert isinstance(remainder, GFpPolynomial) assert p == reducing_polynomial.p assert k == reducing_polynomial.degree assert reducing_polynomial.is_irreducable() self.remainder = remainder % reducing_polynomial ... # no % or divmod here, since % would always return 0 > > they must be used for something. well, the top result in google for divmod "gf2" for me is: (changed to link to source code, yes i tried -- https doesn't seem to work) http://www.diegm.uniud.it/~bernardi/Software/Cplusplus/galoislib_doc/gf2_8H-source.html#l00201 00201 void divmod(GF<FieldSize> a, GF<FieldSize> ", GF<FieldSize> &rem) 00202 { 00203 quot = a*inv(*this); 00204 rem = GF<FieldSize>::zero(); 00205 } As you can see, the remainder is always zero. I'd guess divmod is used here for API compatibility with Rings, where a remainder is actually useful, such as the Ring of integers.

(In reply to Luke Kenneth Casson Leighton from comment #44) > > there is no top half. gfbmul returns the entire answer. if we tried to make > > a gfbmulh, it would always return 0 unless the reducing polynomial had a > > degree of more than 64 (64 is the max in the current design). > > by that logic clmulh should not exist. there is something missing > here which needs to be explained and clarified. clmul is polynomial multiplication where the coefficients are GF(2), not GF(2^k) multiplication. GF(2^k) multiplication has the additional step of calculating the polynomial remainder of dividing by the reducing polynomial -- essentially taking what would be the top half and merging it into the bottom half, leaving the top half now empty. This is similar to why the integers mod 16 have zeros everywhere except the least significant hex digit, because the mod 16 operation removed them, leaving zeros.

(In reply to Jacob Lifshay from comment #46) > I'd guess divmod is used here for API compatibility with Rings, where a > remainder is actually useful, such as the Ring of integers. rright. ok. yes, i have seen code which provides divmod then later a separate class embeds the one providing divmod into a GFPolynomial one. (In reply to Jacob Lifshay from comment #47) > clmul is polynomial multiplication where the coefficients are GF(2), not > GF(2^k) multiplication. the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication with the reducing polynomial set to 2^k" does that sound about right? because if it is, it is basically a way to "switch off" the "dividing by the reducing polynomial" > GF(2^k) multiplication has the additional step of calculating the polynomial > remainder of dividing by the reducing polynomial -- essentially taking what > would be the top half and merging it into the bottom half, leaving the top > half now empty. This is similar to why the integers mod 16 have zeros > everywhere except the least significant hex digit, because the mod 16 > operation removed them, leaving zeros. ok. vaguely starting to make sense. if the reducing polynomial is set to 2^XLEN (switched off in effect) does divmodGF2 and multGF2 effectively become "ring of integer" thingies? i would like to see both included, in effect, preferably without having to have separate HDL?

(In reply to Luke Kenneth Casson Leighton from comment #48) > (In reply to Jacob Lifshay from comment #46) > > > I'd guess divmod is used here for API compatibility with Rings, where a > > remainder is actually useful, such as the Ring of integers. > > rright. ok. yes, i have seen code which provides divmod then later > a separate class embeds the one providing divmod into a GFPolynomial > one. > > (In reply to Jacob Lifshay from comment #47) > > > clmul is polynomial multiplication where the coefficients are GF(2), not > > GF(2^k) multiplication. > > the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication > with the reducing polynomial set to 2^k" that's kinda like saying "a prime number p, but divisible by 32" -- it doesn't really make sense...anyway...if you took the GF(2^k) multiplication algorithm and replaced the reducing polynomial with 2^k where k == XLEN, then, yes, that is the result of clmul, since clmul's result is truncated to XLEN bits. clmulh and clmulr would require a larger reducing polynomial to avoid truncating the product. > does that sound about right? because if it is, it is basically a way > to "switch off" the "dividing by the reducing polynomial" yes. that said, clmul should be its own instruction, otherwise you would usually end up doing many cycles of extra work for the polynomial division by the reducing polynomial. multGF2 takes an iterative polynomial mul (iterative version of clmul) and merges it with a polynomial remainder operation, making it hard to run fast because remainders are serial operations (with most algorithms). clmul can and probably should be a combinatorial (with optional pipelining) multiply, like a Dadda multiplier -- much faster than remainder. In fact, we could modify our multiplier to also do clmul, just by masking out all carries. > > GF(2^k) multiplication has the additional step of calculating the polynomial > > remainder of dividing by the reducing polynomial -- essentially taking what > > would be the top half and merging it into the bottom half, leaving the top > > half now empty. This is similar to why the integers mod 16 have zeros > > everywhere except the least significant hex digit, because the mod 16 > > operation removed them, leaving zeros. > > ok. vaguely starting to make sense. > > if the reducing polynomial is set to 2^XLEN (switched off in effect) > does divmodGF2 and multGF2 effectively become "ring of integer" thingies? it would be a ring of truncated polynomials with GF(2) coefficients. A Ring is any math thing that has addition, multiplication, and works kinda like the integers in that it has an additive identity 0, and a multiplicative identity 1, and iirc associativity and commutativity. Rings can be made of things that are totally not integers, like here, where they are truncated polynomials. divmodGF2 was always a polynomial divmod with GF(2) coefficients, it wasn't ever operating on GF(2^n) things, other than being used to produce the polynomial inside a GF(2^n). It is totally unaffected by changing reducing polynomials, since those only matter for GF(2^n) operations. > > i would like to see both included, in effect, preferably without having > to have separate HDL? we could do that...clmul would be really slow though.

(In reply to Jacob Lifshay from comment #49) > > the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication > > with the reducing polynomial set to 2^k" > > that's kinda like saying "a prime number p, but divisible by 32" -- it > doesn't really make sense...anyway... :) > if you took the GF(2^k) multiplication > algorithm and replaced the reducing polynomial with 2^k where k == XLEN, > then, yes, that is the result of clmul, since clmul's result is truncated to > XLEN bits. clmulh and clmulr would require a larger reducing polynomial to > avoid truncating the product. okaay. > > does that sound about right? because if it is, it is basically a way > > to "switch off" the "dividing by the reducing polynomial" > > yes. that said, clmul should be its own instruction, yes absolutely, messing about with setting the polynomial to 2^XLEN before-hand would be a pain. > operations (with most algorithms). clmul can and probably should be a > combinatorial (with optional pipelining) multiply, like a Dadda multiplier > -- much faster than remainder. In fact, we could modify our multiplier to > also do clmul, just by masking out all carries. oooOoo. > it would be a ring of truncated polynomials with GF(2) coefficients. A Ring > is any math thing that has addition, multiplication, and works kinda like > the integers in that it has an additive identity 0, and a multiplicative > identity 1, and iirc associativity and commutativity. Rings can be made of > things that are totally not integers, like here, where they are truncated > polynomials. surprisingly, i get the Ring thing. i used it for a cryptographic algorithm in 2003. numbers that repeat, and are unique within their set, and by a beautiful coincidence happen to have consistent arithmetic. > divmodGF2 was always a polynomial divmod with GF(2) coefficients, it wasn't > ever operating on GF(2^n) things, other than being used to produce the > polynomial inside a GF(2^n). It is totally unaffected by changing reducing > polynomials, since those only matter for GF(2^n) operations. okaaay. > > > > i would like to see both included, in effect, preferably without having > > to have separate HDL? > > we could do that...clmul would be really slow though. best not then. basically what i'm trying to advocate is to keep gfdiv and gfmod but not name them... what did you use... a gfb- prefix? can you update comment #0 so i have a full list? i'll then double-check the wiki page for opcode allocations.

[edit: delete unnecessary context] (In reply to Luke Kenneth Casson Leighton from comment #50) > (In reply to Jacob Lifshay from comment #49) > > > > > > i would like to see both included, in effect, preferably without having > > > to have separate HDL? > > > > we could do that...clmul would be really slow though. > > best not then. > > basically what i'm trying to advocate is to keep gfdiv and gfmod but not > name them... what did you use... a gfb- prefix? we'd probably want to name it cldiv and clrem, since that way it matches with clmul. they would be polynomial divmod, where the polynomial coefficients are GF(2). gfb implies GF(2^n), which depends on the reducing polynomial in order to make sense. > > can you update comment #0 so i have a full list? i'll then double-check > the wiki page for opcode allocations. yeah. i'll leave the gfp (prime) list un-elaborated for now.

(In reply to Luke Kenneth Casson Leighton from comment #50) > > it would be a ring of truncated polynomials with GF(2) coefficients. A Ring > > is any math thing that has addition, multiplication, and works kinda like > > the integers in that it has an additive identity 0, and a multiplicative > > identity 1, and iirc associativity and commutativity. Rings can be made of > > things that are totally not integers, like here, where they are truncated > > polynomials. > > surprisingly, i get the Ring thing. i used it for a cryptographic algorithm > in 2003. numbers that repeat, and are unique within their set, and > by a beautiful coincidence happen to have consistent arithmetic. repetition isn't needed for a Ring, the consistent arithmetic part is the important part.

filled out comment 0

(In reply to Jacob Lifshay from comment #51) > we'd probably want to name it cldiv and clrem, since that way it matches > with clmul. they would be polynomial divmod, where the polynomial > coefficients are GF(2). that makes sense. like it. > gfb implies GF(2^n), which depends on the reducing polynomial in order to > make sense. > > > > can you update comment #0 so i have a full list? i'll then double-check > > the wiki page for opcode allocations. > > yeah. i'll leave the gfp (prime) list un-elaborated for now. do include them i need the names and qty. and basic descriptions. urr gfbmadd is 4op which are in short supply. have to lose Rc=1 to get them to fit.

(In reply to Luke Kenneth Casson Leighton from comment #54) > (In reply to Jacob Lifshay from comment #51) > > yeah. i'll leave the gfp (prime) list un-elaborated for now. > > do include them i need the names and qty. done. > and basic descriptions. uuh...should be pretty obvious?? i'll add them in the morning if it isn't. > > urr gfbmadd is 4op which are in short supply. have to lose Rc=1 > to get them to fit. Rc isn't super useful for most GF operations, eq is the only CR bit that could occasionally be useful.

(In reply to Jacob Lifshay from comment #55) > uuh...should be pretty obvious?? we've already established that to me it isn't :) am getting there but it would help to doublecheck. GF2 and 2^m are currently mixed in with each other in the wiki page > i'll add them in the morning if it isn't. star. eek. 4am for you. tsk tsk :) > Rc isn't super useful for most GF operations, eq is the only CR bit that > could occasionally be useful. true but madd seems not to have it for similar space reasons iirc. or is it OE=1 cant recall immediately

(In reply to Luke Kenneth Casson Leighton from comment #56) > (In reply to Jacob Lifshay from comment #55) > > > uuh...should be pretty obvious?? > > we've already established that to me it isn't :) i meant the stuff needed to allocate instruction encodings (e.g. gfpmadd uses 4 registers). I can fill in the pseudo-code on the wiki. > am getting there but it would help to doublecheck. yeah, I'll check, though it may have to wait till after the meeting today > GF2 and 2^m are currently mixed in with each other > in the wiki page > > > i'll add them in the morning if it isn't. > > star. eek. 4am for you. tsk tsk :) it was actually around 1-2am... DST starts on the 13th here. > > > Rc isn't super useful for most GF operations, eq is the only CR bit that > > could occasionally be useful. > > true but madd seems not to have it for similar space reasons iirc. > or is it OE=1 cant recall immediately icr, but the only madd instructions that might need OE=1 are the cl* instructions, all the GF(2^n) and GF(p) madd instructions can't overflow.

I added all the proposed ops to the wiki, I still need to add pseudo-code for several of them.

(In reply to Jacob Lifshay from comment #58) > I added all the proposed ops to the wiki, I still need to add pseudo-code > for several of them. The encodings still need to be decided, should I let you (lkcl) figure that out, or should I figure them out myself?

(In reply to Jacob Lifshay from comment #58) > I added all the proposed ops to the wiki, I still need to add pseudo-code > for several of them. -# Galois Field 2^M +# Instructions for Carry-less Operations aka. Polynomials with coefficients in `GF(2)` clmul(rh) were already added. prime-polynomial ones there's not enough room unless moving ternary to another major opcode (sigh), the ??madd ones take up vast amounts of space. another option, the entirety of gf* and cl* moves to its own major opcode. (In reply to Jacob Lifshay from comment #59) > The encodings still need to be decided, should I let you (lkcl) figure that > out, or should I figure them out myself? best let me do it (tomorrow) because i know the "rules" i applied to them and have the working-set in my head yes go for it on the pseudocode, do move the clmul(rh) ones to the (existing) carryless section, do keep an eye on that, read the whole page several times (now you know why i said "read it several times", because it's not just about the "changes", you have to know *everything* in the page, particularly if you want to correctly design the encodings. yes, it's a frickin lot of work!) yes drop pseudocode in, if it's missing.

(In reply to Luke Kenneth Casson Leighton from comment #60) > (In reply to Jacob Lifshay from comment #58) > > I added all the proposed ops to the wiki, I still need to add pseudo-code > > for several of them. > > -# Galois Field 2^M > +# Instructions for Carry-less Operations aka. Polynomials with coefficients > in `GF(2)` > > clmul(rh) were already added. > > prime-polynomial ones there's not enough room unless moving ternary > to another major opcode (sigh), the ??madd ones take up vast amounts > of space. we could change the madd ops to be 3-arg rather than 4: if we did that, we'd want to have additional separate variants that allow you to overwrite the term input rather than a factor input (inspired by x86's fma ops). # was gfpmsub gfpmsubf RT, RA, RB # overwrites factor (RT) = int_to_gfp((RT) * (RA) - (RB), GFPRIME) gfpmsubt RT, RA, RB # overwrites term (RT) = int_to_gfp((RA) * (RB) - (RT), GFPRIME) # was gfpmsubr gfpmsubrf RT, RA, RB # overwrites factor (RT) = int_to_gfp((RB) - (RT) * (RA), GFPRIME) gfpmsubrt RT, RA, RB # overwrites term (RT) = int_to_gfp((RT) - (RA) * (RB), GFPRIME) > > another option, the entirety of gf* and cl* moves to its own major opcode. > > (In reply to Jacob Lifshay from comment #59) > > > The encodings still need to be decided, should I let you (lkcl) figure that > > out, or should I figure them out myself? > > best let me do it (tomorrow) because i know the "rules" i applied to them > and have the working-set in my head > > yes go for it on the pseudocode, do move the clmul(rh) ones to the > (existing) carryless section, I intentionally started new sections, because i think the grouping/ordering makes more sense. I didn't delete the old sections yet cuz I figured we wanted to keep whatever info was there for reference until the new sections explain everything. > yes drop pseudocode in, if it's missing. I'm likely to also edit the existing/new pseudo-code to change naming and stuff to make it consistent with the instructions.

(In reply to Jacob Lifshay from comment #61) > > yes go for it on the pseudocode, do move the clmul(rh) ones to the > > (existing) carryless section, > > I intentionally started new sections, because i think the grouping/ordering > makes more sense. I didn't delete the old sections yet cuz I figured we > wanted to keep whatever info was there for reference until the new sections > explain everything. I want to rewrite the explanations to be more mathematically and conceptually consistent, since the current explanations confuse/mix a lot of the math concepts.

(In reply to Jacob Lifshay from comment #61) > (In reply to Luke Kenneth Casson Leighton from comment #60) > > (In reply to Jacob Lifshay from comment #58) > > > I added all the proposed ops to the wiki, I still need to add pseudo-code > > > for several of them. > > > > -# Galois Field 2^M > > +# Instructions for Carry-less Operations aka. Polynomials with coefficients > > in `GF(2)` > > > > clmul(rh) were already added. > > > > prime-polynomial ones there's not enough room unless moving ternary > > to another major opcode (sigh), the ??madd ones take up vast amounts > > of space. > > we could change the madd ops to be 3-arg rather than 4: i already explained clearly and non-negotiably the damaging consequences and cost of that. > if we did that, we'd want to have additional separate variants that allow > you to overwrite the term input rather than a factor input (inspired by > x86's fma ops). which in turn would require a rewrite of SV REMAP to match it and it took about 10 weeks to write. i am puzzled that you do not remember this, it was only about 3 days ago. matching the exact form (assembly, reg names) is an absolute highest inviolate top priority over everything else. except for the poly SPR which allows gf ops to match int arith > I intentionally started new sections, because i think the grouping/ordering > makes more sense. I didn't delete the old sections yet cuz I figured we > wanted to keep whatever info was there for reference until the new sections > explain everything. makes sense, do mark them as such tho > > yes drop pseudocode in, if it's missing. > > I'm likely to also edit the existing/new pseudo-code to change naming and > stuff to make it consistent with the instructions. yep go for it. drop into gf.py to test them as well. gf.py should become an easy to use/read reference, sandbox for other people, and basis of tests (eventually)

(In reply to Jacob Lifshay from comment #62) > I want to rewrite the explanations to be more mathematically and > conceptually consistent, since the current explanations confuse/mix a lot of > the math concepts. totally got it, v. relieved as well. started, missed some.

(In reply to Luke Kenneth Casson Leighton from comment #63) > (In reply to Jacob Lifshay from comment #61) > > we could change the madd ops to be 3-arg rather than 4: > > i already explained clearly and non-negotiably the damaging > consequences and cost of that. I meant the non-twin madd ops (which aren't used for fft iirc). the twin madd is used for fft and would still be 4-arg.

(In reply to Luke Kenneth Casson Leighton from comment #63) > (In reply to Jacob Lifshay from comment #61) > > if we did that, we'd want to have additional separate variants that allow > > you to overwrite the term input rather than a factor input (inspired by > > x86's fma ops). > > which in turn would require a rewrite of SV REMAP to match it > and it took about 10 weeks to write. couldn't we just use the 3-arg SVP64 prefix? this is exactly like rlwinm except it has 3 registers.

(In reply to Jacob Lifshay from comment #65) > (In reply to Luke Kenneth Casson Leighton from comment #63) > > (In reply to Jacob Lifshay from comment #61) > > > we could change the madd ops to be 3-arg rather than 4: > > > > i already explained clearly and non-negotiably the damaging > > consequences and cost of that. > > I meant the non-twin madd ops (which aren't used for fft iirc). ah yes, so sorry, yes those are used in Horizontal-First FFT REMAP mode. that was related to complex numbers, where the "usual" single-for-loop-on-an-instruction would only work if we had actual complex-mul-and-add [which would be complete overkill and swamp us with work] instead by using Horizontal-First it was possible to use *four* separate and distinct mul-and-adds with intermediary variables (as scalars) there was also iirc some interaction between fmadd.. it turned out to be possible to usefully set up all five REMAPs (RA RB RC, RT TS) and by a beautiful/horrendous coincidence only some of those are used by some operations (fmadds?) and then the next instruction used others, without needing a 2nd REMAP in between. having worked that all out in eight severely-intensive weeks, i don't want to go anywhere near it, again.

(In reply to Luke Kenneth Casson Leighton from comment #67) > (In reply to Jacob Lifshay from comment #65) > > (In reply to Luke Kenneth Casson Leighton from comment #63) > > > (In reply to Jacob Lifshay from comment #61) > > > > we could change the madd ops to be 3-arg rather than 4: > > > > > > i already explained clearly and non-negotiably the damaging > > > consequences and cost of that. > > > > I meant the non-twin madd ops (which aren't used for fft iirc). > > ah yes, so sorry, yes those are used in Horizontal-First FFT REMAP > mode. that was related to complex numbers, where the "usual" > single-for-loop-on-an-instruction would only work if we had actual > complex-mul-and-add [which would be complete overkill and swamp > us with work] well, since complex numbers don't use cl*, gfb*, or gfp*, are you aware of any other issues we might encounter? if not, I don't see why changing the madd ops to 3-arg wouldn't work. if we're too low on encoding space, we could have just the most widely used *madd be 4-arg for extra flexibility, and *msub and *msubr would be 3-arg.

also, whatever problems we encounter for 3-arg madd are likely shared with ternlogi which is also 3-arg that both reads/writes RT.

(In reply to Jacob Lifshay from comment #69) > also, whatever problems we encounter for 3-arg madd are likely shared with > ternlogi which is also 3-arg that both reads/writes RT. no because i cannot see a reason why anyone would use ternlogi with the SV REMAP FFT schedules (In reply to Jacob Lifshay from comment #68) > well, since complex numbers don't use cl*, gfb*, or gfp*, are you aware of > any other issues we might encounter? if not, I don't see why changing the > madd ops to 3-arg wouldn't work. please: stop pushing on this. please accept the answer no please do not waste my time persistently asking for an explanation when an explanation has been provided please accept the explanation please stop wasting your time pursuing something to which you have been told "no" five times.

https://www.nayuki.io/page/number-theoretic-transform-integer-dft https://www.nayuki.io/res/number-theoretic-transform-integer-dft/numbertheoretictransform.py well, i am a blithering idiot. NTT uses prime modulo arithmetic. it was the two numbers being identical in GF2 that tipped me off. that did not strike me as particularly useful. vastly more investigation needed. sigh.

(In reply to Luke Kenneth Casson Leighton from comment #71) > https://www.nayuki.io/page/number-theoretic-transform-integer-dft > https://www.nayuki.io/res/number-theoretic-transform-integer-dft/ > numbertheoretictransform.py > > well, i am a blithering idiot. NTT uses prime modulo arithmetic. that's GF(p) FFT. I expect carry-less FFT and GF(2^n) FFT to be useful as well.

https://www.nayuki.io/page/montgomery-reduction-algorithm apparently speeds up chains of modulo multiplies. i recognise the reciprocal function in the python version as the gf_invert which did not work.

(In reply to Jacob Lifshay from comment #72) > that's GF(p) FFT. sigh that is the correct one used for Reed Solomon and ECC25519 > I expect carry-less FFT and GF(2^n) FFT to be useful as well. Frobenius Additive FFT apparently. https://news.ycombinator.com/item?id=21151646

https://github.com/fast-crypto-lab/Frobenius_AFFT that's totally opaque to me until i realised, it is FFT with mul-add as modulo-mul and XOR. then i also remembered that it is DCT that has the mul-with-add-and-sub (which is meaningless for GF2 because add and sub are the same) whereas FFT is: RT = MUL(RA, RC) + RB RS = RA - RB

(In reply to Luke Kenneth Casson Leighton from comment #74) > (In reply to Jacob Lifshay from comment #72) > > > that's GF(p) FFT. > > sigh that is the correct one used for Reed Solomon and ECC25519 Reed Solomon uses either carry-less or GF(2^n) iirc. ECC uses either GF(2^n) or GF(p) (depending on the variety of ECC), except that the word size is usually > 64-bits, so we may need to use carry-less or big integer operations there.

(In reply to Luke Kenneth Casson Leighton from comment #75) > https://github.com/fast-crypto-lab/Frobenius_AFFT i'm not going completely mad. other way round: Frobenius Additive FFT appears to use the same DCT mul-add-sub Floating Twin Multiply-Add DCT [Single] A-Form fdmadds FRT,FRA,FRC,FRB (Rc=0) fdmadds. FRT,FRA,FRC,FRB (Rc=1) Pseudo-code: FRS <- FPADD32(FRA, FRB) sub <- FPSUB32(FRA, FRB) FRT <- FPMUL32(FRC, sub) https://libre-soc.org/openpower/isa/svfparith/ this is enough operations to justify moving them to their own major opcode otherwise they overwhelm the bitmanip one. i will see how much space is in the FP opcodes

gfbmadd RT, RA, RB, RC (RT) = gfbadd(gfbmul((RA), (RB)), (RC)) jacob you modified the ordering and naming of the registers. when i said "these have to be exactly as the int/fp operations", it means, "these HAVE to be EXACTLY as the int/fp operations". please therefore restore them to the original conventions. please look at the Power ISA spec (madd, fmadd, fmadds), observe that the naming is as i originally specified and make absolutely certain that when implementing these that they are wholly, one hundred percent, without exception, identically laid out and utilising the exact same register order, naming, and conventions. we do not have any choice here, these conventions have already been predefined by the previous ISA Architects. plus SV REMAP is critically relying on the naming and ordering.

(In reply to Luke Kenneth Casson Leighton from comment #78) > gfbmadd RT, RA, RB, RC > (RT) = gfbadd(gfbmul((RA), (RB)), (RC)) > > jacob you modified the ordering and naming > of the registers. well, what I put exactly matches maddld: maddld RT,RA.RB,RC prod0:127 ← (RA) × (RB) sum0:127 ← prod + EXTS(RC) RT ← sum64:127

(In reply to Jacob Lifshay from comment #79) > well, what I put exactly matches maddld: > maddld RT,RA.RB,RC > prod0:127 ← (RA) × (RB) > sum0:127 ← prod + EXTS(RC) > RT ← sum64:127 drat, drat. and i based the GF twin ops on FP ones... damnit that's going to have ramifications for SV REMAP (madd* is not implemented yet) last thing i want to dig into right now. not sure what's best to do here. the twin ops can be left for later, as long as the opcode space is allocated.

I added separate python files for clmul, clmulh, clmulr, and cldivrem (helper function for cldiv and clrem), and added a unittest-based test (test_cl.py) for them all. https://git.libre-soc.org/?p=libreriscv.git;a=tree;f=openpower/sv/bitmanip;hb=4852bd74b9e2ec7f4c8ae8c0677daa4402314a1b I also changed bitmanip.mdwn to embed the python files into it, rather than having separately maintained code in the markdown, allowing us to test everything and import it much easier, since I'm planning on using the cf* python code in the definitions of the GF(2^n) ops. I think this approach is better than having to constantly keep gf2.py and bitmanip.mdwn in sync and having the DIY tests in gf2.py where the user has to manually compare print outputs.

I added python implementations and tests for all the gfb* instructions

(In reply to Jacob Lifshay from comment #81) > I think this approach is better than having to constantly keep gf2.py and > bitmanip.mdwn in sync and having the DIY tests in gf2.py where the user has > to manually compare print outputs. absolutely. (In reply to Jacob Lifshay from comment #82) > I added python implementations and tests for all the gfb* instructions great. btw i altered the twin-instruction to be mul-add for RT and just add for RS. this matches with the DCT twin-instruction interestingly, and i believe is the basis of additive NTT. about which: i am beginning to question the value of adding GF-Prime. reason is: ecc25519 requires 256-bit arithmetic and we ain't addin' tha' so have to use bigint math anyway. therefore if doing bigint math is there anything that can be added which will speed that up?

(In reply to Luke Kenneth Casson Leighton from comment #83) > (In reply to Jacob Lifshay from comment #82) > > I added python implementations and tests for all the gfb* instructions > > great. btw i altered the twin-instruction to be mul-add for RT > and just add for RS. this matches with the DCT twin-instruction > interestingly, and i believe is the basis of additive NTT. imho that needs some testing to ensure it gives the same answers as other algorithms, such as that one I found in sympy, otherwise we may need both (or even more) kinds of twin-mul-add-subr. > > about which: i am beginning to question the value of adding GF-Prime. yeah, it may not be used much by common cryptography, but it definitely isn't useless...assuming it's fast enough, it would definitely be useful for modular arithmetic for small (<=64-bit) moduli, which is used in a bunch of core algorithms for computer algebra systems (e.g. polynomial factorization by the currently fastest known method, which is used for exact computations involving algebraic numbers (the "real" type in SMTLIB2, the interface language used by most of the formal proof software we've been using such as z3, yices, etc.). algebraic numbers are also used by the simple-soft-float library I wrote. algebraic numbers are what you get if you ask a symbolic math library to find all roots of some arbitrary polynomial with rational coefficients, which is pretty core functionality to solving most equations.) > reason is: ecc25519 requires 256-bit arithmetic and we ain't addin' > tha' so have to use bigint math anyway. > > therefore if doing bigint math is there anything that can be added > which will speed that up? hmm, for generic bigint math, a 128 by 64-bit div and rem (used by bigint div/rem)? openpower only provides a version where the 128-bit dividend has all zeros in the lower half, requiring a bunch of extra instructions to resynthesize what the hardware can already do at low additional cost. A 128-bit to 64-bit shift like x86 shld https://www.felixcloutier.com/x86/shld would also be nice (used by bigint shifting). also, some instructions for getting the high half of a 64-bit * 64-bit + 128-bit muladd/mulsub (used for bigint * 64-bit number, and probably bigint*bigint), but that would require some thought to avoid being 4-in 1-out instructions.

I added python implementations for all the gfp* instructions, as well as tests (which pass). I also added descriptions of the FFT twin mul-adds (all 3 of them) so that I could finish deleting the redundant sections. According to Wikipedia, NTT is the correct name only for GF(p) (or other rings of integers mod m with some restrictions), the others are just weird DFTs. https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)#Number-theoretic_transform It occurred to me, that the gfp* instructions should really be named the modular arithmetic instructions, since they're all useful for composite moduli too. GF(p) is only defined if p is a prime. gfpinv will have more inputs than just zero that don't work if the modulus isn't a prime.