Bug 782 - add galois field bitmanip instructions
Summary: add galois field bitmanip instructions
Status: CONFIRMED
Alias: None
Product: Libre-SOC's second ASIC
Classification: Unclassified
Component: source code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Jacob Lifshay
URL: https://libre-soc.org/openpower/sv/bi...
Depends on: 785 786 784
Blocks: 741 771
  Show dependency treegraph
 
Reported: 2022-03-03 03:09 GMT by Jacob Lifshay
Modified: 2023-09-08 03:51 BST (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 3000
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 741
child tasks for budget allocation: 784 785 786
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jacob Lifshay 2022-03-03 03:09:05 GMT
bug #784 -- EUR 3000:
GF(2) polynomial ops:
* clmul, clmulh, clmulr
* cldiv, clrem (maybe) -- decide what divide by zero does
* clmadd (maybe)
* cltmadd twin for fft (maybe)

cladd is xor, so not needed

bug #785:

GF(2^n) ops:

* gfbredpoly # sets reducing polynomial SPR `GFBREDPOLY`
    unless this is an immediate op, mtspr is completely sufficient.
* MSR bit to be reserved to indicate Poly SPR in use
  (avoids need to save Poly SPR on contextswitch)
* gfbmul RT, RA, RB
* gfbmadd RT, RA, RB, RC
* gfbtmadd twin for fft
* gfbinv rt, ra
    input 0 gives result 0 even though that is division by zero,
    that's what's needed for AES.

bug #786:

GF(p) ops:

probably should be different SPR than for GF(2^n)

* gfpadd, gfpsub
* gfpmul
* gfpinv (decide what div by 0 does)
* gfpmadd, gfpmsub, gfpmsubr (sub reversed)
* gfpmaddsubr (for fft)

https://libre-soc.org/openpower/sv/gf2.py/
Comment 1 Jacob Lifshay 2022-03-03 03:31:10 GMT
I could be wrong, but from what I know of how Galois fields work, the degree parameter is entirely derived from the reducing polynomial:
if the reducing polynomial is the AES one:
x^8 + x^4 + x^3 + x + 1
or, in binary: 100011011  in hex: 0x11B
So, here, the degree=8 since that's the largest exponent in the polynomial.

In my opinion, we should just have the reducing polynomial as an input or immediate, and derive the degree from it, rather than having degree being a separate input/immediate and wasting encoding space.

Alternatively, we could have the degree fixed to XLEN, and just have the reducing polynomial specify the bits below the MSB, with the MSB assumed to always be one.
so, the AES polynomial would be specified as 0x1B with the top bit assumed to be a 1 since it would only be used with XLEN=8.

In either case, having both degree and reducing polynomials as inputs as we do currently seems quite suboptimal.

If this sounds good, I'll make the changes needed to have the reducing polynomial as an input, and degree derived from it.
Comment 2 Luke Kenneth Casson Leighton 2022-03-03 09:36:11 GMT
(In reply to Jacob Lifshay from comment #1)
> I could be wrong, but from what I know of how Galois fields work, the degree
> parameter is entirely derived from the reducing polynomial:

iirc some implementations have an implicit top bit removed similar
to how 

> If this sounds good, I'll make the changes needed to have the reducing
> polynomial as an input, and degree derived from it.

doesn't feel right.

i definitely need the full parameter space similar to FMAC, a full
4-operand MAC, for the SVP64 NTT to work.

(NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier
Transform just using GF arithmetic)

https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform

note in that page that:

     The modulus in a Number Theoretic Transfer doesn't necessarily have to be prime.


this tends to point towards using SPRs.
Comment 3 Jacob Lifshay 2022-03-03 16:14:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > I could be wrong, but from what I know of how Galois fields work, the degree
> > parameter is entirely derived from the reducing polynomial:
> 
> iirc some implementations have an implicit top bit removed similar
> to how 
> 
> > If this sounds good, I'll make the changes needed to have the reducing
> > polynomial as an input, and degree derived from it.
> 
> doesn't feel right.
> 
> i definitely need the full parameter space similar to FMAC, a full
> 4-operand MAC, for the SVP64 NTT to work.

then we need a gfmadd instruction.

we probably also want a gfdiv instruction, but that may be too complex.

also, gfadd is just xor iirc...so maybe we don't need that one.

> 
> (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier
> Transform just using GF arithmetic)
> 
> https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform
> 
> note in that page that:
> 
>      The modulus in a Number Theoretic Transfer doesn't necessarily have to
> be prime.

yup, here, I'm assuming the modulus (which ends up being the number of values in the field (not to be confused with the reducing polynomial, which is a different kind of modulus)) is 2^deg(reducing-polynomial).

also, iirc it's Number Theoretic *Transform* (like Fourier Transform).

> this tends to point towards using SPRs.

ok, sounds good!

what do you think of just using CTR? though, otoh if we had a dedicated SPR we could precalculate stuff (degree, maybe some implementations want to precalculate some tables) when the SPR was written.

In order to support GF(2^XLEN) as well as smaller fields, I think we should pick an encoding for the reducing polynomial and degree:

def decode(v, XLEN):
    """returns the decoded coefficient list in LSB to MSB order,
        len(retval) == degree + 1"""
    v &= ((1 << XLEN) - 1) # mask to XLEN bits
    if v == 0 or v == 2: # GF(2)
        return [0, 1] # degree = 1, poly = x
    if v & 1:
        degree = floor_log2(v)
    else:
        # all reducing polynomials of degree > 1 must have the LSB set,
        # because they must be irreducible polynomials (meaning they
        # can't be factored), if the LSB was clear, then they would
        # have x as a factor. Therefore, we can reuse the LSB clear
        # to instead mean the polynomial has degree XLEN.
        degree = XLEN
        v |= 1 << XLEN
        v |= 1 # LSB must be set
    return [(v >> i) & 1 for i in range(1 + degree)]
Comment 4 Jacob Lifshay 2022-03-03 16:34:00 GMT
also, for naming, we'd probably want to name these instructions gfbmadd, gfbmul, gfbinv, etc. (Galois Field (binary) ...) because they're specifically for the binary fields GF(2^n).

We may later want to add support for odd-prime fields GF(p), since they're also used by some forms of elliptic curve cryptography:
* gfpadd -- Galois Field (prime) addition
* gfpsub -- Galois Field (prime) subtraction
* gfpmul
* gfpmadd
* gfpmsub

we probably won't want division or inverse cuz iirc those are crazy complex.
Comment 5 Jacob Lifshay 2022-03-03 16:42:04 GMT
(In reply to Jacob Lifshay from comment #4)
> we probably won't want division or inverse cuz iirc those are crazy complex.

modular inverse algorithm:
https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers
Comment 6 Luke Kenneth Casson Leighton 2022-03-04 01:57:14 GMT
(In reply to Jacob Lifshay from comment #3)
 
> then we need a gfmadd instruction.

and a twin version.

> we probably also want a gfdiv instruction, but that may be too complex.

i knoow, aren't they fun?  best way is inverse then mul.
galileo something inverse.

https://libre-soc.org/openpower/isa/svfparith/

twin butterfly.  

> also, gfadd is just xor iirc...so maybe we don't need that one.

yes because on exceeding polynomial you need a cleanup.
as long as both inputs are below polynomial the cleanup
is i believe a simple check.

> > 
> > (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier
> > Transform just using GF arithmetic)
> > 
> > https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform
> > 
> > note in that page that:
> > 
> >      The modulus in a Number Theoretic Transfer doesn't necessarily have to
> > be prime.
> 
> yup, here, I'm assuming the modulus (which ends up being the number of
> values in the field (not to be confused with the reducing polynomial, which
> is a different kind of modulus)) is 2^deg(reducing-polynomial).

do they mean e.g. Gf 2^5 
 
> also, iirc it's Number Theoretic *Transform* (like Fourier Transform).

doh. how did i end up typing transfer??

> > this tends to point towards using SPRs.
> 
> ok, sounds good!
> 
> what do you think of just using CTR?

see sv.bc for how much functionality would be lost.
sv.bc is stunningly powerful, a "simple" bc gets something
mad like 128 additional options half of which rely on CTR.


>         # all reducing polynomials of degree > 1 must have the LSB set,
>         # because they must be irreducible polynomials (meaning they
>         # can't be factored),


take a look again at the wikipedia NTT talk page apparently nonprime
polynomials are actually useful, and LSB=0 is nonprime.

oh right GF 2^5 i.e.the degree.

can you doublexheck to see if there are any circumstances where
a polyynomial with LSB=0 is useful or not?
Comment 7 Luke Kenneth Casson Leighton 2022-03-04 02:13:20 GMT
(In reply to Jacob Lifshay from comment #4)

> we probably won't want division or inverse cuz iirc those are crazy complex.

isn't it beautifully completely nuts? :)  i tracked down a simple version
based on long division, see bitmanip page section "gf invert", it looks
half decent, reminds me of the arithmetic div pipeline and of CORDIC.

the ultimate goal here, the big test, is to do NTT in around 11 instructions
similar to DCT. the entire main NTT triple loop with SV REMAP, on top of
the appropriate twin butterfly ops.

(In reply to Jacob Lifshay from comment #5)

> modular inverse algorithm:
> https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers

euclidean not galileo :)

see gf invert section you will likely find it is that one.

div can then be invert followed by mul.  no accuracy is lost because
both are ring fields. when prime polynomial.
Comment 8 Jacob Lifshay 2022-03-04 11:06:16 GMT
Please don't edit old comments...I didn't notice till now that you had changed what you wrote...

(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #3)
>  
> > then we need a gfmadd instruction.
> 
> and a twin version.

If you mean something that can use twin predication, I'm not sure how we'd do that, since basically all these ops have multiple inputs, iirc thereby using up the space in the SVP64 encoding that could have been used for twin predication.
> 
> > we probably also want a gfdiv instruction, but that may be too complex.
> 
> i knoow, aren't they fun?  best way is inverse then mul.
> galileo something inverse.
> 
> https://libre-soc.org/openpower/isa/svfparith/
> 
> twin butterfly.  
> 
> > also, gfadd is just xor iirc...so maybe we don't need that one.
> 
> yes because on exceeding polynomial you need a cleanup.
> as long as both inputs are below polynomial the cleanup
> is i believe a simple check.

I'd just expect the inputs to nearly always be in-range (100% of the time in >99% of programs), meaning that xor is completely sufficient. I don't think this justifies a separate instruction.

If we need an instruction to change some input to fit in range (cleanup), we can just do gfmadd with both multiplicands be constant zeros (RA|0), which the decoder can easily decode specially to skip the multiplication if we really need the extra speed.

> > > 
> > > (NTT, part of Reed Solomon, is the same algorithm as Discrete Fourier
> > > Transform just using GF arithmetic)
> > > 
> > > https://en.wikipedia.org/wiki/Talk%3ANumber-theoretic_transform
> > > 
> > > note in that page that:
> > > 
> > >      The modulus in a Number Theoretic Transfer doesn't necessarily have to
> > > be prime.
> > 
> > yup, here, I'm assuming the modulus (which ends up being the number of
> > values in the field (not to be confused with the reducing polynomial, which
> > is a different kind of modulus)) is 2^deg(reducing-polynomial).
> 
> do they mean e.g. Gf 2^5 

yes, when you have any degree 5 reducing polynomial. If you have a polynomial that is some other degree, such as 6, then the field described is some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc.

> > what do you think of just using CTR?
> 
> see sv.bc for how much functionality would be lost.
> sv.bc is stunningly powerful, a "simple" bc gets something
> mad like 128 additional options half of which rely on CTR.

yeah, I wasn't thinking that through. CTR is too important for looping.
> 
> 
> >         # all reducing polynomials of degree > 1 must have the LSB set,
> >         # because they must be irreducible polynomials (meaning they
> >         # can't be factored),
> 
> 
> take a look again at the wikipedia NTT talk page apparently nonprime
> polynomials are actually useful, and LSB=0 is nonprime.

as you already probably figured out, the non-prime part is the size of the field (so for GF(27), the field has size 27), not the reducing polynomial.

AFAIK the reducing polynomial always 
> 
> oh right GF 2^5 i.e.the degree.
> 
> can you doublexheck to see if there are any circumstances where
> a polyynomial with LSB=0 is useful or not?

there's one, when you're using GF(p) for some prime p. in this case, the only applicable one is GF(2), which I already have the special case for.

The polynomial must always be irreducible (the polynomial equivalent of being a prime number), otherwise you don't get a Galois Field (you get a Ring instead), which means that the LSB must be non-zero for degree > 2 polynomials, otherwise it would be divisible by the polynomial x^1+0, which would mean that your polynomial wasn't irreducible after all, making your nice Galois Field no longer a Field, essentially breaking the math. Think of it as like dividing by zero.
Comment 9 Jacob Lifshay 2022-03-04 11:16:32 GMT
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #4)
> 
> > we probably won't want division or inverse cuz iirc those are crazy complex.

For GF(p) for prime p (we can assume that p can get quite large, close to 2^64), a modular inversion essentially a loop over integer divmod and multiplication, divmod itself is already kinda a loop. Then we get the outer loops from SV. We're going to get a like 5 or 6-deep loop all running from 1 instruction, which imho is kinda overboard. That sounds like CISC, but even more so -- super CISC. Next we'll be adding a javascript parser as 1 instruction...:P
> 
> isn't it beautifully completely nuts? :)  i tracked down a simple version
> based on long division, see bitmanip page section "gf invert", it looks
> half decent, reminds me of the arithmetic div pipeline and of CORDIC.

yeah, that's for binary GF. prime GF is a whole other beast.

> the ultimate goal here, the big test, is to do NTT in around 11 instructions
> similar to DCT. the entire main NTT triple loop with SV REMAP, on top of
> the appropriate twin butterfly ops.

well, iirc NTT is only gfmadd, no gfdiv for prime GF needed there.
> 
> (In reply to Jacob Lifshay from comment #5)
> 
> > modular inverse algorithm:
> > https://en.m.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers
> 
> euclidean not galileo :)
> 
> see gf invert section you will likely find it is that one.
> 
> div can then be invert followed by mul.  no accuracy is lost because
> both are ring fields. when prime polynomial.

yeah, for binary GF, gfinv or gfdiv is totally feasible, if a bit slow.
Comment 10 Luke Kenneth Casson Leighton 2022-03-04 12:08:01 GMT
(In reply to Jacob Lifshay from comment #8)
> Please don't edit old comments...I didn't notice till now that you had
> changed what you wrote...
> 
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > (In reply to Jacob Lifshay from comment #3)
> >  
> > > then we need a gfmadd instruction.
> > 
> > and a twin version.
> 
> If you mean something that can use twin predication,

nono, not predication, two ops in one instruction. look at the pseudocode
https://libre-soc.org/openpower/isa/svfparith/

the reason why i was able to do inplace FFT is the twin butterfly ops.
otherwise you need a temp reg and two instructions and obviously that
doesnt fit with Horizontal SVP64.

> > yes because on exceeding polynomial you need a cleanup.
> > as long as both inputs are below polynomial the cleanup
> > is i believe a simple check.
> 
> I'd just expect the inputs to nearly always be in-range (100% of the time in
> >99% of programs), meaning that xor is completely sufficient. I don't think
> this justifies a separate instruction.

mrhhm mumble mumble we need to be 100% certain.

> > do they mean e.g. Gf 2^5 
> 
> yes, when you have any degree 5 reducing polynomial. If you have a
> polynomial that is some other degree, such as 6, then the field described is
> some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc.

ok excellent. i knew that, but sometimes reading wikipedia talk they help
you lose track of the facts rather than give you insights :)
 
> The polynomial must always be irreducible (the polynomial equivalent of
> being a prime number), otherwise you don't get a Galois Field (you get a
> Ring instead), which means that the LSB must be non-zero for degree > 2
> polynomials, otherwise it would be divisible by the polynomial x^1+0, which
> would mean that your polynomial wasn't irreducible after all, making your
> nice Galois Field no longer a Field, essentially breaking the math. Think of
> it as like dividing by zero.

eep. bad.

and also fascinating.

ok then i like the encoding.

(In reply to Jacob Lifshay from comment #9)

> > isn't it beautifully completely nuts? :)  i tracked down a simple version
> > based on long division, see bitmanip page section "gf invert", it looks
> > half decent, reminds me of the arithmetic div pipeline and of CORDIC.
> 
> yeah, that's for binary GF. prime GF is a whole other beast.

ok i get the message i need to look this up. if you know of some simple
easy to understand implementation do put it in the wiki.

give me a day or a few to read up on it, any papers/academic tutorials
you know of do let me know.
Comment 11 Luke Kenneth Casson Leighton 2022-03-04 12:20:14 GMT
correct me if wrong i believe it quite simple to show XOR insufficient

polynomial 0b110001
x          0b110000
y          0b001111
XOR        0b111111
clearly this is greater than the polynomial which modulo you are
supposed to XOR and instead get the correct answer:

polynomial 0b110001
x          0b110000
y          0b001111
XOR        0b111111 this greater than polynomial
subtract p 0b110001 use XOR to subtract
answer     0b001110

so i think you'll find that we can get away with:

* assumption that both RA and RB are less than
  polynomial
* do XOR
* compare (arithmetic cmp) with polynomial
* if greater, XOR with polynomial

no huge iteration needed, just one extra XOR *as long as* both
RA and RB are 0 <= RA/RB < polynomial
Comment 12 Jacob Lifshay 2022-03-04 12:20:47 GMT
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #8)
> > Please don't edit old comments...I didn't notice till now that you had
> > changed what you wrote...
> > 
> > (In reply to Luke Kenneth Casson Leighton from comment #6)
> > > (In reply to Jacob Lifshay from comment #3)
> > >  
> > > > then we need a gfmadd instruction.
> > > 
> > > and a twin version.
> > 
> > If you mean something that can use twin predication,
> 
> nono, not predication, two ops in one instruction. look at the pseudocode
> https://libre-soc.org/openpower/isa/svfparith/

will look in the morning...I thought that was just for complex numbers tho. GF and NTT both operate on modular integers/polynomials rather than complex numbers, so shouldn't need a separate element for real/imag components.
> 
> the reason why i was able to do inplace FFT is the twin butterfly ops.
> otherwise you need a temp reg and two instructions and obviously that
> doesnt fit with Horizontal SVP64.
> 
> > > yes because on exceeding polynomial you need a cleanup.
> > > as long as both inputs are below polynomial the cleanup
> > > is i believe a simple check.
> > 
> > I'd just expect the inputs to nearly always be in-range (100% of the time in
> > >99% of programs), meaning that xor is completely sufficient. I don't think
> > this justifies a separate instruction.
> 
> mrhhm mumble mumble we need to be 100% certain.
> 
> > > do they mean e.g. Gf 2^5 
> > 
> > yes, when you have any degree 5 reducing polynomial. If you have a
> > polynomial that is some other degree, such as 6, then the field described is
> > some instance of GF(2^6) for degree 6, GF(2^3) for degree 3, etc.
> 
> ok excellent. i knew that, but sometimes reading wikipedia talk they help
> you lose track of the facts rather than give you insights :)

yay :)

> > The polynomial must always be irreducible (the polynomial equivalent of
> > being a prime number), otherwise you don't get a Galois Field (you get a
> > Ring instead), which means that the LSB must be non-zero for degree > 2
> > polynomials, otherwise it would be divisible by the polynomial x^1+0, which
> > would mean that your polynomial wasn't irreducible after all, making your
> > nice Galois Field no longer a Field, essentially breaking the math. Think of
> > it as like dividing by zero.
> 
> eep. bad.
> 
> and also fascinating.
> 
> ok then i like the encoding.

:)
> 
> (In reply to Jacob Lifshay from comment #9)
> 
> > > isn't it beautifully completely nuts? :)  i tracked down a simple version
> > > based on long division, see bitmanip page section "gf invert", it looks
> > > half decent, reminds me of the arithmetic div pipeline and of CORDIC.
> > 
> > yeah, that's for binary GF. prime GF is a whole other beast.
> 
> ok i get the message i need to look this up. if you know of some simple
> easy to understand implementation do put it in the wiki.
> 
> give me a day or a few to read up on it, any papers/academic tutorials
> you know of do let me know.

well, it took me like 2-3mo to finally let it sink in...that was when I was trying to create an algebraic number library for use in the FP math reference implementation a few years ago. Turns out polynomials of modular integers show up a bunch when you want to efficiently factor polynomials with integer coefficients.

Unfortunately, all I have at the moment is Wikipedia. Check out the elliptic curve page too, that's where I got the terms prime/binary GF.
Comment 13 Jacob Lifshay 2022-03-04 12:23:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> correct me if wrong i believe it quite simple to show XOR insufficient
> 
> polynomial 0b110001
> x          0b110000
> y          0b001111
> XOR        0b111111
> clearly this is greater than the polynomial which modulo you are
> supposed to XOR and instead get the correct answer:

well, that's because you started with something bigger you should have...remember, these are polynomials, so any bit set means you divide it out.

> polynomial 0b110001
> x          0b110000

here x would have to be reduced to something without the top bit set, only after reducing it is it a value you'd actually expect to have as an input. I'm not working out what it'd reduce to at the moment, it's 4:30am here.
Comment 15 Luke Kenneth Casson Leighton 2022-03-04 12:30:15 GMT
(In reply to Jacob Lifshay from comment #13)

> here x would have to be reduced to something without the top bit set, only
> after reducing it is it a value you'd actually expect to have as an input.
> I'm not working out what it'd reduce to at the moment, it's 4:30am here.

brainmelt :)

when not 4:30am, is there any reason to support binary? i mean if prime is
just XOR then pffh.
Comment 16 Jacob Lifshay 2022-03-05 00:04:58 GMT
(In reply to Jacob Lifshay from comment #12)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > nono, not predication, two ops in one instruction. look at the pseudocode
> > https://libre-soc.org/openpower/isa/svfparith/
> 
> will look in the morning...I thought that was just for complex numbers tho.
> GF and NTT both operate on modular integers/polynomials rather than complex
> numbers, so shouldn't need a separate element for real/imag components.

ahh, now I get what you mean!

NTT inner loop:
https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/sympy/discrete/transforms.py#L173

    h = 2
    while h <= n:
        hf, ut = h // 2, n // h
        for i in range(0, n, h):
            for j in range(hf):
                u, v = a[i + j], a[i + j + hf]*w[ut * j]
                a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p
        h *= 2
Comment 17 Jacob Lifshay 2022-03-05 00:07:46 GMT
(In reply to Jacob Lifshay from comment #16)
> NTT inner loop:
> https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/
> sympy/discrete/transforms.py#L173
> 
>     h = 2
>     while h <= n:
>         hf, ut = h // 2, n // h
>         for i in range(0, n, h):
>             for j in range(hf):
>                 u, v = a[i + j], a[i + j + hf]*w[ut * j]
>                 a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p
>         h *= 2

looks like gfpmuladdsub
Comment 18 Jacob Lifshay 2022-03-05 00:12:25 GMT
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #13)
> 
> > here x would have to be reduced to something without the top bit set, only
> > after reducing it is it a value you'd actually expect to have as an input.
> > I'm not working out what it'd reduce to at the moment, it's 4:30am here.
> 
> brainmelt :)
> 
> when not 4:30am, is there any reason to support binary? i mean if prime is
> just XOR then pffh.

you got them mixed up. binary gf add is xor. prime gf add is integer addition followed by subtracting the prime if the sum is too big.
Comment 19 Luke Kenneth Casson Leighton 2022-03-05 00:29:38 GMT
(In reply to Jacob Lifshay from comment #17)
> (In reply to Jacob Lifshay from comment #16)
> > NTT inner loop:
> > https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/
> > sympy/discrete/transforms.py#L173
> > 
> >     h = 2
> >     while h <= n:
> >         hf, ut = h // 2, n // h
> >         for i in range(0, n, h):
> >             for j in range(hf):
> >                 u, v = a[i + j], a[i + j + hf]*w[ut * j]
> >                 a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p
> >         h *= 2

nice find.  i wish i had been able to track that down as quickly,
it took me 2-3 weeks to derive that same algorithm from project
nayuki recursive version, sigh

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD

so, the twin thingy means no need to have a temp var,
because it is a butterfly swap on a[i+j] and a[i+j+hf]

what i did in SV.REMAP, was, create 3 separate "schedules":

* one for i+j
* one for i+j+hf
* one for... where is it... ah: ut+j

> looks like gfpmuladdsub

yes, basically you set up gfpmuladdsub(RT, RA, RB)
and apply *two* SV RM destinations to RT (one for the add, one
for the sub), then hit:

* RTv1 with an i+j schedule
* RTv2 with an i+j+hf schedule
* RB which is the multiply coefficient array w[] with ut*j

have to look again at the fft svp64 example because i am missing
something there but basically that entire triple for-loop,
the while plus i plus j, all "disappear" into one single
application of *THREE* SV.REMAPs on the one single
sv.gfpmuladdsub instruction.

kinda stunningly ridiculously elegant and overkill at the
same time :)

but, joking aside, it means you can do the entire in-place
NTT (or DFT) triple loop in something mad like the equivalent
space of 5 32 bit instructions.

remap_fft_yield.py shows the principle.
Comment 20 Luke Kenneth Casson Leighton 2022-03-05 00:31:16 GMT
(In reply to Jacob Lifshay from comment #18)

> you got them mixed up. binary gf add is xor. prime gf add is integer
> addition followed by subtracting the prime if the sum is too big.

i'm glad you're on this, now you see why it takes me so long, i
constantly get things inverted.
Comment 21 Luke Kenneth Casson Leighton 2022-03-05 00:46:57 GMT
found it

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;h=5ea7fcc89c8102b7a0ca34403fefb9e15f40eb2c;hb=d1b415e4383366cf445fd4ff2db828a612f88099#l140

twin butterfly, of Cooley-Tukey, we want a GF version.  you can
use the DFT example as pretty much cut/paste for everything, almost.

it is *3* input, *2* output, see ffmadds, where the SV RM entry
has 2 RT destinations hence you get 5 actual operands where
there is only space for 4 in the opcode.

if really really really pushed for space it can be made a
3 operand "overwrite" (with 5 offsets) so actually:

   sv.gfpmuladdsub RS, RA, RB

but:

* RS gets a "src1" offset (this is the add)
* RA gets a "src2" offset (this is the sub)
* RB gets a "src3" offset (this is the w coefficient)
* RS and RA *ALSO* get *dest* offsets

where again if pushed src1 offset can be the same as dst1
and src2 can be the same as dest2 offset.

do not worry about the "efficiency" of 3in 2out operations,
it is more important to have RS RA RB all as "in-flight",
using Reservation Stations to avoid the use of a temp register.
Comment 22 Jacob Lifshay 2022-03-05 01:11:23 GMT
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Jacob Lifshay from comment #17)
> > (In reply to Jacob Lifshay from comment #16)
> > > NTT inner loop:
> > > https://github.com/sympy/sympy/blob/68022011872a20c27abd047125e5feb8ce3289cc/sympy/discrete/transforms.py#L173
> > > 
> > >     h = 2
> > >     while h <= n:
> > >         hf, ut = h // 2, n // h
> > >         for i in range(0, n, h):
> > >             for j in range(hf):
> > >                 u, v = a[i + j], a[i + j + hf]*w[ut * j]
> > >                 a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p
> > >         h *= 2
> 
> nice find.  i wish i had been able to track that down as quickly,
> it took me 2-3 weeks to derive that same algorithm from project
> nayuki recursive version, sigh

it was one of the first few things i found on google...

> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD
> 
> so, the twin thingy means no need to have a temp var,
> because it is a butterfly swap on a[i+j] and a[i+j+hf]
> 
> what i did in SV.REMAP, was, create 3 separate "schedules":
> 
> * one for i+j
> * one for i+j+hf
> * one for... where is it... ah: ut+j
> 
> > looks like gfpmuladdsub
> 
> yes, basically you set up gfpmuladdsub(RT, RA, RB)
> and apply *two* SV RM destinations to RT (one for the add, one
> for the sub), then hit:

hmm, I think we should instead defined the scalar gfpmuladdsub to write to two separate registers, like how ldu works, rather than having to use svp64 to disambiguate the two destinations (which requires using svp64 for the instruction to be useful since the scalar instruction writes twice to the same register).

Also, naming wise, that subtraction is term - product, which is the reverse of normal muladd, so i think we should call it gfpmuladdsubr rather than gfpmuladdsub.

I propose:
gfpmuladdsubr RT, RA, RB, RC

Pseudocode:
factor1 = (RA)
factor2 = (RB)
term = (RC)
prime = GFPRIME # the gfp* prime modulus SPR
(RA) = (term - factor1 * factor2) % prime
(RT) = (term + factor1 * factor2) % prime

This allows SVP64 REMAP to have:
* `a[i + j]` be RT and RC
* `a[i + j + hf]` be RA
* `w[ut * j]` be RB

Inner loop body for reference:
> u, v = a[i + j], a[i + j + hf]*w[ut * j]
> a[i + j], a[i + j + hf] = (u + v) % p, (u - v) % p

I picked gfpmuladdsubr overwriting one of the factors (RA) because that makes it more flexible since there's another identical input (RB) if you need your factor to not be overwritten, whereas you can't do that with the term input (RC) cuz there's only one of them, so I had that result go to RT instead of overwriting the term input.
Comment 23 Luke Kenneth Casson Leighton 2022-03-05 01:32:40 GMT
(In reply to Jacob Lifshay from comment #22)

> hmm, I think we should instead defined the scalar gfpmuladdsub to write to
> two separate registers, 

already ahead of you, ffmaddsub scalar targets RT and RT+1
the vector non-REMAP version targets RT+srcoffs,RT+VL+srcoffs
(i am dragging much of this out of longterm memory)

when i said this needs to fit exactly with what has already been
done i really meant it. everything is there, it was 8+ intensive
weeks developing REMAP FFT and we need modifying it like a hole
in the head.

there is more, in terms of Horizontal First Mode which shaped some
of the register name conventions, i just can't recall them immediately
but it allowed 5 regs to fit 3 SHAPES and use scalar temps and 2
vector ops all without changing the REMAPs inside the critical loop.

avoiding changing that by arbitrary register allocation is a high
priority. do spend some time looking closely at the ffmaddsub and
fft remap unit tests and run them.
Comment 24 Luke Kenneth Casson Leighton 2022-03-05 01:38:56 GMT
(In reply to Jacob Lifshay from comment #22)

> I picked gfpmuladdsubr overwriting one of the factors (RA) because that
> makes it more flexible since there's another identical input (RB) if you
> need your factor to not be overwritten, whereas you can't do that with the
> term input (RC) cuz there's only one of them, so I had that result go to RT
> instead of overwriting the term input.

to reiterate: you really, *really* do not want to modify the FFT REMAP
algorithm.  as in, really. don't. do it.

that means working out which reg was the coefficient, which the add, which
the sub, from the ffmaddsub instruction and copying it
verbatim but moving to GF.

yes it really was that much of a deep dive.

there is literally no difference in the schedules
between FFT and NTT so there is absolutely no technical
reason to modify the FFT REMAP system, either.
Comment 25 Luke Kenneth Casson Leighton 2022-03-05 11:11:29 GMT
https://www.doc.ic.ac.uk/~mrh/330tutor/ch04s04.html

jacob when you said "binary polynomials" were you referring to
GF(2^m)? if so, can we try to keep the discussion to GF(2^m)
because otherwise my brain will melt.
Comment 26 Luke Kenneth Casson Leighton 2022-03-05 11:53:53 GMT
(In reply to Jacob Lifshay from comment #3)

> def decode(v, XLEN):
>     """returns the decoded coefficient list in LSB to MSB order,
>         len(retval) == degree + 1"""

so just to be clear: v is the modulus? and the degree is implicitly
calculated from it?

if so one instruction can set the modulus and degree into a (new) SPR
and that in turn frees up a boatload of operand space, certainly
enough to make a 4-operand gfmuladd and also gftwinmuladdsub
and basically have exactly the same operand types and number
of operands as standard arithmetic.
Comment 27 Luke Kenneth Casson Leighton 2022-03-05 12:01:10 GMT
https://libre-soc.org/openpower/isa/svfparith/

Floating Multiply-Add FFT [Single]
A-Form

ffmadds FRT,FRA,FRC,FRB (Rc=0)
ffmadds. FRT,FRA,FRC,FRB (Rc=1)

Pseudo-code:

FRT <- FPMULADD32(FRA, FRC, FRB, 1, 1)
FRS <- FPMULADD32(FRA, FRC, FRB, -1, 1)


note **FRS**

there are *five* operands only *four* listed in
the mnemonic.

FRS is implicitly calculated

* scalar: FRS=FRT+1
* vector: FRS=FRT+svstep+VL
* v/REMAP: FRS=FRT+REMAP(svstep)

the gfmuladdsub needs to be the same form.

anything else requires such extensive interference with SV REMAP
it would be prohibitively disruptive for absolutely no
good reason whatsoever.
Comment 28 Luke Kenneth Casson Leighton 2022-03-06 10:30:53 GMT
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/gf2.py;hb=HEAD

i dropped the 2 functions gf2MUL and gf2divmod into a file to play
with and they work well.  gf_invert however did not, it needs investigating
after i made a couple of changes.


with some though gf2MUL can be seen to be standard long multiplication
in binary (XOR instead of the normal add) with standard modulo as well,
even that is analogous.
Comment 29 Luke Kenneth Casson Leighton 2022-03-06 18:01:57 GMT
here is gcd:

  59 def xgcd(a, b):
  60     """return (g, x, y) such that a*x + b*y = g = gcd(a, b)"""
  61     x0, x1, y0, y1 = 0, 1, 1, 0
  62     while a != 0:
  63         (q, a), b = divmodGF2(b, a), a
  64         y0, y1 = y1, y0 ^ multGF2(q , y1)
  65         x0, x1 = x1, x0 ^ multGF2(q , x1)
  66     return b, x0, y0

it works, but the complexity is high (3 higher functions
in a loop, and 3 return results) so is perfect for doing
in assembler, with divmodgf and multgf as opcodes.

not recommending putting this one in as instruction, nor
gf_invert which uses gcd
Comment 30 Jacob Lifshay 2022-03-07 19:24:38 GMT
since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the multiplicative identity) for all GF(2^n) values `v`.

the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc are all integer div, mul, etc, not GF(n) operations. afaict you confused it again.
Comment 31 Jacob Lifshay 2022-03-07 19:37:24 GMT
(In reply to Jacob Lifshay from comment #30)
> since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the
> multiplicative identity) for all GF(2^n) values `v`.
> 
> the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc
> are all integer div, mul, etc, not GF(n) operations. afaict you confused it
> again.

ahh, gf(2^n) gfdiv can be implemented a gfmul combined with the gcd algorithm where the div and mul and add ops in the gcd algorithm are *not* GF(2^n) ops, they are instead operating on polynomials of GF(2).

you can also use gfmul to compute gfdiv(a, b) in gf(2^n) is gfmul(a, gfpow(b, 2^n - 1)), where gfpow can be implemented using the standard squaring algorithm, giving runtime O(n * gfmul_time).
Comment 32 Jacob Lifshay 2022-03-07 19:45:02 GMT
also, one important point worth considering: for elliptic curve cryptography, it may use GF(2^n) where n is much larger than what fits in a single register -- several hundred. we are likely to have the same issue with GF(p) where p is much larger than 2^64.
Comment 33 Jacob Lifshay 2022-03-07 20:16:09 GMT
I found a paper that references a VLSI-optimized version of gfinv:
https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf
(Algorithm 3, from [7])

GF(2^m) inversion.
G(x) is the reducing polynomial. m is the degree of G(x).
A(x) is the input polynomial.
r[m] is the mth coefficient of R(x).
s[m] is the mth coefficient of S(x).

Algorithm 3:
S(x) := G(x); V(x) := 0;
R(x) := A(x); U(x) := 1;
δ := 0;
for i = 1 to 2*m do
    if r[m] = 0 then
        R(x) := x × R(x); # shift left by 1
        U(x) := x × U(x); # shift left by 1
        δ = δ + 1; # integer increment
    else
        if s[m] = 1 then
            S(x) := S(x) − R(x); # polynomial subtraction aka. xor
            V(x) := V(x) − U(x); # polynomial subtraction aka. xor
        end if
        S(x) := x × S(x); # shift left by 1
        if δ = 0 then
            {R(x) := S(x), S(x) := R(x)}; # swap S(x) and R(x)
            # shift V(x) left by one, then swap V(x) and U(x)
            {U(x) := x × V(x), V(x) := U(x)};
            δ := 1;
        else
            U(x) := U/x; # shift right 1
            δ := δ − 1;
        end if
    end if
end for
output U(x) as the result.
(U(x) = A^−1(x))
Comment 34 Luke Kenneth Casson Leighton 2022-03-07 21:27:44 GMT
(In reply to Jacob Lifshay from comment #31)
> (In reply to Jacob Lifshay from comment #30)
> > since GF(2^n) is a field, invert should be `gfdiv(1, v)` (since 1 is the
> > multiplicative identity) for all GF(2^n) values `v`.
> > 
> > the gcd algorithm is used for gf(p) for prime p, but there the mul, div, etc
> > are all integer div, mul, etc, not GF(n) operations. afaict you confused it
> > again.

almost certainly :)  much of what i'm doing is "hunt-the-algorithm"
(guess-work), then running unit tests to compensate.  run that loop
iteratively, fast enough, and it's indistinguishable from knowing
what the hell you're doing in the first place :)

> you can also use gfmul to compute gfdiv(a, b) in gf(2^n) is gfmul(a,
> gfpow(b, 2^n - 1)), where gfpow can be implemented using the standard
> squaring algorithm, giving runtime O(n * gfmul_time).

i remember this.  it's one of the really important features of GF, that
if you multiply enough times you get back to the original.


(In reply to Jacob Lifshay from comment #33)
> I found a paper that references a VLSI-optimized version of gfinv:
> https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/
> http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf
> (Algorithm 3, from [7])

nice.  i'm glad you can read it!
i dropped a copy here https://ftp.libre-soc.org/ARITH18_Kobayashi.pdf

>             S(x) := S(x) − R(x); # polynomial subtraction aka. xor
>             V(x) := V(x) − U(x); # polynomial subtraction aka. xor

v. helpful, the comments.

> (U(x) = A^−1(x))

what's this one?  A to the power of "-1(x)" ?  oh wait, that's just
a declaration, "U *is* the inverse of A".  got it.

(In reply to Jacob Lifshay from comment #32)
> also, one important point worth considering: for elliptic curve
> cryptography, it may use GF(2^n) where n is much larger than what fits in a
> single register -- several hundred. we are likely to have the same issue
> with GF(p) where p is much larger than 2^64.

yes, i'm not sure what's the best option there.

oh wow, apparently doing an FFT is one option! holy cow
https://github.com/popcornell/pyGF2/blob/master/pyGF2/gf2_mul.py
https://github.com/popcornell/pyGF2/blob/master/pyGF2/gf2_inv.py

*splutter*

hey guess what? we have hardware-assisted FFT.... :)
Comment 35 Jacob Lifshay 2022-03-07 21:34:51 GMT
(In reply to Luke Kenneth Casson Leighton from comment #34)
> (In reply to Jacob Lifshay from comment #33)
> > I found a paper that references a VLSI-optimized version of gfinv:
> > https://scholar.archive.org/work/ktlygagf6jhslhx42gpuwwzc44/access/wayback/
> > http://acsel-lab.com/arithmetic/arith18/papers/ARITH18_Kobayashi.pdf
> > (Algorithm 3, from [7])
> 
> nice.  i'm glad you can read it!
> i dropped a copy here https://ftp.libre-soc.org/ARITH18_Kobayashi.pdf

Ok. since it's archive.org, i expect the url to work well into the future. that's why I picked the archive.org url over some other options.
> 
> >             S(x) := S(x) − R(x); # polynomial subtraction aka. xor
> >             V(x) := V(x) − U(x); # polynomial subtraction aka. xor
> 
> v. helpful, the comments.

yeah, i added the comments, i figured it would help with your algorithm dyslexia :)
Comment 36 Luke Kenneth Casson Leighton 2022-03-07 22:09:21 GMT
[edit: trailing not leading]

(In reply to Jacob Lifshay from comment #35)

> yeah, i added the comments, i figured it would help with your algorithm
> dyslexia :)

yeah very much, thank you, was really straightforward.
amazingly this seems to get the right answer:

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0a9f45f2615f35902c4783bcdd07ab6151db841d

any change to "x" as long as it is less than 0x11b seems to
produce the correct value for y1 (bottom of the test), but
i had to hunt through a lot of random values of y to find one
that produced the right answer (again, independent of changes
to x)

no idea why, but hey

i guessed that "reducing polynomial" means the global "polyred",
and that from the for-loop "1 to 2*m" that this meant that m
is the degree, which then allowed me to use "if s & mask1"
which is testing the MSB

looking closely at the algorithm, now, i think we can use
"count-trailing-1s" to skip ahead a number of loops:

    i = 1
    while i < 2*degree+1:
        jump = count_trailing_1s(r)
        if jump == 0:
            ...
            ...
        else:
            r <<= jump            # jump left
            u <<= jump            # jump left
            j += jump

that would save on iterations, similar to how you can jump ahead
in long-division by detecting runs of zeros.
Comment 37 Jacob Lifshay 2022-03-08 04:28:10 GMT
this stuff was already addressed on irc, but I'm restating here to make this bug's comments more useful.

(In reply to Luke Kenneth Casson Leighton from comment #36)
> i guessed that "reducing polynomial" means the global "polyred",

no, it means polyred but with the 2^degree bit re-added. lkcl already fixed this.

> and that from the for-loop "1 to 2*m" that this meant that m
> is the degree, which then allowed me to use "if s & mask1"
> which is testing the MSB

yes, that's correct, though i probably would have spelled out the arithmetic rather than relying on globals.

> looking closely at the algorithm, now, i think we can use
> "count-trailing-1s" to skip ahead a number of loops:
> ...
> that would save on iterations, similar to how you can jump ahead
> in long-division by detecting runs of zeros.

we could...but i'm inclined to instead have a pipeline... 1 stage per iteration. then we could merge stages together since the mux & xor should be fast enough. We could probably get 5-6 stages per merged stage, giving us 3-4 cycles for GF(2^8), 6-7 cycles for GF(2^16), 11-13 cycles for GF(2^32), and 22-26 cycles for GF(2^64)
Comment 38 Luke Kenneth Casson Leighton 2022-03-08 04:56:47 GMT
(In reply to Jacob Lifshay from comment #37)

> yes, that's correct, though i probably would have spelled out the arithmetic
> rather than relying on globals.

that came from the programming style used by the gfMUL function
which i originally found online, but if you think about it, it
kinda reflects the concept we had in mind to have an SPR for the modulo.

i've often found that converting python code to classes, so as to
avoid considering the *module* as the encapsulation of the concept,
to result in unnecessary complexity and loss of clarity.

given that the SPR for the modulo would be fully global in nature,
polyred as a global funnily enough more accurately represents the
real-world scenario as to how the assembly instructions would be
used in the ISA, giving us a "feel" for how things would go.
Comment 39 Jacob Lifshay 2022-03-08 05:16:00 GMT
after thinking about a bit, I think gfmul's algorithm can be partially merged with the gfinv algorithm I posted in comment 33, giving us a gfdiv algorithm that takes less time than gfinv followed by gfmul.

Let's assume we're calculating gfdiv(N(x), A(x)) (A in the denominator in order to match the gfinv algorithm).

This is because the U(x) and V(x) polynomials are never used as inputs in the gfinv algorithm, and the algorithm has some nice additional properties, so we can change them to clmul(N(x), U(x)) and clmul(N(x), V(x)) throughout the algorithm, that way the algorithm will return clmul(N(x), gfinv(A(x))) which will be denoted by T(x). T(x), when reduced by G(x) (the reducing polynomial) produces the result of gfdiv(N(x), A(x)).
Comment 40 Luke Kenneth Casson Leighton 2022-03-08 05:36:28 GMT
(In reply to Jacob Lifshay from comment #39)
> after thinking about a bit, I think gfmul's algorithm can be partially
> merged with the gfinv algorithm I posted in comment 33, giving us a gfdiv
> algorithm that takes less time than gfinv followed by gfmul.

neat. feel free, but keep the originals around (in the same file, gf2.py)
so that the evidence-trail (for review etc.) is around, and when it comes
to unit-testing, the versions of the functions that have attribution /
explanations can be used.
Comment 41 Jacob Lifshay 2022-03-08 06:00:39 GMT
though, do we really need gfdiv? it's a lot of extra complexity and i can't think of any algorithms that need it...if there are algorithms, they can easily run gfmul on the result of gfinv, giving the equivalent of gfdiv.

In any case, here's the list of binary gf ops I think we should implement:

* clmul, clmulh (for gf operations that are > 64-bit, and other stuff)
* gfbmul RT, RA, RB
* gfbmuli RT, RA, imm # discard if we don't have enough encoding space
* gfbmadd RT, RA, RB, RC
* gfbinv rt, ra
    input 0 gives result 0 even though that is division by zero,
    that's what's needed for AES.

gfbadd is covered by the existing xor instructions.
gfbmaddsubr is not needed, since sub is identical to add, this is just two gfbmadd instructions.

If that all sounds good, i can create a new tracking bug for binary gf instructions and start working on implementing them.
Comment 42 Luke Kenneth Casson Leighton 2022-03-08 06:53:38 GMT
(In reply to Jacob Lifshay from comment #41)
> though, do we really need gfdiv? 

i'm sure i've seen modulo (remainder) used somewhere.
if producing that, might as well have div.

> it's a lot of extra complexity 

how? https://libre-soc.org/openpower/sv/gf2.py/
that's what... 7 lines?

> In any case, here's the list of binary gf ops I think we should implement:
> 
> * clmul, clmulh (for gf operations that are > 64-bit, and other stuff)

see wiki, clmulr is missing.

> * gfbmul RT, RA, RB

ack. is it worth doing a gfbmulh? looking at
multGF2 it looks like it could return (res >> deg) & mask2
just as easily to give the top half.

> * gfbmuli RT, RA, imm # discard if we don't have enough encoding space

no chance. snowball, hell etc. typically imm
takes an entire major opcode. mulli addi they
have dedicated major opcodes. makes no sense
except for less than degree 16, with gf jumping
all over the place.

if constructing consts then addi and addis can be used.
on presenting to OPF ISA WG they really want a gfbmuli
i see no reason why not, but it does take an entire
major op.

> * gfbmadd RT, RA, RB, RC

ack. remember it has to match madd and fmadd for REMAP
purposes.  this is IMPORTANT.

> * gfbinv rt, ra
>     input 0 gives result 0 even though that is division by zero,
>     that's what's needed for AES.

ok great.

> gfbadd is covered by the existing xor instructions.

ack.

> gfbmaddsubr is not needed, since sub is identical to add, this is just two
> gfbmadd instructions.

ack.

gftwinmuladd is missing.

btw see wiki page. please do read and
refer to it, i get the impression you're not reading it
regularly.

> If that all sounds good, i can create a new tracking bug for binary gf
> instructions and start working on implementing them.

the basics (lowlevel modules) and bugreport, yes.

btw do not consider pipeline designs to be "the peak epitomy
of absolute performance", it is misleading.

multiple simpler FSMs with multiple RSes can be more efficient
and take advantage of early-out and jump-ahead opportunities
that pipelines cannot.  this makes them *less* efficient rather
than more, particularly for long-running ops.

[yes a FSM can be done which processes multiple chained combinatorial
stages]

RSes need to farm-in and fan-out to pipelines (which have
to be BIG muxes), where FSMs can wire directly. each RS has
its own FSM, each FSM stores its result in the RS, no
massive fan-in/out.

bottom line you need to think of the context in which the
pipeline is used.
Comment 43 Jacob Lifshay 2022-03-08 08:02:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jacob Lifshay from comment #41)
> > though, do we really need gfdiv? 
> 
> i'm sure i've seen modulo (remainder) used somewhere.

yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial remainder is used. that is *not* a gfbrem.
if we were to have a gfbrem instruction, it would be kinda useless, always returning 0, since gfb is a field, meaning that, like complex numbers, every non-zero value has a multiplicative inverse. therefore, solving the equation for a remainder:

n = q * d + r
q = n / d # always produces exact results in any field
n = (n / d) * d + r
n = n + r
0 = r

> if producing that, might as well have div.
> 
> > it's a lot of extra complexity

well, if we're using a pipeline for gfbinv (which we totally want to if we want AES to be usably fast without needing 7 zillion gfbinv FSMs), it's much harder to tack another several stages on the end of a pipeline (unless we want to not share it with gfmul), unless we want to do microcoding.

> how? https://libre-soc.org/openpower/sv/gf2.py/
> that's what... 7 lines?
> 
> > In any case, here's the list of binary gf ops I think we should implement:
> > 
> > * clmul, clmulh (for gf operations that are > 64-bit, and other stuff)
> 
> see wiki, clmulr is missing.

ah, ok. yeah, we can add clmulr.
> 
> > * gfbmul RT, RA, RB
> 
> ack. is it worth doing a gfbmulh? looking at
> multGF2 it looks like it could return (res >> deg) & mask2
> just as easily to give the top half.

there is no top half. gfbmul returns the entire answer. if we tried to make a gfbmulh, it would always return 0 unless the reducing polynomial had a degree of  more than 64 (64 is the max in the current design).
> 
> > * gfbmuli RT, RA, imm # discard if we don't have enough encoding space
> 
> no chance. snowball, hell etc. typically imm
> takes an entire major opcode. mulli addi they
> have dedicated major opcodes. makes no sense
> except for less than degree 16, with gf jumping
> all over the place.
> 
> if constructing consts then addi and addis can be used.
> on presenting to OPF ISA WG they really want a gfbmuli
> i see no reason why not, but it does take an entire
> major op.
> 
> > * gfbmadd RT, RA, RB, RC
> 
> ack. remember it has to match madd and fmadd for REMAP
> purposes.  this is IMPORTANT.
> 
> > * gfbinv rt, ra
> >     input 0 gives result 0 even though that is division by zero,
> >     that's what's needed for AES.
> 
> ok great.
> 
> > gfbadd is covered by the existing xor instructions.
> 
> ack.
> 
> > gfbmaddsubr is not needed, since sub is identical to add, this is just two
> > gfbmadd instructions.
> 
> ack.
> 
> gftwinmuladd is missing.
> 
> btw see wiki page. please do read and
> refer to it, i get the impression you're not reading it
> regularly.

yeah, because i have no easy way of figuring out if it has changed (automatic notifications would be nice!), and i don't like rereading stuff i've already read 5 times in the hope that you might have changed it meanwhile... git log is nearly useless since nearly everything has no useful commit message.
> 
> > If that all sounds good, i can create a new tracking bug for binary gf
> > instructions and start working on implementing them.
> 
> the basics (lowlevel modules) and bugreport, yes.
> 
> btw do not consider pipeline designs to be "the peak epitomy
> of absolute performance", it is misleading.
> 
> multiple simpler FSMs with multiple RSes can be more efficient
> and take advantage of early-out and jump-ahead opportunities
> that pipelines cannot.  this makes them *less* efficient rather
> than more, particularly for long-running ops.

that depends...if the pipeline allows us to specialize stages to specific parts of the algorithm it can be faster (e.g. fpmul with one stage for unpacking/denormal-shifting, one stage for partial product computation and first part of carry-save adders, another stage of carry-save adders, final stage of carry-save adders and final carry-propagating add, stage rounding&packing&denormal handling) since each stage only needs to implement the parts it handles, making the stages run faster. if it were instead several totally separate FSMs for the same throughput, each FSM would need the logic for all stages of the algorithm, making the data path much more complex and possibly taking more time due to the complexity, reducing max clock speed.
> 
> [yes a FSM can be done which processes multiple chained combinatorial
> stages]
> 
> RSes need to farm-in and fan-out to pipelines (which have
> to be BIG muxes), where FSMs can wire directly. each RS has
> its own FSM, each FSM stores its result in the RS, no
> massive fan-in/out.

that's kinda deceptive, since the muxes with the massive fan-out/in are now in the scheduler logic and the busses for carrying intermediate register values around (i temporarily forgot the name) rather than in the pipeline fan-in fan-out.

for example, lets say the algorithm takes 10 cycles, and that every pipe/FSM needs 2 extra FUs in order to stay 100% utilized, and that there are 20 other FUs in the whole cpu:
pipeline:
12 FUs -> pipeline

12 -> 1 mux on input
1 -> 12 mux on output
(getting big)
something -> 32 mux to fill all cpu FUs from reg reads/operand forwarding
(pretty big)


10 FSMs (for same throughput):
3 FUs -> 10-step FSM
3 FUs -> 10-step FSM
3 FUs -> 10-step FSM
...
3 FUs -> 10-step FSM

30 FUs

10x 3 -> 1 muxes on input
10x 1 -> 3 muxes on output (meh, tiny, who cares)
something bigger -> 50 mux to fill all cpu FUs from reg reads/operand forwarding
(woah, huge!! aah, run!)

> 
> bottom line you need to think of the context in which the
> pipeline is used.
Comment 44 Luke Kenneth Casson Leighton 2022-03-08 10:47:03 GMT
(In reply to Jacob Lifshay from comment #43)
> (In reply to Luke Kenneth Casson Leighton from comment #42)
> > (In reply to Jacob Lifshay from comment #41)
> > > though, do we really need gfdiv? 
> > 
> > i'm sure i've seen modulo (remainder) used somewhere.
> 
> yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial
> remainder is used. that is *not* a gfbrem.

bear in mind, i don't understand the difference, so i am trusting you to
ask and answer this question: why then have several people published
divmodGF2 algorithms in their GF2 libraries?

they must be used for something.

> well, if we're using a pipeline for gfbinv (which we totally want to if we
> want AES to be usably fast without needing 7 zillion gfbinv FSMs),

you've fundamentally misunderstood something about OoO design here which
i will cover in a mailing list post.

> there is no top half. gfbmul returns the entire answer. if we tried to make
> a gfbmulh, it would always return 0 unless the reducing polynomial had a
> degree of  more than 64 (64 is the max in the current design).

by that logic clmulh should not exist.  there is something missing
here which needs to be explained and clarified.

> yeah, because i have no easy way of figuring out if it has changed
> (automatic notifications would be nice!), and i don't like rereading stuff
> i've already read 5 times in the hope that you might have changed it
> meanwhile... git log is nearly useless since nearly everything has no useful
> commit message.

i re-read the same page sometimes 20 to 30 times a day. this is just good
practice because each re-read results in different insights.

i never consider it "beneath" me to re-read something that many times.

>  if it were instead
> several totally separate FSMs for the same throughput, each FSM would need
> the logic for all stages of the algorithm, making the data path much more
> complex and possibly taking more time due to the complexity, reducing max
> clock speed.

it is perfectly possible for a FSM design to itself farm-in and fan-out
to shared pipelines.

> > 
> > [yes a FSM can be done which processes multiple chained combinatorial
> > stages]
> > 
> > RSes need to farm-in and fan-out to pipelines (which have
> > to be BIG muxes), where FSMs can wire directly. each RS has
> > its own FSM, each FSM stores its result in the RS, no
> > massive fan-in/out.
> 
> that's kinda deceptive, since the muxes with the massive fan-out/in are now
> in the scheduler logic and the busses for carrying intermediate register
> values around (i temporarily forgot the name) rather than in the pipeline
> fan-in fan-out.

i'll answer this on-list. the absolute inviolate rule is that you absolutely
cannot have an untracked in-flight computation.  every register involved in
every computation has to be tracked, and every RS with an outstanding
computation has to be frozen until the result hits a regfile.
Comment 45 Luke Kenneth Casson Leighton 2022-03-08 16:18:48 GMT
as promised, a write-up on RSes and the balance between FSMs and pipelines
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-March/004528.html
https://libre-soc.org/3d_gpu/pipeline_vs_fsms.jpg

frankly i'm astounded that i can hold this in my head and follow it easily,
yet i get confused on even simple algorithms and can't distinguish GF2
from GF(2^m) or follow why clmulh exists but multGF2 would always return
zero for the HI bits :)
Comment 46 Jacob Lifshay 2022-03-08 18:35:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #44)
> (In reply to Jacob Lifshay from comment #43)
> > (In reply to Luke Kenneth Casson Leighton from comment #42)
> > > (In reply to Jacob Lifshay from comment #41)
> > > > though, do we really need gfdiv? 
> > > 
> > > i'm sure i've seen modulo (remainder) used somewhere.
> > 
> > yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial
> > remainder is used. that is *not* a gfbrem.
> 
> bear in mind, i don't understand the difference, so i am trusting you to
> ask and answer this question: why then have several people published
> divmodGF2 algorithms in their GF2 libraries?

think of it like this:

class GFp:
    """Galois Field GF(p) with the number of elements that is a prime `p`.
        these are isomorphic to the integers mod `p`, though it can be defined differently
        (e.g. by using `GFPrimeToTheK` with `k == 1`).
    """
    def __init__(self, p, v=None):
        if v is None:
            v = 0
        self.v = v % p
        self.p = p

    ...
    # no % or divmod here, since % would always return 0

class Polynomial(Generic[T]):
    """A polynomial with coefficients of type T.

        `coefficients[i]` is the coefficient for `x ** i`
    """
    def __init__(self, coefficients=()):
        self.coefficients = list(coefficients)
    ...
    def __divmod__(self, rhs):
        """Polynomial divmod makes sense for all polynomials, even if it doesn't make
            sense to divmod a coefficient.
        """
        ...
    @property
    def degree(self):
        retval = 0
        for i, c in enumerate(self.coefficients):
            if c != 0:
                retval = i
        return retval

class GFpPolynomial(Polynomial[GFp]):
    """Polynomial with coefficients that are GFp with matching p"""
    def __init__(self, coefficients, p):
        super().__init__(coefficients)
        self.p = p
        for i in self.coefficients:
            assert isinstance(i, GFp)
            assert i.p == p

class GFPrimeToTheK:
    """aka. GF(q) where q = p ** k. Galois Field with number of elements
        that is a prime `p` raised to the power `k` where `k`
        is a positive integer. This isn't a polynomial (though mathematicians
        kinda mix their abstraction layers together whenever convenient),
        but is usually defined as the polynomial remainders you get from
        dividing `GFpPolynomial`s by a specific reducing `GFpPolynomial`
        `reducing_polynomial` of degree k.
        `p`, `k`, and the `reducing_polynomial` define a specific Galois Field,
        with the addition of the remainder polynomial, you get a specific
        Galois Field element.
        Note that the general shape of a Galois Field doesn't depend on what
        specific reducing polynomial you've selected (they are all isomorphic
        to each other for a specific `p` and `k`), so when people don't care,
        they just leave the reducing polynomial unstated and up to the reader
        to pick.
    """
    def __init__(self, p, k, reducing_polynomial, remainder):
        self.p = p
        self.k = k
        self.reducing_polynomial = reducing_polynomial
        assert isinstance(reducing_polynomial, GFpPolynomial)
        assert isinstance(remainder, GFpPolynomial)
        assert p == reducing_polynomial.p
        assert k == reducing_polynomial.degree
        assert reducing_polynomial.is_irreducable()
        self.remainder = remainder % reducing_polynomial

    ...
    # no % or divmod here, since % would always return 0
> 
> they must be used for something.

well, the top result in google for divmod "gf2" for me is:
(changed to link to source code, yes i tried -- https doesn't seem to work)
http://www.diegm.uniud.it/~bernardi/Software/Cplusplus/galoislib_doc/gf2_8H-source.html#l00201

00201   void divmod(GF<FieldSize> a, GF<FieldSize> &quot, GF<FieldSize> &rem)
00202   {
00203     quot = a*inv(*this);
00204     rem  = GF<FieldSize>::zero();
00205   }

As you can see, the remainder is always zero.

I'd guess divmod is used here for API compatibility with Rings, where a remainder is actually useful, such as the Ring of integers.
Comment 47 Jacob Lifshay 2022-03-08 18:45:32 GMT
(In reply to Luke Kenneth Casson Leighton from comment #44)
> > there is no top half. gfbmul returns the entire answer. if we tried to make
> > a gfbmulh, it would always return 0 unless the reducing polynomial had a
> > degree of  more than 64 (64 is the max in the current design).
> 
> by that logic clmulh should not exist.  there is something missing
> here which needs to be explained and clarified.

clmul is polynomial multiplication where the coefficients are GF(2), not GF(2^k) multiplication.

GF(2^k) multiplication has the additional step of calculating the polynomial remainder of dividing by the reducing polynomial -- essentially taking what would be the top half and merging it into the bottom half, leaving the top half now empty. This is similar to why the integers mod 16 have zeros everywhere except the least significant hex digit, because the mod 16 operation removed them, leaving zeros.
Comment 48 Luke Kenneth Casson Leighton 2022-03-08 19:25:02 GMT
(In reply to Jacob Lifshay from comment #46)

> I'd guess divmod is used here for API compatibility with Rings, where a
> remainder is actually useful, such as the Ring of integers.

rright.  ok.  yes, i have seen code which provides divmod then later
a separate class embeds the one providing divmod into a GFPolynomial
one.

(In reply to Jacob Lifshay from comment #47)

> clmul is polynomial multiplication where the coefficients are GF(2), not
> GF(2^k) multiplication.

the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication
with the reducing polynomial set to 2^k"

does that sound about right? because if it is, it is basically a way
to "switch off" the "dividing by the reducing polynomial"

> GF(2^k) multiplication has the additional step of calculating the polynomial
> remainder of dividing by the reducing polynomial -- essentially taking what
> would be the top half and merging it into the bottom half, leaving the top
> half now empty. This is similar to why the integers mod 16 have zeros
> everywhere except the least significant hex digit, because the mod 16
> operation removed them, leaving zeros.

ok. vaguely starting to make sense.

if the reducing polynomial is set to 2^XLEN (switched off in effect)
does divmodGF2 and multGF2 effectively become "ring of integer" thingies?

i would like to see both included, in effect, preferably without having
to have separate HDL?
Comment 49 Jacob Lifshay 2022-03-08 20:09:29 GMT
(In reply to Luke Kenneth Casson Leighton from comment #48)
> (In reply to Jacob Lifshay from comment #46)
> 
> > I'd guess divmod is used here for API compatibility with Rings, where a
> > remainder is actually useful, such as the Ring of integers.
> 
> rright.  ok.  yes, i have seen code which provides divmod then later
> a separate class embeds the one providing divmod into a GFPolynomial
> one.
> 
> (In reply to Jacob Lifshay from comment #47)
> 
> > clmul is polynomial multiplication where the coefficients are GF(2), not
> > GF(2^k) multiplication.
> 
> the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication
> with the reducing polynomial set to 2^k"

that's kinda like saying "a prime number p, but divisible by 32" -- it doesn't really make sense...anyway...if you took the GF(2^k) multiplication algorithm and replaced the reducing polynomial with 2^k where k == XLEN, then, yes, that is the result of clmul, since clmul's result is truncated to XLEN bits. clmulh and clmulr would require a larger reducing polynomial to avoid truncating the product.

> does that sound about right? because if it is, it is basically a way
> to "switch off" the "dividing by the reducing polynomial"

yes. that said, clmul should be its own instruction, otherwise you would usually end up doing many cycles of extra work for the polynomial division by the reducing polynomial. multGF2 takes an iterative polynomial mul (iterative version of clmul) and merges it with a polynomial remainder operation, making it hard to run fast because remainders are serial operations (with most algorithms). clmul can and probably should be a combinatorial (with optional pipelining) multiply, like a Dadda multiplier -- much faster than remainder. In fact, we could modify our multiplier to also do clmul, just by masking out all carries.

> > GF(2^k) multiplication has the additional step of calculating the polynomial
> > remainder of dividing by the reducing polynomial -- essentially taking what
> > would be the top half and merging it into the bottom half, leaving the top
> > half now empty. This is similar to why the integers mod 16 have zeros
> > everywhere except the least significant hex digit, because the mod 16
> > operation removed them, leaving zeros.
> 
> ok. vaguely starting to make sense.
> 
> if the reducing polynomial is set to 2^XLEN (switched off in effect)
> does divmodGF2 and multGF2 effectively become "ring of integer" thingies?

it would be a ring of truncated polynomials with GF(2) coefficients. A Ring is any math thing that has addition, multiplication, and works kinda like the integers in that it has an additive identity 0, and a multiplicative identity 1, and iirc associativity and commutativity. Rings can be made of things that are totally not integers, like here, where they are truncated polynomials.

divmodGF2 was always a polynomial divmod with GF(2) coefficients, it wasn't ever operating on GF(2^n) things, other than being used to produce the polynomial inside a GF(2^n). It is totally unaffected by changing reducing polynomials, since those only matter for GF(2^n) operations.
> 
> i would like to see both included, in effect, preferably without having
> to have separate HDL?

we could do that...clmul would be really slow though.
Comment 50 Luke Kenneth Casson Leighton 2022-03-08 20:57:56 GMT
(In reply to Jacob Lifshay from comment #49)

> > the RV xbitmanip spec (see wiki) describes clmul as "GF(2^k) multiplication
> > with the reducing polynomial set to 2^k"
> 
> that's kinda like saying "a prime number p, but divisible by 32" -- it
> doesn't really make sense...anyway...

:)

> if you took the GF(2^k) multiplication
> algorithm and replaced the reducing polynomial with 2^k where k == XLEN,
> then, yes, that is the result of clmul, since clmul's result is truncated to
> XLEN bits. clmulh and clmulr would require a larger reducing polynomial to
> avoid truncating the product.

okaay.
 
> > does that sound about right? because if it is, it is basically a way
> > to "switch off" the "dividing by the reducing polynomial"
> 
> yes. that said, clmul should be its own instruction,

yes absolutely, messing about with setting the polynomial to 2^XLEN
before-hand would be a pain.

> operations (with most algorithms). clmul can and probably should be a
> combinatorial (with optional pipelining) multiply, like a Dadda multiplier
> -- much faster than remainder. In fact, we could modify our multiplier to
> also do clmul, just by masking out all carries.

oooOoo.
 
> it would be a ring of truncated polynomials with GF(2) coefficients. A Ring
> is any math thing that has addition, multiplication, and works kinda like
> the integers in that it has an additive identity 0, and a multiplicative
> identity 1, and iirc associativity and commutativity. Rings can be made of
> things that are totally not integers, like here, where they are truncated
> polynomials.

surprisingly, i get the Ring thing. i used it for a cryptographic algorithm
in 2003.  numbers that repeat, and are unique within their set, and
by a beautiful coincidence happen to have consistent arithmetic.
 
> divmodGF2 was always a polynomial divmod with GF(2) coefficients, it wasn't
> ever operating on GF(2^n) things, other than being used to produce the
> polynomial inside a GF(2^n). It is totally unaffected by changing reducing
> polynomials, since those only matter for GF(2^n) operations.

okaaay.

> > 
> > i would like to see both included, in effect, preferably without having
> > to have separate HDL?
> 
> we could do that...clmul would be really slow though.

best not then.

basically what i'm trying to advocate is to keep gfdiv and gfmod but not
name them... what did you use... a gfb- prefix?

can you update comment #0 so i have a full list? i'll then double-check
the wiki page for opcode allocations.
Comment 51 Jacob Lifshay 2022-03-08 21:06:01 GMT
[edit: delete unnecessary context]

(In reply to Luke Kenneth Casson Leighton from comment #50)
> (In reply to Jacob Lifshay from comment #49)
> > > 
> > > i would like to see both included, in effect, preferably without having
> > > to have separate HDL?
> > 
> > we could do that...clmul would be really slow though.
> 
> best not then.
> 
> basically what i'm trying to advocate is to keep gfdiv and gfmod but not
> name them... what did you use... a gfb- prefix?

we'd probably want to name it cldiv and clrem, since that way it matches with clmul. they would be polynomial divmod, where the polynomial coefficients are GF(2).

gfb implies GF(2^n), which depends on the reducing polynomial in order to make sense.
> 
> can you update comment #0 so i have a full list? i'll then double-check
> the wiki page for opcode allocations.

yeah. i'll leave the gfp (prime) list un-elaborated for now.
Comment 52 Jacob Lifshay 2022-03-08 21:09:33 GMT
(In reply to Luke Kenneth Casson Leighton from comment #50)
> > it would be a ring of truncated polynomials with GF(2) coefficients. A Ring
> > is any math thing that has addition, multiplication, and works kinda like
> > the integers in that it has an additive identity 0, and a multiplicative
> > identity 1, and iirc associativity and commutativity. Rings can be made of
> > things that are totally not integers, like here, where they are truncated
> > polynomials.
> 
> surprisingly, i get the Ring thing. i used it for a cryptographic algorithm
> in 2003.  numbers that repeat, and are unique within their set, and
> by a beautiful coincidence happen to have consistent arithmetic.

repetition isn't needed for a Ring, the consistent arithmetic part is the important part.
Comment 53 Jacob Lifshay 2022-03-08 21:22:28 GMT
filled out comment 0
Comment 54 Luke Kenneth Casson Leighton 2022-03-09 08:33:13 GMT
(In reply to Jacob Lifshay from comment #51)
 
> we'd probably want to name it cldiv and clrem, since that way it matches
> with clmul. they would be polynomial divmod, where the polynomial
> coefficients are GF(2).

that makes sense. like it.
 
> gfb implies GF(2^n), which depends on the reducing polynomial in order to
> make sense.
> > 
> > can you update comment #0 so i have a full list? i'll then double-check
> > the wiki page for opcode allocations.
> 
> yeah. i'll leave the gfp (prime) list un-elaborated for now.

do include them i need the names and qty.  and basic descriptions.

urr gfbmadd is 4op which are in short supply.  have to lose Rc=1
to get them to fit.
Comment 55 Jacob Lifshay 2022-03-09 08:48:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #54)
> (In reply to Jacob Lifshay from comment #51)
> > yeah. i'll leave the gfp (prime) list un-elaborated for now.
> 
> do include them i need the names and qty.

done.

> and basic descriptions.

uuh...should be pretty obvious?? i'll add them in the morning if it isn't.

> 
> urr gfbmadd is 4op which are in short supply.  have to lose Rc=1
> to get them to fit.

Rc isn't super useful for most GF operations, eq is the only CR bit that could occasionally be useful.
Comment 56 Luke Kenneth Casson Leighton 2022-03-09 09:18:41 GMT
(In reply to Jacob Lifshay from comment #55)

> uuh...should be pretty obvious?? 

we've already established that to me it isn't :)
am getting there but it would help to doublecheck.
GF2 and 2^m are currently mixed in with each other
in the wiki page

> i'll add them in the morning if it isn't.

star. eek. 4am for you. tsk tsk :)

> Rc isn't super useful for most GF operations, eq is the only CR bit that
> could occasionally be useful.

true but madd seems not to have it for similar space reasons iirc.
or is it OE=1 cant recall immediately
Comment 57 Jacob Lifshay 2022-03-09 20:32:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Jacob Lifshay from comment #55)
> 
> > uuh...should be pretty obvious?? 
> 
> we've already established that to me it isn't :)

i meant the stuff needed to allocate instruction encodings (e.g. gfpmadd uses 4 registers). I can fill in the pseudo-code on the wiki.

> am getting there but it would help to doublecheck.

yeah, I'll check, though it may have to wait till after the meeting today

> GF2 and 2^m are currently mixed in with each other
> in the wiki page
> 
> > i'll add them in the morning if it isn't.
> 
> star. eek. 4am for you. tsk tsk :)

it was actually around 1-2am... DST starts on the 13th here.
> 
> > Rc isn't super useful for most GF operations, eq is the only CR bit that
> > could occasionally be useful.
> 
> true but madd seems not to have it for similar space reasons iirc.
> or is it OE=1 cant recall immediately

icr, but the only madd instructions that might need OE=1 are the cl* instructions, all the GF(2^n) and GF(p) madd instructions can't overflow.
Comment 58 Jacob Lifshay 2022-03-10 02:24:55 GMT
I added all the proposed ops to the wiki, I still need to add pseudo-code for several of them.
Comment 59 Jacob Lifshay 2022-03-10 02:29:25 GMT
(In reply to Jacob Lifshay from comment #58)
> I added all the proposed ops to the wiki, I still need to add pseudo-code
> for several of them.

The encodings still need to be decided, should I let you (lkcl) figure that out, or should I figure them out myself?
Comment 60 Luke Kenneth Casson Leighton 2022-03-10 02:59:34 GMT
(In reply to Jacob Lifshay from comment #58)
> I added all the proposed ops to the wiki, I still need to add pseudo-code
> for several of them.

-# Galois Field 2^M
+# Instructions for Carry-less Operations aka. Polynomials with coefficients in `GF(2)`

clmul(rh) were already added.

prime-polynomial ones there's not enough room unless moving ternary
to another major opcode (sigh), the ??madd ones take up vast amounts
of space.

another option, the entirety of gf* and cl* moves to its own major opcode.

(In reply to Jacob Lifshay from comment #59)

> The encodings still need to be decided, should I let you (lkcl) figure that
> out, or should I figure them out myself?

best let me do it (tomorrow) because i know the "rules" i applied to them
and have the working-set in my head

yes go for it on the pseudocode, do move the clmul(rh) ones to the
(existing) carryless section, do keep an eye on that, read the
whole page several times (now you know why i said "read it
several times", because it's not just about the "changes", you
have to know *everything* in the page, particularly if you want
to correctly design the encodings.  yes, it's a frickin lot of work!)

yes drop pseudocode in, if it's missing.
Comment 61 Jacob Lifshay 2022-03-10 03:18:08 GMT
(In reply to Luke Kenneth Casson Leighton from comment #60)
> (In reply to Jacob Lifshay from comment #58)
> > I added all the proposed ops to the wiki, I still need to add pseudo-code
> > for several of them.
> 
> -# Galois Field 2^M
> +# Instructions for Carry-less Operations aka. Polynomials with coefficients
> in `GF(2)`
> 
> clmul(rh) were already added.
> 
> prime-polynomial ones there's not enough room unless moving ternary
> to another major opcode (sigh), the ??madd ones take up vast amounts
> of space.

we could change the madd ops to be 3-arg rather than 4:

if we did that, we'd want to have additional separate variants that allow you to overwrite the term input rather than a factor input (inspired by x86's fma ops).

# was gfpmsub
gfpmsubf RT, RA, RB # overwrites factor

(RT) = int_to_gfp((RT) * (RA) - (RB), GFPRIME)

gfpmsubt RT, RA, RB # overwrites term

(RT) = int_to_gfp((RA) * (RB) - (RT), GFPRIME)

# was gfpmsubr
gfpmsubrf RT, RA, RB # overwrites factor

(RT) = int_to_gfp((RB) - (RT) * (RA), GFPRIME)

gfpmsubrt RT, RA, RB # overwrites term

(RT) = int_to_gfp((RT) - (RA) * (RB), GFPRIME)

> 
> another option, the entirety of gf* and cl* moves to its own major opcode.
> 
> (In reply to Jacob Lifshay from comment #59)
> 
> > The encodings still need to be decided, should I let you (lkcl) figure that
> > out, or should I figure them out myself?
> 
> best let me do it (tomorrow) because i know the "rules" i applied to them
> and have the working-set in my head
> 
> yes go for it on the pseudocode, do move the clmul(rh) ones to the
> (existing) carryless section,

I intentionally started new sections, because i think the grouping/ordering makes more sense. I didn't delete the old sections yet cuz I figured we wanted to keep whatever info was there for reference until the new sections explain everything.

> yes drop pseudocode in, if it's missing.

I'm likely to also edit the existing/new pseudo-code to change naming and stuff to make it consistent with the instructions.
Comment 62 Jacob Lifshay 2022-03-10 03:23:53 GMT
(In reply to Jacob Lifshay from comment #61)
> > yes go for it on the pseudocode, do move the clmul(rh) ones to the
> > (existing) carryless section,
> 
> I intentionally started new sections, because i think the grouping/ordering
> makes more sense. I didn't delete the old sections yet cuz I figured we
> wanted to keep whatever info was there for reference until the new sections
> explain everything.

I want to rewrite the explanations to be more mathematically and conceptually consistent, since the current explanations confuse/mix a lot of the math concepts.
Comment 63 Luke Kenneth Casson Leighton 2022-03-10 03:48:51 GMT
(In reply to Jacob Lifshay from comment #61)
> (In reply to Luke Kenneth Casson Leighton from comment #60)
> > (In reply to Jacob Lifshay from comment #58)
> > > I added all the proposed ops to the wiki, I still need to add pseudo-code
> > > for several of them.
> > 
> > -# Galois Field 2^M
> > +# Instructions for Carry-less Operations aka. Polynomials with coefficients
> > in `GF(2)`
> > 
> > clmul(rh) were already added.
> > 
> > prime-polynomial ones there's not enough room unless moving ternary
> > to another major opcode (sigh), the ??madd ones take up vast amounts
> > of space.
> 
> we could change the madd ops to be 3-arg rather than 4:

i already explained clearly and non-negotiably the damaging
consequences and cost of that.

> if we did that, we'd want to have additional separate variants that allow
> you to overwrite the term input rather than a factor input (inspired by
> x86's fma ops).

which in turn would require a rewrite of SV REMAP to match it
and it took about 10 weeks to write.

i am puzzled that you do not remember this, it was only about 3
days ago.

matching the exact form (assembly, reg names) is an absolute
highest inviolate top priority over everything else.

except for the poly SPR which allows gf ops to match int arith

 
> I intentionally started new sections, because i think the grouping/ordering
> makes more sense. I didn't delete the old sections yet cuz I figured we
> wanted to keep whatever info was there for reference until the new sections
> explain everything.

makes sense, do mark them as such tho


> > yes drop pseudocode in, if it's missing.
> 
> I'm likely to also edit the existing/new pseudo-code to change naming and
> stuff to make it consistent with the instructions.

yep go for it. drop into gf.py to test them as well. gf.py should become
an easy to use/read reference, sandbox for other people,
and basis of tests (eventually)
Comment 64 Luke Kenneth Casson Leighton 2022-03-10 03:50:09 GMT
(In reply to Jacob Lifshay from comment #62)

> I want to rewrite the explanations to be more mathematically and
> conceptually consistent, since the current explanations confuse/mix a lot of
> the math concepts.

totally got it, v. relieved as well. started, missed some.
Comment 65 Jacob Lifshay 2022-03-10 03:59:36 GMT
(In reply to Luke Kenneth Casson Leighton from comment #63)
> (In reply to Jacob Lifshay from comment #61)
> > we could change the madd ops to be 3-arg rather than 4:
> 
> i already explained clearly and non-negotiably the damaging
> consequences and cost of that.

I meant the non-twin madd ops (which aren't used for fft iirc). the twin madd is used for fft and would still be 4-arg.
Comment 66 Jacob Lifshay 2022-03-10 04:04:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #63)
> (In reply to Jacob Lifshay from comment #61)
> > if we did that, we'd want to have additional separate variants that allow
> > you to overwrite the term input rather than a factor input (inspired by
> > x86's fma ops).
> 
> which in turn would require a rewrite of SV REMAP to match it
> and it took about 10 weeks to write.

couldn't we just use the 3-arg SVP64 prefix? this is exactly like rlwinm except it has 3 registers.
Comment 67 Luke Kenneth Casson Leighton 2022-03-10 10:58:11 GMT
(In reply to Jacob Lifshay from comment #65)
> (In reply to Luke Kenneth Casson Leighton from comment #63)
> > (In reply to Jacob Lifshay from comment #61)
> > > we could change the madd ops to be 3-arg rather than 4:
> > 
> > i already explained clearly and non-negotiably the damaging
> > consequences and cost of that.
> 
> I meant the non-twin madd ops (which aren't used for fft iirc).

ah yes, so sorry, yes those are used in Horizontal-First FFT REMAP
mode.  that was related to complex numbers, where the "usual"
single-for-loop-on-an-instruction would only work if we had actual
complex-mul-and-add [which would be complete overkill and swamp
us with work]

instead by using Horizontal-First it was possible to use *four*
separate and distinct mul-and-adds with intermediary variables
(as scalars)

there was also iirc some interaction between fmadd.. it turned
out to be possible to usefully set up all five REMAPs (RA RB RC, RT TS)
and by a beautiful/horrendous coincidence only some of those
are used by some operations (fmadds?) and then the next instruction
used others, without needing a 2nd REMAP in between.

having worked that all out in eight severely-intensive weeks,
i don't want to go anywhere near it, again.
Comment 68 Jacob Lifshay 2022-03-10 17:00:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #67)
> (In reply to Jacob Lifshay from comment #65)
> > (In reply to Luke Kenneth Casson Leighton from comment #63)
> > > (In reply to Jacob Lifshay from comment #61)
> > > > we could change the madd ops to be 3-arg rather than 4:
> > > 
> > > i already explained clearly and non-negotiably the damaging
> > > consequences and cost of that.
> > 
> > I meant the non-twin madd ops (which aren't used for fft iirc).
> 
> ah yes, so sorry, yes those are used in Horizontal-First FFT REMAP
> mode.  that was related to complex numbers, where the "usual"
> single-for-loop-on-an-instruction would only work if we had actual
> complex-mul-and-add [which would be complete overkill and swamp
> us with work]

well, since complex numbers don't use cl*, gfb*, or gfp*, are you aware of any other issues we might encounter? if not, I don't see why changing the madd ops to 3-arg wouldn't work.

if we're too low on encoding space, we could have just the most widely used *madd be 4-arg for extra flexibility, and *msub and *msubr would be 3-arg.
Comment 69 Jacob Lifshay 2022-03-10 17:12:17 GMT
also, whatever problems we encounter for 3-arg madd are likely shared with ternlogi which is also 3-arg that both reads/writes RT.
Comment 70 Luke Kenneth Casson Leighton 2022-03-10 17:33:01 GMT
(In reply to Jacob Lifshay from comment #69)
> also, whatever problems we encounter for 3-arg madd are likely shared with
> ternlogi which is also 3-arg that both reads/writes RT.

no because i cannot see a reason why anyone would use ternlogi with
the SV REMAP FFT schedules

(In reply to Jacob Lifshay from comment #68)

> well, since complex numbers don't use cl*, gfb*, or gfp*, are you aware of
> any other issues we might encounter? if not, I don't see why changing the
> madd ops to 3-arg wouldn't work.

please: stop pushing on this.

please accept the answer no

please do not waste my time persistently asking for an explanation when
an explanation has been provided

please accept the explanation

please stop wasting your time pursuing something to which you have been
told "no" five times.
Comment 71 Luke Kenneth Casson Leighton 2022-03-10 23:42:54 GMT
https://www.nayuki.io/page/number-theoretic-transform-integer-dft
https://www.nayuki.io/res/number-theoretic-transform-integer-dft/numbertheoretictransform.py

well, i am a blithering idiot.  NTT uses prime modulo arithmetic.
it was the two numbers being identical in GF2 that tipped me off.
that did not strike me as particularly useful.

vastly more investigation needed. sigh.
Comment 72 Jacob Lifshay 2022-03-10 23:49:41 GMT
(In reply to Luke Kenneth Casson Leighton from comment #71)
> https://www.nayuki.io/page/number-theoretic-transform-integer-dft
> https://www.nayuki.io/res/number-theoretic-transform-integer-dft/
> numbertheoretictransform.py
> 
> well, i am a blithering idiot.  NTT uses prime modulo arithmetic.

that's GF(p) FFT.

I expect carry-less FFT and GF(2^n) FFT to be useful as well.
Comment 73 Luke Kenneth Casson Leighton 2022-03-11 00:11:32 GMT
https://www.nayuki.io/page/montgomery-reduction-algorithm
apparently speeds up chains of modulo multiplies.
i recognise the reciprocal function in the python
version as the gf_invert which did not work.
Comment 74 Luke Kenneth Casson Leighton 2022-03-11 00:18:01 GMT
(In reply to Jacob Lifshay from comment #72)

> that's GF(p) FFT.

sigh that is the correct one used for Reed Solomon and ECC25519

> I expect carry-less FFT and GF(2^n) FFT to be useful as well.

Frobenius Additive FFT apparently.

https://news.ycombinator.com/item?id=21151646
Comment 75 Luke Kenneth Casson Leighton 2022-03-11 00:33:51 GMT
https://github.com/fast-crypto-lab/Frobenius_AFFT

that's totally opaque to me until i realised, it
is FFT with mul-add as modulo-mul and XOR.

then i also remembered that it is DCT that has the
mul-with-add-and-sub (which is meaningless for GF2
because add and sub are the same)

whereas FFT is:

  RT = MUL(RA, RC) + RB
  RS = RA - RB
Comment 76 Jacob Lifshay 2022-03-11 00:35:29 GMT
(In reply to Luke Kenneth Casson Leighton from comment #74)
> (In reply to Jacob Lifshay from comment #72)
> 
> > that's GF(p) FFT.
> 
> sigh that is the correct one used for Reed Solomon and ECC25519

Reed Solomon uses either carry-less or GF(2^n) iirc.

ECC uses either GF(2^n) or GF(p) (depending on the variety of ECC), except that the word size is usually > 64-bits, so we may need to use carry-less or big integer operations there.
Comment 77 Luke Kenneth Casson Leighton 2022-03-11 01:09:36 GMT
(In reply to Luke Kenneth Casson Leighton from comment #75)
> https://github.com/fast-crypto-lab/Frobenius_AFFT

i'm not going completely mad.  other way round: Frobenius Additive FFT
appears to use the same DCT mul-add-sub

Floating Twin Multiply-Add DCT [Single]
A-Form

fdmadds FRT,FRA,FRC,FRB (Rc=0)
fdmadds. FRT,FRA,FRC,FRB (Rc=1)
Pseudo-code:

FRS <- FPADD32(FRA, FRB)
sub <- FPSUB32(FRA, FRB)
FRT <- FPMUL32(FRC, sub)

https://libre-soc.org/openpower/isa/svfparith/

this is enough operations to justify moving them
to their own major opcode otherwise they overwhelm
the bitmanip one.

i will see how much space is in the FP opcodes
Comment 78 Luke Kenneth Casson Leighton 2022-03-13 22:32:10 GMT
gfbmadd RT, RA, RB, RC
(RT) = gfbadd(gfbmul((RA), (RB)), (RC))

jacob you modified the ordering and naming
of the registers.

when i said "these have to be exactly as the int/fp operations",
it means, "these HAVE to be EXACTLY as the int/fp operations".

please therefore restore them to the original conventions.

please look at the Power ISA spec (madd, fmadd, fmadds),
observe that the naming is as i originally specified  and make
absolutely certain that when implementing these that they are
wholly, one hundred percent, without exception, identically
laid out and utilising the exact same register order, naming,
and conventions.

we do not have any choice here, these conventions have already
been predefined by the previous ISA Architects. plus SV REMAP
is critically relying on the naming and ordering.
Comment 79 Jacob Lifshay 2022-03-13 23:15:14 GMT
(In reply to Luke Kenneth Casson Leighton from comment #78)
> gfbmadd RT, RA, RB, RC
> (RT) = gfbadd(gfbmul((RA), (RB)), (RC))
> 
> jacob you modified the ordering and naming
> of the registers.

well, what I put exactly matches maddld:
maddld RT,RA.RB,RC
prod0:127 ← (RA) × (RB)
sum0:127 ← prod + EXTS(RC)
RT ← sum64:127
Comment 80 Luke Kenneth Casson Leighton 2022-03-13 23:35:12 GMT
(In reply to Jacob Lifshay from comment #79)

> well, what I put exactly matches maddld:
> maddld RT,RA.RB,RC
> prod0:127 ← (RA) × (RB)
> sum0:127 ← prod + EXTS(RC)
> RT ← sum64:127

drat, drat.  and i based the GF twin ops on
FP ones... damnit that's going to have ramifications
for SV REMAP (madd* is not implemented yet)
last thing i want to dig into right now.

not sure what's best to do here.  the twin ops
can be left for later, as long as the opcode space
is allocated.
Comment 81 Jacob Lifshay 2022-03-15 08:58:00 GMT
I added separate python files for clmul, clmulh, clmulr, and cldivrem (helper function for cldiv and clrem), and added a unittest-based test (test_cl.py) for them all.

https://git.libre-soc.org/?p=libreriscv.git;a=tree;f=openpower/sv/bitmanip;hb=4852bd74b9e2ec7f4c8ae8c0677daa4402314a1b

I also changed bitmanip.mdwn to embed the python files into it, rather than having separately maintained code in the markdown, allowing us to test everything and import it much easier, since I'm planning on using the cf* python code in the definitions of the GF(2^n) ops.

I think this approach is better than having to constantly keep gf2.py and bitmanip.mdwn in sync and having the DIY tests in gf2.py where the user has to manually compare print outputs.
Comment 82 Jacob Lifshay 2022-03-16 04:19:21 GMT
I added python implementations and tests for all the gfb* instructions
Comment 83 Luke Kenneth Casson Leighton 2022-03-16 12:53:23 GMT
(In reply to Jacob Lifshay from comment #81)

> I think this approach is better than having to constantly keep gf2.py and
> bitmanip.mdwn in sync and having the DIY tests in gf2.py where the user has
> to manually compare print outputs.

absolutely.

(In reply to Jacob Lifshay from comment #82)
> I added python implementations and tests for all the gfb* instructions

great. btw i altered the twin-instruction to be mul-add for RT
and just add for RS. this matches with the DCT twin-instruction
interestingly, and i believe is the basis of additive NTT.

about which: i am beginning to question the value of adding GF-Prime.
reason is: ecc25519 requires 256-bit arithmetic and we ain't addin'
tha' so have to use bigint math anyway.

therefore if doing bigint math is there anything that can be added
which will speed that up?
Comment 84 Jacob Lifshay 2022-03-16 13:59:07 GMT
(In reply to Luke Kenneth Casson Leighton from comment #83)
> (In reply to Jacob Lifshay from comment #82)
> > I added python implementations and tests for all the gfb* instructions
> 
> great. btw i altered the twin-instruction to be mul-add for RT
> and just add for RS. this matches with the DCT twin-instruction
> interestingly, and i believe is the basis of additive NTT.

imho that needs some testing to ensure it gives the same answers as other algorithms, such as that one I found in sympy, otherwise we may need both (or even more) kinds of twin-mul-add-subr.

> 
> about which: i am beginning to question the value of adding GF-Prime.

yeah, it may not be used much by common cryptography, but it definitely isn't useless...assuming it's fast enough, it would definitely be useful for modular arithmetic for small (<=64-bit) moduli, which is used in a bunch of core algorithms for computer algebra systems (e.g. polynomial factorization by the currently fastest known method, which is used for exact computations involving algebraic numbers (the "real" type in SMTLIB2, the interface language used by most of the formal proof software we've been using such as z3, yices, etc.). algebraic numbers are also used by the simple-soft-float library I wrote. algebraic numbers are what you get if you ask a symbolic math library to find all roots of some arbitrary polynomial with rational coefficients, which is pretty core functionality to solving most equations.)

> reason is: ecc25519 requires 256-bit arithmetic and we ain't addin'
> tha' so have to use bigint math anyway.
> 
> therefore if doing bigint math is there anything that can be added
> which will speed that up?

hmm, for generic bigint math, a 128 by 64-bit div and rem (used by bigint div/rem)? openpower only provides a version where the 128-bit dividend has all zeros in the lower half, requiring a bunch of extra instructions to resynthesize what the hardware can already do at low additional cost.

A 128-bit to 64-bit shift like x86 shld https://www.felixcloutier.com/x86/shld would also be nice (used by bigint shifting).

also, some instructions for getting the high half of a 64-bit * 64-bit + 128-bit muladd/mulsub (used for bigint * 64-bit number, and probably bigint*bigint), but that would require some thought to avoid being 4-in 1-out instructions.
Comment 85 Jacob Lifshay 2022-03-18 05:07:57 GMT
I added python implementations for all the gfp* instructions, as well as tests (which pass).

I also added descriptions of the FFT twin mul-adds (all 3 of them) so that I could finish deleting the redundant sections.

According to Wikipedia, NTT is the correct name only for GF(p) (or other rings of integers mod m with some restrictions), the others are just weird DFTs.
https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)#Number-theoretic_transform

It occurred to me, that the gfp* instructions should really be named the modular arithmetic instructions, since they're all useful for composite moduli too. GF(p) is only defined if p is a prime. gfpinv will have more inputs than just zero that don't work if the modulus isn't a prime.