Create a suite of REMAP modes suitable for triangular, rhombus, and
modulo patterns:

    xxxx    xxxx    1234
     xxx    xxxx    4123
      xx    xxxx    3412
       x    xxxx    2341

---

* TODO: note the inner loop notification used by "svstep." in
  Vertical-First https://bugs.libre-soc.org/show_bug.cgi?id=1044#c16
* TODO: design/spec up the three REMAP modes, which involves a new svshape3
* TODO: write python "yield" versions (see dct, fft and other REMAP modes)
  with mini demos / unit tests
* TODO: confirm that the spec changes are ok (see comment #5)
* TODO: (IN A BRANCH FOR REVIEW) make the spec changes
* TODO: (in same branch) using the yielder(s), add actual REMAP unit tests
* TODO: review branch
* TODO: write PR/blog about bigmul (budget under another bug, TODO link)
* TODO: commit and close

Basic algorithm for rhombus:

    void mul(uint64_t *product, const uint64_t *a, const uint64_t *b,
             size_t a_sz, size_t b_sz)
    {
        // only the low b_sz words need zeroing: every higher word is
        // written (as a row carry) before it is ever read
        for (size_t bi = 0; bi < b_sz; bi++) {
            product[bi] = 0;
        }
        for (size_t ai = 0; ai < a_sz; ai++) {
            uint64_t carry = 0;
            // the inner loop runs over b's size, not a's
            for (size_t bi = 0; bi < b_sz; bi++) {
                uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
                v += (uint128_t)carry;
                v += (uint128_t)product[ai + bi];
                carry = v >> 64;
                product[ai + bi] = (uint64_t)v;
            }
            product[ai + b_sz] = carry;
        }
    }

---

instruction forms key:

* ModS  - summation with modulo
* ModM  - matrix with modulo
* Sq    - Square
* Tr    - Triangle
* pr    - permute x/y y/x
* col   - column / row
* sk1   - skip X
* invxy - invert x, invert y

svshape3 TODO
--------

    814 | 0:5|6:10 |11:15 |16:20 | 21:25 26   | 27:31 | Form      |
    815 | -- | --  | ---  | ---- | ---------- | ----- | --------- |
    816 | PO | x   | y    | z/XXX| XX/yx/XX/X | XO    | TODO-Form |

based on svshape (svshape SVxd,SVyd,SVzd,SVrm,vf) but z here is only
2 bits, and setting vf as well is too much...
oh wait:

* 16-17: z
* 18: ModM/ModS
* 19: Sq/Tr
* 20: col
* 21: sk
* 22: pr
* 23: SVyx (invxy)
* 24: 0
* 25-26: si (SVSHAPE0-3 Index)

however all bits 21:25 can be just an operand "SVrm"

    331 # 1.6.35 SVM-FORM
    332 |0   |6     |11    |16    |21    |25 |26    |31 |
    333 | PO | SVxd | SVyd | SVzd | SVrm |vf | XO       |

new fields, SVz2 and sm2 (2-bit SVzd and submode2). SVrm can remain
as-is and be described in a separate table

    |16 18|19 20|21 24|
    |sm2  |SVz2 |SVrm |

SVrm:

    |0               4|
    |sk pr SVyx col vf|

svshape4, 2nd mode, fit in PO5
--------

| 0:5|6:8|9:10|11:15|16:20| 21 22:23 24:25 |26:27|28  |29:30|31  | Form      |
| -- |---|----| --- | ----| -------------- | --- |--  |-----|--- | --------- |
| PO |SVo|si  | y   | x   | sk /sm2 / sm   | XO  |SVyx|XO   |SVyx| SVI2-Form |

instruction operands for svshape4:

    svshape4 x,y,SVo,SVyx,sk,sm2,sm,si

where

* sm2 (submode2) covers rs/xx/yy below (MSB0 order: sm = submode[1:2])
  (rs: reserved)
* si (SVSHAPE INDEX 0-3)
* sm (submode) selects ModS/ModM and Sq/Tr as below

|0:5   |6:11  |12:17   |18 19 20 |21 22:23 |24:27 |28:29 |30:31|mode   |
|xdimsz|ydimsz| zdimsz |rs/nx/ny |sk1/invxy|offset| 0b00 |0b11 |ModS/Sq|
|xdimsz|ydimsz| zdimsz |rs/pr/col|sk1/invxy|offset| 0b01 |0b11 |ModM/Sq|
|xdimsz|ydimsz| zdimsz |rs/nx/ny |sk1/invxy|offset| 0b10 |0b11 |ModS/Tr|
|xdimsz|ydimsz| zdimsz |rs/pr/col|sk1/invxy|offset| 0b11 |0b11 |ModM/Tr|
(In reply to Jacob Lifshay from comment #0)
> Create a REMAP mode

>     for(size_t ai = 0; ai < a_sz; ai++) {
>         uint64_t carry = 0;
>         for(size_t bi = 0; bi < a_sz; bi++) {
>             uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];

the basics for the bigmulremap fn become:

    for i in range(asz):
        for j in range(bsz):
            yield i, j, i+j

however additional modes then include options to do

    yield i, j, ((i+j) % bsz)

plus "reversing" options (horiz, vert, h&v) and perhaps 2D mirroring as
well. that's 5 bits of mode, which is quite enough.

i know that konstantinos may have something to add about the
motion-estimate algorithm he did: i know there is a similar pattern but
it is... i//2, j, i//2+j or something like that. this may be a bit much
though. have to be very careful as there is not much space left in
svshape.
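a minimal standalone sketch of those mode options (the function name and
keyword flags here are illustrative only, not the final bit encoding):

```python
def bigmulremap(asz, bsz, modulo=False, rev_i=False, rev_j=False):
    """Sketch of the basic bigmulremap yielder: (i, j, i+j) triples,
    with the ((i+j) % bsz) variant and the horiz/vert "reversing"
    options selectable by (hypothetical) keyword flags."""
    for i in range(asz):
        ri = asz - 1 - i if rev_i else i        # "vert" reverse
        for j in range(bsz):
            rj = bsz - 1 - j if rev_j else j    # "horiz" reverse
            k = (ri + rj) % bsz if modulo else ri + rj
            yield ri, rj, k
```

for asz=2, bsz=3 the modulo variant yields the third-index sequence
0,1,2 then 1,2,0 - the wrap-around diagonal pattern from the top of the
bug.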
space in SHAPE SPRs

https://libre-soc.org/openpower/sv/remap/

|0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------ | ------- | ------- |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz | permute | invxyz  |offset|skip   |mode |Matrix |
|xdimsz|ydimsz|SVGPR   | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
|xdimsz|mode  | zdimsz | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
| rsvd |rsvd  |xdimsz  | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
|      |      |        |         |         |      |       |0b11 |rsvd   |

=>

| rsvd |rsvd  |xdimsz  | 00      | invxyz  |offset|submode|0b10 |Red/Sum|
| rsvd |ydimsz|xdimsz  | 01      | invxy 0 |offset|mirror |0b10 |Bigmul |

or

| rsvd |rsvd  |xdimsz  | rsvd    | invxy 0 |offset|submode|0b10 |Red/Sum|
| rsvd |ydimsz|xdimsz  | rsvd    | invxy 1 |offset|submode|0b10 |Bigmul |

have to check z of invxyz is free
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=33a02e6612065f290d840e15a596dfc2177de5e5;hb=f746c31efca189fb09aec2e661dba686616a846f#l320

unused (not set). i did tell you repeatedly to enable inversion of both
x and y in parallel-reduce REMAP. z is free so can be used to shoe-horn
bigmul in it.

next step, the instruction format
svshape SVxd,SVyd,SVzd,SVRM,vf

    0:5  6:10  11:15  16:20  21:24  25  26:31  name
    PO   SVxd  SVyd   SVzd   SVRM   vf  XO     svshape

Fields:

* SVxd - SV REMAP "xdim" (X-dimension)
* SVyd - SV REMAP "ydim" (Y-dimension, sometimes used for sub-mode
  selection)
* SVzd - SV REMAP "zdim" (Z-dimension)
* SVRM - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)

ok so SVyd is the de-facto selector for sub-modes: SVyd "bit 0" can be
used for selecting bigmul/prefix, under SVRM=0b0111, whereupon bits 1-2
can be used for inversion and 2-3 for mirror... damn damn damn, 5 bits
are needed: one for long-mul, one for carrysave.

ah hang on, i know: "/mrr" performs a full reverse, which in combination
with inverting either x or y order *back* again (a single bit) gives you
the full complement. i think. so:

* bit 0 select bigmul or prefix
* bit 1 select carrysave or longmul
* bit 2 select invert x (or y?)
* bit 3 mirror x (5 45 345 2345 12345 012345 instead of 0 01 012 ...)
* bit 4 mirror y (ditto)
we will need an svshape3. damn. https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-September/005711.html
okaaay, there are some necessary steps here, but there are *two*
branches needed, which is a royal pain in the ass (openpower-isa, spec).
what i'm proposing - bear in mind this is still PO22 - is to move
bincrflut and binlut to the "- 10 00 |0" row, leaving their current
locations free for svshape3 and 4. that way it is possible to do a
svshape3, bringing in more space for REMAP options, which will be
identical to svshape.

| NN |    |    |    |       | - 10 00  |0 | rsvd      | rsvd      |
| NN |    |    |    |       | - 11 00  |0 | rsvd      | rsvd      |
| NN | RT | RA | RB | RC    | nh 00 00 |1 | binlut    | VA-Form   |
| NN | RT | RA | RB | /BFA/ | 0 01 00  |1 | bincrflut | VA-Form   |
| NN |    |    |    |       | 1 01 00  |1 | svindex   | SVI-Form  |
| NN | RT | RA | RB | mode  | L 10 00  |1 | bmask     | BM2-Form  |
| NN |    |    |    |       | 0 11 00  |1 | svshape   | SVM-Form  |
| NN |    |    |    |       | 1 11 00  |1 | svremap   | SVRM-Form |

git diffs:

* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=261c0c4a
* https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ac538421
I came up with a somewhat different bigint mul remap algorithm, it has a
few issues though... I also rebased the branch on latest master.

    commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
    Author: Jacob Lifshay <programmerjake@gmail.com>
    Date:   Thu Dec 21 21:10:11 2023 -0800

        tests/bigint/powmod: initial version of bigint multiply remap

it has some issues around being able to encode scalar RS but vector RT,
and doesn't match the scalar * vector multiplication pattern, but is
quite compact.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=73d87b35dd8df841596fff4f5153464362882917;hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l102

python version:

    y = [0] * (a_sz + b_sz)
    ca = 0
    for i in range(a_sz * b_sz):
        # no need to clear ca between ai outer loops, since the partial
        # products can't get big enough to have a carry out, so ca will
        # always be zero when (i % b_sz == 0).
        # That said, hardware will probably want to pattern-match this to
        # remove the unnecessary dependency through ca.
        y[a_plus_b_idx[i]], t = maddedu(
            a[a_idx[i]], b[b_idx[i]], y[a_plus_b_idx[i]])
        y[a_plus_b_plus_1_idx[i]], ca = adde(
            y[a_plus_b_plus_1_idx[i]], t, ca)
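for reference, a fully self-contained sketch of the same scheme: the
maddedu/adde models and the shortened index-list names (a_idx, b_idx,
ab_idx, ab1_idx) are illustrative stand-ins, not the actual powmod.py
definitions, and the result can be checked directly against Python's
native bigints:

```python
WORD = 64
MASK = (1 << WORD) - 1

def maddedu(a, b, c):
    # model of maddedu: RT = low word of a*b+c, RS = high word
    v = a * b + c
    return v & MASK, v >> WORD

def adde(a, b, ca):
    # model of adde: 64-bit add with carry-in, carry-out
    v = a + b + ca
    return v & MASK, v >> WORD

def mul_remap(a, b):
    a_sz, b_sz = len(a), len(b)
    # precompute the four remap index lists, as powmod.py does
    a_idx, b_idx, ab_idx, ab1_idx = [], [], [], []
    for ai in range(a_sz):
        for bi in range(b_sz):
            a_idx.append(ai)
            b_idx.append(bi)
            ab_idx.append(ai + bi)
            ab1_idx.append(ai + bi + 1)
    y = [0] * (a_sz + b_sz)
    ca = 0
    for i in range(a_sz * b_sz):
        y[ab_idx[i]], t = maddedu(a[a_idx[i]], b[b_idx[i]], y[ab_idx[i]])
        y[ab1_idx[i]], ca = adde(y[ab1_idx[i]], t, ca)
    return y
```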
(In reply to Jacob Lifshay from comment #7)
> I came up with a somewhat different bigint mul remap algorithm, it has a few
> issues though...I also rebased the branch on latest master.

so, basically, we have to decide if we want a really compact but
somewhat problematic loop, or a somewhat bigger loop (algorithm in
comment #0 and copied below) that's likely easier for hw to handle
(since it matches the scalar * vector multiply pattern more closely):

    void mul(uint64_t *product, const uint64_t *a, const uint64_t *b,
             size_t a_sz, size_t b_sz)
    {
        // only the low b_sz words need zeroing: every higher word is
        // written (as a row carry) before it is ever read
        for (size_t bi = 0; bi < b_sz; bi++) {
            product[bi] = 0;
        }
        for (size_t ai = 0; ai < a_sz; ai++) {
            uint64_t carry = 0;
            // the inner loop runs over b's size, not a's
            for (size_t bi = 0; bi < b_sz; bi++) {
                uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
                v += (uint128_t)carry;
                v += (uint128_t)product[ai + bi];
                carry = v >> 64;
                product[ai + bi] = (uint64_t)v;
            }
            product[ai + b_sz] = carry;
        }
    }
(In reply to Jacob Lifshay from comment #7)
> I came up with a somewhat different bigint mul remap algorithm, it has a few
> issues though...I also rebased the branch on latest master.
>
> commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Thu Dec 21 21:10:11 2023 -0800
>
>     tests/bigint/powmod: initial version of bigint multiply remap
>
> it has some issues around being able to encode scalar RS but
> vector RT, and doesn't match the scalar * vector multiplication
> pattern, but is quite compact.

i was expecting / anticipating use of vertical-first, this is
different / cool, how does it work?

> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/
> bigint/powmod.py;h=73d87b35dd8df841596fff4f5153464362882917;
> hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l102

please remove this:

    105  # run this file in a debugger to see all the intermediate values.

(use log statements instead)

> python version:
>     y = [0] * (a_sz + b_sz)
>     ca = 0
>     for i in range(a_sz * b_sz):
>         # no need to clear ca between ai outer loops, since the partial
>         # products can't get big enough to have a carry out, so ca will
>         # always be zero when (i % b_sz == 0).
>         # That said, hardware will probably want to pattern-match this to
>         # remove the unnecessary dependency through ca.
>         y[a_plus_b_idx[i]], t = maddedu(
>             a[a_idx[i]], b[b_idx[i]], y[a_plus_b_idx[i]])
>         y[a_plus_b_plus_1_idx[i]], ca = adde(
>             y[a_plus_b_plus_1_idx[i]], t, ca)

reduce these variable names down to fit on one line, e.g. ab1i, then add
a comment on initialisation. i'll take a closer look, as it will take
some time for the impact to sink in. if done as a "yielder" i think i
will get it immediately.
like this (which is entirely standalone executable, *no* use of *any*
external modules, this is very deliberate):

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=a15c5bf7871507b34a6e2a043e73284e60e22bec;hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l85

that's the next code-morph step.
(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #7)
> > I came up with a somewhat different bigint mul remap algorithm, it has a few
> > issues though...I also rebased the branch on latest master.
> >
> > commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> > Author: Jacob Lifshay <programmerjake@gmail.com>
> > Date:   Thu Dec 21 21:10:11 2023 -0800
> >
> >     tests/bigint/powmod: initial version of bigint multiply remap
> >
> > it has some issues around being able to encode scalar RS but
> > vector RT, and doesn't match the scalar * vector multiplication
> > pattern, but is quite compact.
>
> i was expecting / anticipating use of vertical-first, this is
> different / cool, how does it work?

it is vertical first. basically, I realized that, if you do it in the
right sequence, you can compute a[:] * b[:] by just computing each
partial product a[ai] * b[bi] and adding each one in the right place in
the final result:

    y = [0] * ...
    for ai, bi in right_sequence():
        partial_product = a[ai] * b[bi]
        y += partial_product << (word_size * (ai + bi))

this becomes a vertical first loop:

    loop:
        sv.maddedu *y, *a, *b, *y   # RS is stored in scalar t
        sv.adde *(y + 1), *(y + 1), t
        svstep.
        sv.bdnz ... loop

> if done as a "yielder" i think i will get it immediately.
> like this (which is entirely standalone executable, *no*
> use of *any* externl modules, this is very deliberate)

well, notice it builds the list of remap indexes, and then the loop that
actually operates on the data just goes through those indexes in
sequence, doing (combination python and sv asm syntax):

    *y, t = maddedu(*a, *b, *y)
    *(y + 1), ca = adde(*(y + 1), t, ca)

so the "yielder" (if you mean the code that produces remap indexes) is
just the loops building the indexes, with:

    some_list.append(the_index)

replaced with:

    yield the_index
ok gimme an hour or so to do a code-morph, it will also help me to understand. there's something cool going on here
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > (In reply to Jacob Lifshay from comment #7)
> > > I came up with a somewhat different bigint mul remap algorithm, it has a few
> > > issues though...I also rebased the branch on latest master.
> > >
> > > commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> > > Author: Jacob Lifshay <programmerjake@gmail.com>
> > > Date:   Thu Dec 21 21:10:11 2023 -0800
> > >
> > >     tests/bigint/powmod: initial version of bigint multiply remap
> > >
> > > it has some issues around being able to encode scalar RS but
> > > vector RT, and doesn't match the scalar * vector multiplication
> > > pattern, but is quite compact.
> >
> > i was expecting / anticipating use of vertical-first, this is
> > different / cool, how does it work?
>
> it is vertical first.

ahhh

> basically, I realized that, if you do it in the right sequence, you can
> compute a[:] * b[:] by just computing each partial product a[ai] * b[bi] and
> adding each one in the right place in the final result:
>
>     y = [0] * ...
>     for ai, bi in right_sequence():
>         partial_product = a[ai] * b[bi]
>         y += partial_product << (word_size * (ai + bi))

... assuming y is a massive long "thing" of bitsize wordsize*(ai+bi) but
the subdivisions are over wordsize boundaries. i start to get it. neat.

> this becomes a vertical first loop:
>     loop:
>         sv.maddedu *y, *a, *b, *y # RS is stored in scalar t
>         sv.adde *(y + 1), *(y + 1), t
>         svstep.
>         sv.bdnz ... loop

briiilliant. i love it! that's the kind of laughably-short assembler we
need :)

> well, notice it builds the list of remap indexes, and then the loop that
> actually operates on the data just goes through those indexes in sequence,

see diff below, i split out those two roles explicitly separately, as
that's the next step in the code-morph sequence.

> doing (combination python and sv asm syntax):
>     *y, t = maddedu(*a, *b, *y)
>     *(y + 1), ca = adde(*(y + 1), t, ca)

love it.
> so the "yielder" (if you mean the code that produces remap indexes) is just
> the loops building the indexes, with:
>     some_list.append(the_index)
> replaced with:
>     yield the_index.

yes. done.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=aa6f10d1f2

that's how all the ISACaller REMAP yields work, but if there's no
yielder module (e.g. remapyield.py) then ISACaller can't *have* that
functionality and we can't do the tests :)

okaay, next task now we have the actual algorithm
python_mul_remap_yielder(a_sz, b_sz) is to work out how it's going to
fit into SVSHAPE0-3. that process is started at comment #2
(In reply to Luke Kenneth Casson Leighton from comment #12)
> okaay next task now we have the actual algorithm
> python_mul_remap_yielder(a_sz, b_sz)
> is to work out how it's going to fit into SVSHAPE0-3.
> that process is started at comment #2

before we get ahead of ourselves, we need to make sure that remap
sequence actually works, which requires figuring out how maddedu can
have scalar RS and vector RT, RA, RB, and RC:

> > doing (combination python and sv asm syntax):
> > *y, t = maddedu(*a, *b, *y)
i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits
(e.g. a flag that repurposes some bits somewhere), then we can use them
like so:

    000 normal matrix mode
    001 x
    010 y
    011 x + y
    100 z
    101 x + z
    110 y + z
    111 x + y + z

this gives us all the different things matrix mode can do for each of
the x,y,z axes, and lets us extract just one axis, or sum axes together.
invxyz lets us subtract them too.
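a sketch of what the 3-bit selector could generate, assuming dims
(xd, yd, zd) with x innermost; the bit assignments and function name
here are illustrative, not a spec:

```python
def axis_sum_iter(xd, yd, zd, sel):
    # sketch of the proposed 3-bit selector: sel bits (illustrative)
    # 0b001=x, 0b010=y, 0b100=z; sel=0b000 is normal matrix mode
    for z in range(zd):
        for y in range(yd):
            for x in range(xd):
                if sel == 0:
                    yield (z * yd + y) * xd + x   # plain matrix index
                else:
                    s = 0                         # sum selected axes
                    if sel & 0b001:
                        s += x
                    if sel & 0b010:
                        s += y
                    if sel & 0b100:
                        s += z
                    yield s
```

with xd=3, yd=2, zd=1 and sel=0b011 (x+y) this walks the diagonal sums
0,1,2 then 1,2,3 - exactly the index stream bigmul needs.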
ok so this is the primary "yielder" function, which, fascinatingly: the
first two indices (ai, bi) are covered by the standard matrix and
svindex REMAPs, and the other two are covered by exactly the middle of
the three patterns from comment #0, just offset by one in the case of
the 4th.

    def python_mul_remap_yielder(a_sz, b_sz):
        for ai in range(a_sz):
            for bi in range(b_sz):
                yield ai, bi, ai + bi, ai + bi + 1

i'm almost tempted to suggest that the offset (+1) be a parameter,
but... hmmm... is there space to do that (in SVSHAPE0-3 i mean)? that's
going to be the next critical task, but i feel that we may need a few
more examples here so as not to run out of bits in SVSHAPEn. hmm hmm,
not sure, tempted also to just go for it anyway. next task definitely to
choose a format from comment #2

> space in SHAPE SPRs
>
> https://libre-soc.org/openpower/sv/remap/
>
> |0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------ | ------- | ------- |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz | permute | invxyz  |offset|skip   |mode |Matrix |
> | rsvd |rsvd  |xdimsz  | rsvd    | invxy 0 |offset|submode|0b10 |Red/Sum|
> | rsvd |ydimsz|xdimsz  | rsvd    | invxy 1 |offset|submode|0b10 |Bigmul |

still TODO:

> have to check z of invxyz is free

(and jacob, you still need to implement inversion in Red/Sum, i'm not
going to do it for you. it doesn't matter what *you* think it might or
might not be used for: we present people with the option and they
explore over the next 1-10+ years)

* the offset there (bits 24-27) will work out (the "+1" for example)
* rsvd bits 0:5 can be the "modulo" which if set to 0 means "no modulo";
  this will give the repeating modulo pattern
* permute (bits 18:20) will say what the order of the for-loops is
  ("for ai for bi" vs "for bi for ai")
* submode specifies triangle or rhombus
* invxy says whether ai goes 0-xdimsz or xdimsz-0, and likewise bi

it's all there i believe?
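the claim about the four index streams can be demonstrated standalone by
splitting them apart (python_mul_remap_yielder copied from above):

```python
def python_mul_remap_yielder(a_sz, b_sz):
    for ai in range(a_sz):
        for bi in range(b_sz):
            yield ai, bi, ai + bi, ai + bi + 1

# separate the four index streams (a_sz=2, b_sz=3)
ai_s, bi_s, ab_s, ab1_s = map(list, zip(*python_mul_remap_yielder(2, 3)))
assert ai_s == [0, 0, 0, 1, 1, 1]    # standard matrix "row" walk
assert bi_s == [0, 1, 2, 0, 1, 2]    # standard matrix "column" walk
assert ab_s == [0, 1, 2, 1, 2, 3]    # the rhombus/diagonal pattern
assert ab1_s == [1, 2, 3, 2, 3, 4]   # same pattern, offset by one
```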
oo hang on, i have my head round this now :) so i have a question: at
the cross-over point to move to the next row (the outer loop), is it
*guaranteed* that ca=0, such that it will not interfere with the new
row? i.e. at the end of each row, which is mul-adding the most
significant word-partial-result, will the adde *always* produce a ca=0?
(i'm catching up on other comments)
(In reply to Jacob Lifshay from comment #14)
> i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits
> (e.g. a flag that repurposes some bits somewhere), then we can use them like
> so:
>     000 normal matrix mode
>     001 x
>     010 y
>     011 x + y
>     100 z
>     101 x + z
>     110 y + z
>     111 x + y + z

*deep breath*... i love it!

> this gives us all the different things matrix mode can do for each of the
> x,y,z axes, and lets us extract just one axis, or sum axes together. invxyz
> lets us subtract them too.

aaawesome. no, that's a bloody good idea. let me walk through it on
SVSHAPE (from comment #2) and also do some clarification

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0b934a4d48d

> https://libre-soc.org/openpower/sv/remap/
>
> |0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------ | ------- | ------- |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz | permute | invxyz  |offset|skip   |0b00 |Matrix |
> |xdimsz|ydimsz|SVGPR   | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
> |xdimsz|mode  | zdimsz | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
> | rsvd |rsvd  |xdimsz  | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
> |      |      |        |         |         |      |       |0b11 |rsvd   |

we have mode=0b11 free! i wasn't planning to allocate it as it's
extremely precious space in what is a seriously loaded 32-bit SPR, but
the idea of adding up (and subtracting) the three dimensions x y z is a
damn good one, let's roll with it.
|0:5   |6:11  | 12:17  | 18:20   | 21:23  |24:27 |28:29 |30:31| Mode  |
|----- |----- | ------ | ------- | ------ |------|------|---- | ----- |
|xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip  |0b01 |BigMul |

hmm hmm hmm, skip doesn't make sense here, so it can be used as a
submode:

|0:5   |6:11  | 12:17  | 18:20   | 21:23  |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------ | ------- | ------ |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz | permute | invxyz |offset|submode|0b01 |BigMul |

if a dimension is zero (remember, the encoding is 0=1 1=2 2=3 because
the for-loops go "for i = 0 to xdimsz INCLUSIVE") then it can mean
"disabled".

submode possibilities:

* submode=0b00 - triangular, no modulo
* submode=0b01 - rhombus, no modulo
* submode=0b10 - modulo-with-the-offset-added in the X dimension
* submode=0b11 - modulo-with-the-offset-added in the Y dimension

there's not enough encoding space to do "modulo-with-offset in Z" and i
would really like to leave permute=0b110/0b111 for another encoding
(reserved) for future REMAPs. (in the case of matrix - mode=0b00 -
"Indexed" REMAP was embedded into mode=0b00,permute=0b110/0b111, and for
symmetry, as well as saving hardware gates, i'd like the same thing to
be done for this BigMul REMAP as well)

awesome idea jacob, to add up the values. actually really simple to
implement, as well. but let's not complicate remapyield.py, let's take a
copy of it, ok?
(In reply to Luke Kenneth Casson Leighton from comment #16)
> oo hang on, i have my head round this now :) so i have a question:
> at the cross-over point to move to the next row (the outer loop)
> is it *guaranteed* that ca=0 such that it will not interfere with
> the new row?

yes afaict. I did an exhaustive check for word sizes of 2 bits and
3-word by 3-word multiplication (code below), so this formally proves
that case. I have no reason to think that doesn't extend to more words
and/or bigger words, so it should work.

exhaustive check code (not pretty, but works):

>>> def maddedu(a, b, c): return (a*b+c)%4, (a*b+c)//4
...
>>> def adde(a, b, c): return (a+b+c)%4, (a+b+c)//4
...
>>> def mul(a, b):
...     y = [0]*(len(a)+len(b))
...     ca = 0
...     for ai in range(len(a)):
...         for bi in range(len(b)):
...             y[ai+bi], t = maddedu(a[ai], b[bi], y[ai+bi])
...             y[ai+bi+1], ca = adde(y[ai+bi+1], t, ca)
...     return y
...
>>> def i2l(v, l): return [(v >> (i * 2)) % 4 for i in range(l)]
...
>>> def l2i(l): return sum(v << (i * 2) for i, v in enumerate(l))
...
>>> for a in range(64):
...     for b in range(64):
...         expected = a * b
...         v = l2i(mul(i2l(a, 3), i2l(b, 3)))
...         assert expected == v
...
>>>
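the same check can be wrapped up with the word size and operand lengths
as parameters, so other small cases are one call away (a restructuring
of the REPL session above, nothing new):

```python
def check(word_bits, a_words, b_words):
    # sweep *every* a_words-by-b_words multiply at the given word size
    # and compare against Python's native bigint multiply
    base = 1 << word_bits

    def maddedu(a, b, c):
        return (a * b + c) % base, (a * b + c) // base

    def adde(a, b, c):
        return (a + b + c) % base, (a + b + c) // base

    def mul(a, b):
        y = [0] * (len(a) + len(b))
        ca = 0
        for ai in range(len(a)):
            for bi in range(len(b)):
                y[ai + bi], t = maddedu(a[ai], b[bi], y[ai + bi])
                y[ai + bi + 1], ca = adde(y[ai + bi + 1], t, ca)
        return y

    def i2l(v, n):
        return [(v >> (i * word_bits)) % base for i in range(n)]

    def l2i(l):
        return sum(v << (i * word_bits) for i, v in enumerate(l))

    for a in range(base ** a_words):
        for b in range(base ** b_words):
            assert l2i(mul(i2l(a, a_words), i2l(b, b_words))) == a * b

check(2, 3, 3)   # the 2-bit-word, 3x3 case proven above
```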
(In reply to Luke Kenneth Casson Leighton from comment #17)
> (In reply to Jacob Lifshay from comment #14)
> > i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits
> > (e.g. a flag that repurposes some bits somewhere), then we can use them like
> > so:
> >     000 normal matrix mode
> >     001 x
> >     010 y
> >     011 x + y
> >     100 z
> >     101 x + z
> >     110 y + z
> >     111 x + y + z
>
> *deep breath*... i love it!

we can take all the combinations that duplicate something already
encodable in matrix mode and just declare them reserved, so at least 000
and 001. icr if matrix mode can produce a sequence like
000011112222333300001111... which would cover both 010 and 100 too.
(In reply to Jacob Lifshay from comment #19)
> we can take all the combinations that duplicate something already encodable
> in matrix mode and just declare them reserved, so at least 000 and 001, icr
> if matrix mode can produce a sequence like 000011112222333300001111... which
> would cover both 010 and 100 too.

if all 4 options with zero or one axis selected are redundant, then we
can save a bit and the encoding can be as follows:

    00 x + y
    01 x + z
    10 y + z
    11 x + y + z

basically we remove the bit for z and set it to 1 if x or y are set,
otherwise set z to 0 and x and y to 1.
(In reply to Jacob Lifshay from comment #19)
> we can take all the combinations that duplicate something already encodable
> in matrix mode

ah no. not a good idea. the encoding is complex enough as it is. the
point is that this is a RISC encoding not a CISC encoding. it's right at
the critical juncture between decode and issue. hence why i said *do
not* take out the reversing on reduce/sum, because it matches with other
inversion, and you actually have to have *extra gates* to *EXCLUDE* the
"i don't see why anyone would want this" logic that caused you (without
consultation) to remove the reverse-direction logic from
remap_preduce.py (and still not yet add it back, as i've reminded you
needs to be done, many times since)

> and just declare them reserved, so at least 000 and 001, icr
> if matrix mode can produce a sequence like 000011112222333300001111...

of course! that's absolutely critical. it's a degenerate case of an X by
Y matrix Schedule-walk, where Y=1. look at remapyield.py - see "skip"
mode. the above sequence is produced by skip=0b110,x=4,y=1,VL=20 (larger
VL for the dots)
(In reply to Jacob Lifshay from comment #20)
> then we can save a bit

and complexify the encoding, which is just not happening at a critical
juncture. top priority lesson to observe here, jacob: think multi-issue
and how long the gate chain is on complex decode, right where this is
deciding register reservation/allocation.
(In reply to Jacob Lifshay from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #16)
> > oo hang on, i have my head round this now :) so i have a question:
> > at the cross-over point to move to the next row (the outer loop)
> > is it *guaranteed* that ca=0 such that it will not interfere with
> > the new row?
>
> yes afaict. I did an exhaustive check for word sizes of 2 bits and 3-word by
> 3-word multiplication (code below), so this formally proves that case. I
> have no reason to think that doesn't extend to more words and/or bigger
> words, so it should work.

i know that the multiply itself, even with add-64-bit, cannot exceed the
boundary of 128. and therefore we know also that an N-mul
(N=256,512,etc) cannot exceed the boundary of N*2 bits, and therefore i
believe it reasonable to assume that an Nx64 (N mod 64 == 0) multiply
would also not exceed the boundary of... err... N+1? which means that
the last digit (N+1) of that add is never going to carry.

therefore, logically, if NxM (N%64==0 and M%64==0) is constructed from a
sequence of M//64 adds, each of length N+1 but shifted over (long-mul
style), then the last in each of those N+1 things is never going to
involve a carry... but there *is* a mul-and-add involved, and i recall
that even a mul-and-add cannot produce a carry:

* 0xffff * 0xffff = 0xfffe0001
* 0xffff * 0xffff + 0xffff = 0xffff0000
* 0xffff * 0xffff + 0xffff + 1 = 0xffff0001

so even if you have mul-and-add-*and-carry* it still never exceeds 2x
the bits. because if you do (0xffff+1) * (0xffff+1) it is
a^2 + 2ab + b^2, and compared to 0xffff*0xffff, if you lay it out as a
2D square (a on x-axis, b on y-axis) only the bottom-left quadrant is
full but the other 3 quadrants (representing the ab + ab + b^2) are
entirely empty:

    a*b=0x0000ffff  b*b=1
    a^2=0xfffe0001  b*a=0x0000ffff

    vs

    ........    ........
    0xfffe0001  ........
* b=1, therefore b^2 = 1
* a=0xffff, therefore for a^2 you have 0xfffe0001

but you have to have *two* lots of ab (0xffff) to reach 0x10000*0x10000,
which means that as long as you do not try to do mul-add with 2 x 0xffff
you are *guaranteed* never to carry-over on a madd operation. with
maddedu only doing RC=0xffff, and not even taking a CA flag, we're good.
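the bound argument above can be stated directly in numbers: with n-bit
words, (2^n-1)^2 + (2^n-1) = (2^n-1)*2^n, which still fits in 2n bits;
only a *second* max-word addend (which maddedu has no input for, and
there is no CA in) would reach exactly 2^(2n):

```python
n = 64
m = (1 << n) - 1                  # 0xffff_ffff_ffff_ffff
# worst case for maddedu: RA*RB+RC with all three inputs at maximum
assert m * m + m == m << n        # high word = m, low word = 0
assert m * m + m < 1 << (2 * n)   # fits in 2n bits: no carry-out
# adding one *more* max word (impossible for maddedu) hits 2^(2n)
assert m * m + m + m + 1 == 1 << (2 * n)
```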
commit 1ece0cc3abbea9b39df6b58e4bb07b158f556caf (HEAD -> bug_1155_bigmul_remap)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sat Dec 23 17:38:59 2023 +0000

    bug 1155: add new idea by jacob to do x+y+z mode for bigmul remap

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=1ece0cc3a

took a look, started from a copy of remapyield.py, and i believe it can
be slotted in very simply with minimal disruption to remapyield.py
instead. see how it goes.
(In reply to Jacob Lifshay from comment #20)
> if all 4 options with zero or one axis selected are redundant, then we can
> save a bit and the encoding can be as follows:
>
>     00 x + y
>     01 x + z
>     10 y + z
>     11 x + y + z
>
> basically we remove the bit for z and set it to 1 if x or y are set,
> otherwise set z to 0 and x and y to 1.

damnit damnit, just realised that actually for the *matrix* style
(2D/3D) schedule, summing is just as useful *even of triangles*, plus
actually producing *triangle* schedules (without summing) is also
useful. whiiich... takes up the entirety of "permute=0b110/111"... arg
arg...

ok so what i am currently thinking is:

* mode=0b00 is current "Matrix" mode
* mode=0b11 is *still matrix schedule* but is "summation" (x+y+z)

BUT... under *sub*-modes it can go into "Triangular" (Pascal's
Triangle). where with mode=0b00 you would get:

    0  4  8 12
    1  5  9 13
    2  6 10 14
    3  7 11 15

instead you would get:

    0  4  8 12
       5  9 13
         10 14
            15

(or maybe)

    0
    1  5
    2  6 10
    3  7 11 15

so N*(N+1)/2 numbers instead of N*N, but they are *still* in NxN form.
then within this, when x=1 or y=1 or z=1, this triggers the
"triangular".

thoughts?
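a sketch of the proposed triangular walk (the function and its `upper`
flag are illustrative only), keeping one triangle of the column-major
schedule above, where element (row y, col x) has index x*n + y:

```python
def triangle_iter(n, upper=True):
    # walk an n x n column-major matrix schedule (index = x*n + y)
    # but keep only one triangle, yielding n*(n+1)/2 of the n*n
    # indexes -- matching the two layouts drawn above
    for y in range(n):
        for x in range(n):
            if (x >= y) if upper else (x <= y):
                yield x * n + y
```

for n=4 the `upper` form reproduces the first triangle
(0 4 8 12 / 5 9 13 / 10 14 / 15) and the other form the "(or maybe)"
one.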
(In reply to Jacob Lifshay from comment #20)
> if all 4 options with zero or one axis selected are redundant, then we can
> save a bit and the encoding can be as follows:
>
>     00 x + y
>     01 x + z
>     10 y + z
>     11 x + y + z
>
> basically we remove the bit for z and set it to 1 if x or y are set,
> otherwise set z to 0 and x and y to 1.

this takes 1 nor gate, 1 not gate, and 2 or gates and saves 1 encoding
bit -- I think that's a good tradeoff. as a netlist (xi/yi being the two
encoding bits, xi the LSB):

    t = NOR(xi, yi)
    z = NOT(t)
    x = OR(xi, t)
    y = OR(yi, t)

(In reply to Luke Kenneth Casson Leighton from comment #22)
> and complexify the encoding

i don't think it's enough to matter here, since it can be expanded in
parallel with other parts such as checking the PO, thereby being
unlikely to cause extra latency, which is the part that matters -- total
gate count is basically unaffected and is secondary anyway.

(In reply to Jacob Lifshay from comment #19)
> we can take all the combinations that duplicate

I mostly meant just the 3 add-selection bits. i agree finding all
possible combinations is excessive, however I think not removing any
redundancy when it's easy to save a bit is excessive in the other
direction.
(In reply to Luke Kenneth Casson Leighton from comment #25)
> thoughts?

triangles are cool, needs further thought. I'm thinking we'd want a way
to do new_x = (x + y) % x_sz for diagonal matrix stuff, e.g.:

original matrix indexes:

    0 1 2 3
    4 5 6 7
    8 9 A B

new indexes:

    0 1 2 3  -- rotated right 0
    5 6 7 4  -- rotated right 1
    A B 8 9  -- rotated right 2
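a standalone sketch of that diagonal rotation (function name
illustrative): each row y is rotated right by y positions, exactly the
new_x = (x + y) % x_sz rule above.

```python
def rotate_rows_iter(x_sz, y_sz):
    # new_x = (x + y) % x_sz: row y of the matrix is rotated right
    # by y positions
    for y in range(y_sz):
        for x in range(x_sz):
            yield y * x_sz + (x + y) % x_sz
```

for a 4x3 matrix this yields 0,1,2,3 then 5,6,7,4 then 10,11,8,9 - the
0/5 6 7 4/A B 8 9 example, with A=10 and B=11 in decimal.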
(In reply to Jacob Lifshay from comment #27)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > thoughts?
>
> triangles are cool, needs further thought. I'm thinking we'd want a way to
> do new_x = (x + y) % x_sz for diagonal matrix stuff:

yep. see comment #0, 3rd option, except it was just going to be sum
(x+y).

> new indexes:
> 0 1 2 3 -- rotated right 0
> 5 6 7 4 -- rotated right 1
> A B 8 9 -- rotated right 2

ok, *matrix* rotated, iinteresting, like it.

ok so there are also some AV patterns to cover, which are:

      01234567          0123456789a
      --------          -----------
    0|3456789a        0|   01234567
    1|3456789a        1|   01234567
    2|23456789        2|  01234567
    3|23456789   ->   3|  01234567
    4|12345678        4| 01234567
    5|12345678        5| 01234567
    6|01234567        6|01234567
    7|01234567        7|01234567

which is a "skip=0110" mode with x=7,y=7,z=2, which gives the (twice)
repeated row. i.e. where it was *believed* that skip could be dropped,
actually it can't. hm.

|0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------ | ------- | ------- |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz | submode2|sk1/invxy|offset|submode|0b11 |bigmul |

5 bits for submodes. sk1 to give that repetition. invxy for mirroring
(like Indexed Mode, which has mirroring). let's see what is possible:

* submode[0]: standard matrix (x*xdimsz + y) or bigmul (x+y)
* submode[1]: square or triangle
* submode2[0]:
    0b0 modulo then add offset
    0b1 add offset then modulo
* submode2[1:2] triangle:
    0b00 x+y
    0b01 y+z
    0b10 z+x
    0b11 RSVD
* submode2[1:2] square:
    0b00 x,y,z
    0b01 y,x,z
    0b10 z,x,y
    0b11 z,y,x

i am reasonably certain there are better uses for the last 2 bits of
submode2.

offset is signed. modulo puts the bigger numbers out first.
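the left-hand AV pattern above can be sketched as follows (av_pattern
and its parameters are a hypothetical illustration; `repeat` models the
z=2 row-doubling that skip provides):

```python
def av_pattern(xdim, repeat, nrows):
    # hypothetical sketch of the left-hand AV schedule: each shift
    # value is held for `repeat` consecutive rows (the z repetition),
    # and the xdim-wide window slides down to 0 for the last rows
    for y in range(nrows):
        shift = (nrows - 1 - y) // repeat
        for x in range(xdim):
            yield shift + x
```

with xdim=8, repeat=2, nrows=8 this gives rows 3..10 (i.e. 3456789a in
hex) twice, then 2..9 twice, down to 0..7 twice.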
(In reply to Luke Kenneth Casson Leighton from comment #28)
> offset is signed.

no hang on: offset need not be signed, it just needs to be less than the modulo. so can be unsigned.
(In reply to Luke Kenneth Casson Leighton from comment #28)

idea (bear in mind modulo on everything, so -ve nums will wrap)

* submode2[1] triangle 0b0 x 0b1 -x
* submode2[2] triangle 0b0 y 0b1 -y
* submode2[1] square 0b0 x,y 0b1 y,x
* submode2[2] square 0b0 modulo shift on rows, 0b1 modulo shift on cols

and the modulo is VL (not MAXVL) which can be an extra parameter.

reminder:

* submode2[0] 0b00 modulo then add offset 0b01 add offset then modulo

so modulo (0..VL-1) plus offset has to be less than MAXVL which is fine.
this is doing my head in :) will edit shortly

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=745c7d22

|0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz | permute | invxyz  |offset|skip   |0b00 |Matrix |
|xdimsz|ydimsz|SVGPR   | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
|xdimsz|mode  | zdimsz | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
| rsvd |rsvd  |xdimsz  | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
|xdimsz|ydimsz| zdimsz |rs/nx/ny |sk1/invxy|offset| 0b00  |0b11 |ModS/Sq|
|xdimsz|ydimsz| zdimsz |rs/pr/col|sk1/invxy|offset| 0b01  |0b11 |ModM/Sq|
|xdimsz|ydimsz| zdimsz |rs/nx/ny |sk1/invxy|offset| 0b10  |0b11 |ModS/Tr|
|xdimsz|ydimsz| zdimsz |rs/pr/col|sk1/invxy|offset| 0b11  |0b11 |ModM/Tr|

key:

* ModS - summation with modulo
* ModM - matrix with modulo
* Sq - Square
* Tr - Triangle
* rs reserved (must be zero)
* pr permute x/y y/x
* col column / row
* sk1 skip X
* invxy invert x, invert y

sk1/invxy is the same as Indexed, exactly, so is no extra gates. the priority is X/Y however Z is also possible, *and* skip allows for repeating (see comment #28)
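a python "yield" sketch of the ModS (summation with modulo) rows of the table above. hedged: this is my reading of the intended schedule, not the actual remapyield.py code, and the sk1 handling here is a guess at how the repeated-row behaviour from comment #28 falls out:

```python
def iterate_mod_sum(xdim, ydim, modulo, offset=0, sk1=False):
    """Sketch of ModS: remapped index is (x + y) % modulo + offset.
    With sk1 the X contribution is skipped, so every element of a
    row yields the same index (the repeated-row effect)."""
    for y in range(ydim):
        for x in range(xdim):
            xx = 0 if sk1 else x
            yield ((xx + y) % modulo) + offset
```

with xdim=3, ydim=2, modulo=3 this yields 0 1 2 / 1 2 0, i.e. each row is the previous one rotated, which is the rhombus-style diagonal pattern from comment #0.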
(In reply to Luke Kenneth Casson Leighton from comment #31)
> this is doing my head in :) will edit shortly
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/remapyield.py;hb=745c7d22
>
> |0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz | permute | invxyz  |offset|skip   |mode |Matrix |

mode should be 0b00 here...
(In reply to Jacob Lifshay from comment #32) > mode should be 0b00 here... good catch, next time just edit it? thoughts on the capabilities? anything more complex there is always Indexed.
(In reply to Luke Kenneth Casson Leighton from comment #33)
> thoughts on the capabilities?

It looks pretty good... I'm too tired at the moment for any further deep thoughts, sorry. However, I did notice that you seem to have 2 different places to select between triangle/square -- both bits 28 and 29... that seems redundant to me:

(In reply to Luke Kenneth Casson Leighton from comment #31)
> |0:5   |6:11  | 12:17  | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz | nx / ny |sk1/invxy|offset|trngl/0|0b11 |ModuloT|
> |xdimsz|ydimsz| zdimsz |rs/pr/col|sk1/invxy|offset|trngl/1|0b11 |ModuloM|
some additional thoughts on the multiply algorithm from comment #10:

we'll want to have an additional instruction afterward that clears the temporary values, since this allows the hardware to not have to compute them, which may be much more complex than just computing the product directly, especially if the hardware uses a different multiplication algorithm internally.

(In reply to Jacob Lifshay from comment #10)
> loop:
>     sv.maddedu *y, *a, *b, *y  # RS is stored in scalar t
>     sv.adde *(y + 1), *(y + 1), t
>     svstep.
>     sv.bdnz ... loop

basically, we want to clear `t` and `CA32` afterward so the hardware can pattern-match the whole thing and use whatever multiplication algorithm it likes, without having to compute the correct `t`/`CA32` values, which would basically require an extra multiplier just to get the right values. this gives hardware designers the freedom to use whatever multiplication hardware is the most efficient or fastest, without being forced to produce values for `t` and `CA32` that are just ignored anyway.

we don't need to clear CA since it is known to be zero (but the hardware needs to be able to recognize that CA is zero on entry, or it could just speculate that CA is zero and if it isn't the loop can be restarted and run in scalar mode)

We can just use subfc t, t, t to clear CA, CA32 and t in one simple instruction.
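to make the loop above concrete, here is a scalar Python model of the two key operations on 64-bit limbs. caveat: these are my reading of the maddedu/adde semantics (low/high split, carry-out), not authoritative spec pseudocode:

```python
MASK64 = (1 << 64) - 1

def maddedu(ra, rb, rc):
    """Model of maddedu: full 64x64+64 -> 128-bit multiply-add.
    Returns (RT, RS): RT is the low 64 bits, RS the high 64 bits
    (the scalar 't' carried between iterations in the loop above)."""
    v = ra * rb + rc
    return v & MASK64, v >> 64

def adde(ra, rb, ca):
    """Model of adde: RT = RA + RB + CA, returning (RT, new CA)."""
    v = ra + rb + ca
    return v & MASK64, v >> 64
```

note that maddedu cannot overflow: (2^64-1)*(2^64-1) + (2^64-1) still fits in 128 bits, which is what makes the carry-chaining in the loop work.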
so, should I start implementing the current design? if it's not ready yet I think I'll work on the already established parts of the bigint presentation while I'm waiting.
(In reply to Jacob Lifshay from comment #36)
> so, should I start implementing the current design?

i appreciate it's holidays (i've spent 40% of the past week asleep!) gentle but very firm reminder that you are to read and re-read the bug comments *online*, and when you do so you will observe things that you missed when tired, such as comment #31 *already implementing* the modulo REMAP Schedule.

> if it's not ready yet I
> think I'll work on the already established parts of the bigint presentation
> while I'm waiting.

good idea. the next phase here is to design svshape3 to take the largest portion of possible *most useful* options that will fit into the limited space. until that's done (the assessment not the coding)... basically this can't be rushed.

the number of modes means that svshape3 will need to be more like svindex

810 # svindex instruction <a name="svindex"> </a>
811
812 SVI-Form
813
814 | 0:5|6:10 |11:15 |16:20 | 21:25       | 26:31 | Form     |
815 | -- | --  | ---  | ---- | ----------- | ------| -------- |
816 | PO | SVG | rmm  | SVd  | ew/yx/mm/sk | XO    | SVI-Form |
817
818 * svindex SVG,rmm,SVd,ew,SVyx,mm,sk

something like:

814 | 0:5|6:10 |11:15 |16:20 | 21:25       | 26:31 | Form      |
815 | -- | --  | ---  | ---- | ----------- | ------| --------  |
816 | PO | x   | y    | z    | XX/yx/XX/sk | XO    |TODO-Form  |

where the two XXs give some options. although i would be tempted to cut down z to range 1-4 and have more bits for modes

814 | 0:5|6:10 |11:15 |16:20 | 21:25       | 26:31 | Form      |
815 | -- | --  | ---  | ---- | ----------- | ------| --------  |
816 | PO | x   | y    | z/XXX| XX/yx/XX/sk | XO    |TODO-Form  |

the modes included from submode and submode2 are EIGHT bits and that's excluding the offset (!) so it may be necessary to also design svshape4 *as well* to cover an offset-priority version just as svshape2 is a 2D offset-priority version of the 3D svshape instruction. daaaamn.
svshape4 (daaamn combined with svshape3 that's a lot of opcode space)

modulo with offset priority

945 | 0:5|6:9 |10|11:15 |16:20 | 21:24 | 25 | 26:31 | Form      |
946 | -- |----|--| ---  | ---- | ------| -- | ------| --------  |
947 | PO |offs|yx| XXXX | SVd  | YYYY  | sk | XO    | SVM2-Form |
948
949 * svshape4 offs,yx,XXXX,SVd,sk,YYYY

by using MAXVL to indirectly compute one of the dimensions it is possible although inconvenient to have quite a few bits spare and cover all dimensions. perhaps even enough to set z=1-4 as well... yes. maybe even some spare.

svshape3:

814 | 0:5|6:10 |11:15 |16:20 | 21:25       | 26:31 | Form      |
815 | -- | --  | ---  | ---- | ----------- | ------| --------  |
816 | PO | x   | y    | z/XXX| XX/yx/XX/sk | XO    |TODO-Form  |

svshape4:

945 | 0:5|6:9 |10|11:15 |16:20 | 21:24 | 25 | 26:31 | Form      |
946 | -- |----|--| ---  | ---- | ------| -- | ------| --------  |
947 | PO |offs|yx| XXXX | SVd  | YYYY  | sk | XO    | SVM2-Form |

fitting as closely with svshape2 also saves decoder gates. hmmm... really, SVd needs to move to bits 6:10 across svshape2, svshape3 and svshape4. damn damn damn. i really don't like making changes like this, but it is what it is.
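the "use MAXVL to indirectly compute one of the dimensions" idea can be sketched in a couple of lines (hypothetical helper, and it assumes MAXVL is set to an exact multiple of the two explicitly-encoded dimensions):

```python
def derive_zdim(maxvl, xdim, ydim):
    """With x and y encoded in the instruction and MAXVL already
    set up, the third dimension need not be encoded at all:
    z = MAXVL / (x * y)."""
    assert maxvl % (xdim * ydim) == 0, "MAXVL must be a multiple of x*y"
    return maxvl // (xdim * ydim)
```

e.g. with MAXVL=24 and a 3x4 shape the z dimension comes out as 2, freeing the z field's bits for modes, as suggested above.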
(In reply to Jacob Lifshay from comment #35)
> some additional thoughts on the multiply algorithm from comment #10:
>
> we'll want to have an additional instruction afterward that clears the
> temporary values, since this allows the hardware to not have to compute
> them, which may be much more complex than just computing the product
> directly, especially if the hardware uses a different multiplication
> algorithm internally.

(see caveat first...) ok so again a gentle but firm reminder: we are effectively designing the hardware-equivalent of a Universal Software API. how and what decisions hardware implementors make is *not our problem* as designers of that *Universal* Software API.

illustration: if software developers start measuring SV "performance" on a processor-by-processor basis from even the same manufacturer and customise assembler based on differences just like they are forced to do with SIMD ISAs then we have CATASTROPHICALLY failed, as ISA Designers, to do our goddamn job correctly.

i therefore remind you *once again* not to make ANY assumptions at the hardware implementation level that could impact or force a particular decision at the SOFTWARE level.

now the caveat (which is a reminder of what i repeated before, many times): we *do* actually have to think through, from the implementor's perspective, if the Software API (aka SV) is actually viable. but - butbutbut - not from just ONE implementation but ***ALL*** possible implementations.

i stopped you from putting assumptions about hardware in the past, into a presentation, and am slightly peeved to have to remind you again, but please *for goodness sake* under *no circumstances* are you to write a presentation or engage with anyone on the basis of assuming, or let someone assume, that there shall and will and will ONLY BE just one and exclusively one hardware implementation...

...
or, much much worse, that ALL hardware implementations are FORCED to have, for example, a JIT macro-op Fusion Hardware Engine. you are effectively right now telling Embedded Implementors, "you MUST HAVE a Macro-Op Fusion Engine. you MUST direct this complex pattern of SIX INSTRUCTIONS at a dedicated back-end. you have NO CHOICE in this matter". yeh? we have to be so so careful here. i know you said "if", but the instructions you mention are more of a "Programmer's Note" that *some* implementations *may* have optimised Macro-Op Fusion. they will be extremely high-end ones. spotting and matching a sequence of six to seven instructions is pretty extreme.
while we're designing new svshape instructions, we should also think about register arguments (GPR and/or VL) so we can set dynamic x/y/z dimensions and/or dynamic offsets (some reasonable selection of those, in particular we need some kind of dynamic offset for bigint shifts of more than 64-bit, and dynamic x dimension would be really useful for parallel prefix-sum and parallel reduction).
(In reply to Luke Kenneth Casson Leighton from comment #39) > we *do* actually have to think through, from the implementor's persective, > if the Software API (aka SV) is actually viable. but - butbutbut - not > from just ONE implementation but ***ALL*** possible implementations. yes, which is why I think we should design it so if some implementation decides it likes some other multiplication algorithm and wants to translate, it isn't required to produce some other expensive outputs that basically no software will care about, but are practically required for correctness -- the temporaries. if the cpu has to look 3000 instructions into the future to see that the temporary is ultimately overwritten, no reasonable cpu for a very long time will be able to do that, so will be required to compute the temporary values which are quite expensive. from a SV consistency point of view, we can't just declare the temporaries as undefined (since there isn't a good reason why a simple loop over scalar multiplies should suddenly be allowed to give arbitrary random results), so the next best option is to have a cheap simple instruction to clear the temporaries after -- unless we can come up with a bigint mul algorithm that naturally overwrites all temporaries (e.g. the only things changed are the product's registers), or where the values left in the temporaries are cheap to compute. obviously if you're using a simple scalar cpu, you can omit the clear-temporaries instruction, but if you could be using a high-end cpu, imo it's worth having.
(In reply to Luke Kenneth Casson Leighton from comment #39) > you are effectively right now telling Embedded Implementors, "you > MUST HAVE a Macro-Op Fusion Engine. you MUST direct this complex > pattern of SIX INSTRUCTIONS at a dedicated back-end. you have NO CHOICE > in this matter". I never said the hardware has to, I've always just said it *can*, not *must*. also, it probably doesn't have to match all the instructions, just the 4 instructions in the loop and matches on the current SV state. actually, while thinking about this, I think we'll want an instruction that has the combined functionality of bdnz and svstep. maybe one of the branch instruction's bits can be used for that for sv.bc?
(In reply to Jacob Lifshay from comment #40) > while we're designing new svshape instructions, we should also think about > register arguments (GPR and/or VL) so we can set dynamic x/y/z dimensions i said no and it means no. please read the spec. > and/or dynamic offsets (some reasonable selection of those, in particular we > need some kind of dynamic offset for bigint shifts of more than 64-bit, and > dynamic x dimension would be really useful for parallel prefix-sum and > parallel reduction). not happening, i have said this before and it is inviolate and not to be discussed further under this bugreport as it is firmly out of scope. they can write directly to SVSHAPEs and take the consequences after reading the spec's warnings.
(In reply to Jacob Lifshay from comment #42) > I never said the hardware has to, i know, i apologise, i'm a little jumpy on this topic, things need to be spelled out explicitly (i am partly writing for future readers here). > I've always just said it *can*, not > *must*. also, it probably doesn't have to match all the instructions, just > the 4 instructions in the loop and matches on the current SV state. 4 is still far too complex. even just 2 is complex enough. good to have the discussion but please bear in mind *all* possible potential implementations have to be considered, not just one. ISA design is tedious extreme-patient work. > actually, while thinking about this, I think we'll want an instruction that > has the combined functionality of bdnz and svstep. good thought/idea: already tried it. started to break RISC paradigm. far too complex both in hardware and in terms of combined functionality. mostly it was down to number of register hazards exceeding 3-in 2-out if i recall, but also that branch analysis goes completely sideways. > maybe one of the branch > instruction's bits can be used for that for sv.bc? there are already 512 combined options which is alarming the OPF (with good reason) so no. the unit tests are going to be a massive job. good to have the discussion but can we please keep this bugreport in scope and on topic? its budget is already quite small
(In reply to Jacob Lifshay from comment #41) > software will care about, but are practically required for correctness -- > the temporaries. too complex to discuss and completely out of scope for this bugreport. the focus of this bugreport, 1155, is the svshape3 and svshape4 instructions, when we are under time pressure and a limited budget. shelve it please. > if the cpu has to look 3000 instructions into the future to see that the > temporary is ultimately overwritten, if you're looking 3,000 instructions into the future then you already have a massively-optimised CPU and are loooong into "low-hanging fruit". 5% or above you pay attention to, first. 0.33% is just not worth bothering with vs the cost. drop it please: mentally do the cost-benefit analysis ok?
(In reply to Luke Kenneth Casson Leighton from comment #45) > if you're looking 3,000 instructions into the future then you already > have a massively-optimised CPU and are loooong into "low-hanging fruit". > 5% or above you pay attention to, first. 0.33% is just not worth > bothering with vs the cost. i meant that no one's making a cpu that can look 3000 instructions ahead anytime soon, so we should consider ways to inform the cpu that those results are unused.
(In reply to Jacob Lifshay from comment #46)
> i meant that no one's making a cpu that can look 3000 instructions ahead
> anytime soon, so we should consider ways to inform the cpu that those
> results are unused.

DROP it. keep this bugreport on topic and in scope. you are making me angry by not respecting that we have to FOCUS on meeting our obligations and commitments to NLnet on a hard immovable deadline.

back to the topic for which this SMALL budget is intended: swapping svindex SVd and offset is too much, and too disruptive. it will need its own budget and to be Scheduled for an atomic change including and especially binutils.

so this means it is free and clear to do svshape3 and 4, but with temporary operand positions. can you get that done as quickly as possible?
(In reply to Luke Kenneth Casson Leighton from comment #47) > can you get that done as quickly as possible? i've started editing comment #0 to begin a rapid prototype phase on bitpositions of fields. https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=1edbb6a5ea5333e0eae3d1e9d4e0c66132c41e6c;hb=fa603a1e9f2259d86acf4e9451937a000d099289#l335
edited comment #0 again to jam svshape3 and 4 into a 7-bit XO: bits 24,26:31. when bit 24 is 0 it is svshape3, when 1 it is svshape4. the entire sv*remap set is a mess but that needs resolving separately.
jacob so sorry, i just realised that these need to be "targeted". svshape3 can have "modes suitable for bigint", but svshape4 has to be able to set 1 or more of SVSHAPE0-3

the logic behind that is like the matrix mode: you can do matrix multiply in 3 instructions (SVrm=0b0000 for that one) and it rewrites *three* SVSHAPE SPRs in one hit, to do that. but with svoffset you can set up much more tailored arrangements and i had forgotten about that.

therefore the design of svshape3 and 4 really needs to be done in conjunction with the ed25519 and/or pow-mod to get a clear picture / use-case.
reservations for svshape3&4, at a 5-bit XO:

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=e980fd1c7
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=61ef110be

5-bit XO means XO is in bits 27-31, which i have edited in comment #0 for svshape4 already, TODO svshape3.

for svshape4 i suggest something different from svoffset and svindex: just specify si=0-3 for SVSHAPE0-3. no attempt to set multiple SHAPEs like how svoffset and svindex do (it takes 6 bits to do that, also setting up SVSTATE RA/RB/RC/RT/EA which is too much to attempt with these complex Modulo REMAPs).
(In reply to Luke Kenneth Casson Leighton from comment #50) > therefore the design of svshape3 and 4 need really to be > done in conjunction with the ed25519 and/or pow-mod to > get a clear picture / use-case. ok, I'll work on trying to fit them into the bigint mul and powmod algorithms soon, probably tomorrow.
(In reply to Jacob Lifshay from comment #52)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > therefore the design of svshape3 and 4 need really to be
> > done in conjunction with the ed25519 and/or pow-mod to
> > get a clear picture / use-case.
>
> ok, I'll work on trying to fit them into the bigint mul and powmod
> algorithms soon, probably tomorrow.

awesome. i recommend doing individual unit tests first (test_caller_svp64_bigmul.py or test_caller_svp64_svshape.py) working up to that vertical-first loop idea. keep it to a real short loop of say 2x2 or 3x2 64-bit otherwise it takes too long to run. we need *small* unit tests, running for *no more than 30 seconds total*, first. these act as "demos" and end up with significant comments (see strncpy in ldst test for what i am looking for)

write and use svshape4 initially as that's the one that can have (needs) "explicit" control (targeting individual SVSHAPEs but also including an offset).

remapyield.py does now have modulo, triangle, and rhombus capability, you'll need to tie it in to ISACaller just like the other remaps. set mode=0b11 rather than mode=0b00

once we know how many svshape4s are needed, plus an svremap instruction, *then* we have enough proper working knowledge to design svshape3.

to give you some idea, this is the schedule for ed25519
https://bugs.libre-soc.org/show_bug.cgi?id=773#c3
(In reply to Jacob Lifshay from comment #52)
> ok, I'll work on trying to fit them into the bigint mul and powmod
> algorithms soon, probably tomorrow.

I started getting ready to work on them, but ran into a bug that looks like it blocks most work due to making SVSHAPE unusable for simulations afaict:

random test case that fails (I expect there to be lots):

pytest -v src/openpower/decoder/isa/test_caller_svp64_dct.py::DCTTestCase::test_sv_remap_fpmadds_idct_outer_8

src/openpower/decoder/isa/svshape.py:147: in skip
    if self.is_indexed() or self.is_triangle():
src/openpower/decoder/isa/svshape.py:54: in is_triangle
    return self.mode == 0b11 and self.skip in [0b00, 0b01]
src/openpower/decoder/isa/svshape.py:147: in skip
    if self.is_indexed() or self.is_triangle():
E   RecursionError: maximum recursion depth exceeded while calling a Python object
!!! Recursion detected (same locals & position)

it isn't obvious if is_triangle should be accessing:

    return self.fsi['skip'].asint(msb0=True)

or:

    inv = self.fsi['invxyz'].asint(msb0=True)
    return (inv & 0b100) >> 2

so I'm leaving it up to you to fix. I'll work on the algorithms next time I get a chance (probably not friday as I've been waking up too late so may not have time before sunset, so maybe on sunday?)
(In reply to Jacob Lifshay from comment #54) > so I'm leaving it up to you to fix. I'll work on the algorithms next time I > get a chance (probably not friday as I've been waking up too late so may not > have time before sunset, so maybe on sunday?) Luke, if you don't get around to it, I'll add a commit disabling is_triangle (with a FIXME comment) so SVSTATE works enough to run the simulator. You can then revert immediately before fixing it, but please don't push a revert until you have the commits fixing it (or just push to a different branch).
(In reply to Jacob Lifshay from comment #55) > then revert immediately before fixing it, but please don't push a revert > until you have the commits fixing it (or just push to a different branch). not a problem, i was experimenting.
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Jacob Lifshay from comment #55)
> > then revert immediately before fixing it, but please don't push a revert
> > until you have the commits fixing it (or just push to a different branch).
>
> not a problem, i was experimenting.

pushed:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=dd21d8be43d9aff018b366d9891e6b19f2bb0a5e

commit dd21d8be43d9aff018b366d9891e6b19f2bb0a5e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Jan 9 00:50:28 2024 -0800

    svshape.py: disable is_triangle for now since it causes RecursionError
added WIP version of mul remap algorithm, it still needs proper SVSHAPE values filled in (I'm using mtspr for now till we have working svshape[34]) and a few other misc. things.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=7350d55815e5b99e8a3b997840c498df73af7648

commit 7350d55815e5b99e8a3b997840c498df73af7648
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Jan 10 01:16:52 2024 -0800

    openpower/test/bigint/mul_remap.py: add WIP mul_remap.py
(In reply to Jacob Lifshay from comment #58)
> added WIP version of mul remap algorithm, it still needs proper SVSHAPE
> values filled in (I'm using mtspr for now

that's a smart idea, should work fine as long as "modes" are properly respected by remapyield.py ... they should be... you added... ah no, here, in svshape.py:

    def get_iterator(self):
        log ("SVSHAPE get_iterator", self.mode, self.ydimsz,
             self.is_indexed())
        if self.mode == 0b00:
            iterate_fn = iterate_indices

should be "if self.mode in [0b00, 0b11]" and that should do the trick

yes, svshape3 instruction can come later, semi-designed in comment #0 see "svshape3 TODO"
(In reply to Luke Kenneth Casson Leighton from bug #1157 comment #52) > jacob how are you getting on with svshape3? bug #1155 comment #0 i think i ran out of time, so i'm planning on working on the presentation, i'll get back to this after. sadoon can use svindex for now.
(In reply to Jacob Lifshay from comment #60) > i think i ran out of time, so i'm planning on working on the presentation, > i'll get back to this after. sadoon can use svindex for now. ok. sensible decision. i *may* take care of it. if you get the pr. done first let me know?
(In reply to Luke Kenneth Casson Leighton from comment #61) > ok. sensible decision. i *may* take care of it. if you get the pr. > done first let me know? sounds good.
ok, I started working on the algorithm and quickly ran into the issue that we need to implement EXTRA2_MODE support in all of the simulator and assembler/disassembler and insndb... this is because sv.maddedu needs to put RS in RT + MAXVL, which isn't what happens currently: currently we just hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary subset of instructions.

We can't just change the simulator to hardcode EXTRA2_MODE = 0 for maddedu, since all the existing sv.maddedu operations in our tests depend on it behaving like EXTRA2_MODE = 1.

    # use SVSHAPE0 for RA, SVSHAPE1 for RB,
    # SVSHAPE2 for RC/RT, SVSHAPE3 for RS
    "svremap 0o37, 0, 1, 2, 2, 3, 0",
    # RS is scalar by using constant remap
    "sv.maddedu *4, *32, *36, *4",  # FIXME: need to set EXTRA2_MODE to 0

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/mul_remap.py;hb=9d45047d7e7bc8678cd3ea30e1e883249f1fa344
(In reply to Jacob Lifshay from comment #63)
> assembler/disassembler and insndb...this is because sv.maddedu needs to put
> RS in RT + MAXVL, which isn't what happens currently, currently we just
> hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary
> subset of instructions.

rats. which... yep i was avoiding that as it's quite a lot of small changes in lots of areas. including binutils.

> We can't just change the simulator to hardcode EXTRA2_MODE = 0 for maddedu,
> since all the existing sv.maddedu operations in our tests depend on it
> behaving like EXTRA2_MODE = 1.

yyep. sigh. an alternative temp workaround is to add a maddedu2. this would be a hell of a lot less work.
(In reply to Luke Kenneth Casson Leighton from comment #64)
> (In reply to Jacob Lifshay from comment #63)
> > assembler/disassembler and insndb...this is because sv.maddedu needs to put
> > RS in RT + MAXVL, which isn't what happens currently, currently we just
> > hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary
> > subset of instructions.
>
> rats. which... yep i was avoiding that as it's quite a lot of small
> changes in lots of areas. including binutils.

oh, wait, I just thought of an easy workaround: just set SVSHAPE3's offset to -1, which puts RS in RT - 1. this works since SVSHAPE3 would have otherwise been set to always give zero as the remapped index.
(In reply to Jacob Lifshay from comment #65)
> just set SVSHAPE3's offset to -1, which puts RS in RT - 1,

i love obscure tricks like this, esp. when i don't understand them :)
i made a quick start on svshape4 this evening but ran out of time, will work on it tomorrow. basically svshape3 will be targeted at setting up multiple shapes to do a specific (commonly-used) job but svshape4 is more general
hiya jacob just added svshape4, untested. added SVI2-Form, fields, and the pseudocode. should be usable. different from other sv* as it specifies (si) which SVSHAPE{si} to use. ran out of bits to do anything sophisticated basically :)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=116a6cfd3
(In reply to Luke Kenneth Casson Leighton from comment #68)
> hiya jacob just added svshape4, untested,

looks nice, I'll have to try it out! I'm planning on working on this some more on monday.
(In reply to Jacob Lifshay from comment #69)
> (In reply to Luke Kenneth Casson Leighton from comment #68)
> > hiya jacob just added svshape4, untested,
>
> looks nice, I'll have to try it out!

hell of a lot simpler than the other ones, it is just copying bits from operands to SPR, one operand says which SPR to target. feel free to modify/bugfix it indiscriminately :)
I added a skip_case_slow decorator that skips unless the RUN_SLOW env var is set and applied it to the long powmod test. running the whole test suite with RUN_SLOW unset now takes 3min41s on my desktop. I made it so CI runs with RUN_SLOW set, so it should still test everything.

Motivation: https://bugs.libre-soc.org/show_bug.cgi?id=676#c37

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=4873bfd82621172db7753ed98c56e3308f7d303c

commit 4873bfd82621172db7753ed98c56e3308f7d303c (HEAD -> bug_1155_bigmul_remap, origin/bug_1155_bigmul_remap)
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:35:03 2024 -0800

    powmod: mark case_powmod_256 with skip_case_slow

    running full test suite with RUN_SLOW unset now takes 3min41s
    on my desktop -- substantially faster.

commit 10703bbd41ad4d1f5e527101d0a5746d7e851e7b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:34:39 2024 -0800

    .gitlab-ci.yml: always set RUN_SLOW for CI

commit adfef6ed6a57f5214eeaa55c055b4eb78d2826c2
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:33:16 2024 -0800

    test/common: add skip_case_slow that skips unless RUN_SLOW env var is set
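for reference, the decorator's behaviour can be modelled in a few lines. this is a simplified sketch only: the actual implementation in test/common hooks into unittest's skip machinery rather than silently returning None:

```python
import functools
import os

def skip_case_slow(f):
    """Sketch: run the wrapped test case only when the RUN_SLOW
    environment variable is set; otherwise treat it as skipped."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if not os.environ.get("RUN_SLOW"):
            return None  # skipped
        return f(*args, **kwargs)
    return wrapper
```

checking the env var at call time (not decoration time) means the same decorated test responds to RUN_SLOW being set later, e.g. by the CI job.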
(In reply to Luke Kenneth Casson Leighton from comment #66)
> (In reply to Jacob Lifshay from comment #65)
> > just set SVSHAPE3's offset to -1, which puts RS in RT - 1,
>
> i love obscure tricks like this, esp. when i don't understand them :)

well, perfect: I said we don't need signed offset, but turns out we still do.

Me saying we don't need signed offsets:
https://bugs.libre-soc.org/show_bug.cgi?id=911#c5

switching to 4-bit signed offsets (instead of unsigned) should be doable, since most applications of offsets can just reverse the trick mentioned in bug #911 comment #5. here, we *do* need signed offsets since the base register is RC and we're offsetting RS, so we can't change the base register. if we don't have signed offsets that limits the maximum product size to 14 registers, since we have to put the RS result outside of the product and the max offset we can set is 15.

to be honest, I'm thinking it would be better to just use the obvious algorithm from comment #0 (reproduced below in case comment #0 is edited) instead of trying to cram it all in one vertical-first loop, since that VF-loop has very annoying dependency chains through CA, and because the algorithm from comment #0 also lets us have a bigint mul-add (just don't clear product, voila, an additional add input!), saving even more time. Also, the algorithm from comment #0 lets us use the exact same pattern (the bi loop) as a scalar-vector mul using horizontal-first sv.maddedu, likely saving CPU complexity too.

Algorithm from comment #0 (with the loop bounds corrected to b_sz):

    void mul(
        uint64_t *product,  // must have a_sz + b_sz limbs
        const uint64_t *a,
        const uint64_t *b,
        size_t a_sz,
        size_t b_sz
    ) {
        for(size_t bi = 0; bi < b_sz; bi++) {
            product[bi] = 0;
        }
        for(size_t ai = 0; ai < a_sz; ai++) {
            uint64_t carry = 0;
            for(size_t bi = 0; bi < b_sz; bi++) {
                uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
                v += (uint128_t)carry;
                v += (uint128_t)product[ai + bi];
                carry = v >> 64;
                product[ai + bi] = (uint64_t)v;
            }
            product[ai + b_sz] = carry;
        }
    }
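the mul-add observation is easy to check with a direct Python translation of that algorithm (64-bit limbs, least-significant first; `bigint_mul_add` is an illustrative name). one caveat worth noting: because `product[ai + b_sz]` is overwritten with the carry rather than accumulated, the "don't clear product" trick is only a correct mul-add when the addend fits in the low len(b) limbs of product:

```python
MASK64 = (1 << 64) - 1

def bigint_mul_add(product, a, b):
    """The comment #0 algorithm minus the zeroing loop: whatever is
    already in product[] is accumulated in, giving a fused mul-add.
    product must have at least len(a) + len(b) limbs, and the addend
    must occupy only its low len(b) limbs (higher limbs zero)."""
    for ai, av in enumerate(a):
        carry = 0
        for bi, bv in enumerate(b):
            v = av * bv + carry + product[ai + bi]
            carry = v >> 64
            product[ai + bi] = v & MASK64
        product[ai + len(b)] = carry
    return product
```

starting from an all-zero product this is a plain multiply, matching the C version exactly; starting from a nonzero product it computes product += a * b.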
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0bb801c11404225735c7f25b1ecc6ea70db349e8

I'll switch to using svshape4 once we know the SVSHAPE values work... eliminating extra things that could break.

commit 0bb801c11404225735c7f25b1ecc6ea70db349e8
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 21:34:23 2024 -0800

    mul_remap: WIP -- fill in SVSHAPE[0-2], SVSHAPE3 needs work still

    see:
    https://bugs.libre-soc.org/show_bug.cgi?id=1155#c72

commit d8138c79ff5b746f14a5188f46b2e8d79ec00701
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 21:34:03 2024 -0800

    remapyield: add some more demos
(In reply to Jacob Lifshay from comment #72)
> well, perfect: I said we didn't need signed offsets, but it turns out we
> still do.

joy. luckily the whole purpose of the NLnet grants is to find these
things out :)

hmm hmm, find another bit.
https://libre-soc.org/openpower/sv/draft_opcode_tables/

i know: move svshape3 to draft/temp-PO5, then use a 4-bit XO for svshape4,
something like:

| 0:5|6:9|10  |11:15 |16:20 | 21 22:23 24:25 |26:27| 28:31 | Form      |
| -- |---|----| ---  | ---- | -------------- | --- | ----- | --------- |
| PO |SVo|SVyx| y    | x    | sk /sm2 / sm   | si  | XO    | TODO-Form |

where si and SVyx have swapped places compared to what i did previously.
(In reply to Luke Kenneth Casson Leighton from comment #74)
> (In reply to Jacob Lifshay from comment #72)
>
> > well, perfect: I said we didn't need signed offsets, but it turns out we
> > still do.
>
> joy. luckily the whole purpose of the NLnet grants is to find these
> things out :)
>
> hmm hmm, find another bit.

we don't need another bit for now: a 4-bit signed offset is sufficient for
this. though having more bits would definitely be useful when trying to
set the offset dynamically, e.g. for bigint shifts by more than 64 bits.
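The "4-bit signed is sufficient" claim comes down to two's-complement
range: a 4-bit signed field covers -8..7, so offset = -1 (encoded 0b1111)
fits, whereas unsigned 4-bit tops out at 15 and cannot express -1 at all.
A tiny decode sketch (the function name is illustrative, not from the
spec pseudocode):

```python
def decode_offset4(bits):
    """Decode a 4-bit field as a two's-complement signed integer."""
    assert 0 <= bits < 16, "field is 4 bits wide"
    # if the sign bit (MSB of the 4-bit field) is set, subtract 2^4
    return bits - 16 if bits & 0b1000 else bits

# 0b1111 decodes to -1, the value needed to place RS at RT - 1;
# the full signed range is -8..7
assert decode_offset4(0b1111) == -1
assert decode_offset4(0b1000) == -8
assert decode_offset4(0b0111) == 7
```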
(In reply to Jacob Lifshay from comment #75)
> we don't need another bit for now, 4-bit signed is sufficient for this.

there are only 3 in svshape4.

svshape4, 2nd mode:
--------

| 0:5|6:8|9:10|
| -- |---|----|
| PO |SVo|si  |

it's pretty normal to get confused between the instruction(s) and the
SVSHAPEs that the instructions update/modify.
(In reply to Luke Kenneth Casson Leighton from comment #76)
> (In reply to Jacob Lifshay from comment #75)
>
> > we don't need another bit for now, 4-bit signed is sufficient for this.
>
> there's only 3 in svshape4.

ok, that doesn't matter for this case, since it uses matrix mode to get a
constant offset of -1, not rhombus, triangle, or the other modes, so it'll
use svshape, not svshape4.

    SVSHAPE3 = SVSHAPE(0)
    SVSHAPE3.order = (0, 2, 1)  # so something is non-zero
    SVSHAPE3.offset = -1  # FIXME: offset is unsigned
editing svshape4 *again* :)

SVyx needs to be 2-bit, split across 28 & 31, to fit into PO5 (temporary)
https://libre-soc.org/openpower/sv/draft_opcode_tables/
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=9500dc9b0

annoying, WIP, svshape4 is all over the shop. moving it to PO5 and
getting the fields sorted out.
holy cow this is getting complex :)

currently (remapyield.py example) this is achievable:

    triangle 5 5 5 1
    iterate_indices 5 tri True mod True offs True dims 1 5 5 mode 0b11 sm2 0b0 _skip 3 inv [0, 0, 0]
    0->0 11
    1->1 1
    2->2 11
    3->2 1
    4->3 1
    5->4 11
    6->3 1
    7->4 1
    8->5 1
    9->6 11
    10->4 1
    11->5 1
    12->6 1
    13->7 1
    14->8 111

but what is actually needed is:

    0
    10
    210
    3210

working through it
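For reference, the wanted pattern (each triangle row walking back down to
0) is straightforward to express as a generator. This is only a sketch of
the target index sequence, not the remapyield.py implementation (the
function name is mine):

```python
def tri_rev_indices(n):
    """Yield triangle indices with each row reversed:
    row 0: 0 / row 1: 1,0 / row 2: 2,1,0 / row 3: 3,2,1,0 ..."""
    for row in range(n):
        # count down from the row number back to 0
        for col in range(row, -1, -1):
            yield col

# the 4-row case matches the "0 / 10 / 210 / 3210" pattern above
assert list(tri_rev_indices(4)) == [0, 1, 0, 2, 1, 0, 3, 2, 1, 0]
```

The open question is which combination of Tr mode, invxy, and pr bits in
the REMAP shape produces this sequence, rather than the ascending-row
output shown in the log.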
After doing some research, it looks like we only promised NLnet that we'd
work on demos of cryptographic algorithms, not which specific ones we'd
work on, or what kind of demos those would be. NLnet only has #773 on the
MOU, which this bug is part of. Therefore I think we can say that the
mul_remap.py source code and the discussion on this bug qualify as a
demo, so we can close this bug as completed.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/mul_remap.py;h=d11ef619650882df2234a4c75d64fa17638a063b;hb=9500dc9b04f3f80bd3f0fbf1c83887f07548818c

For additional work outside of the cryptorouter grant, we can create a
new bug if/when we get around to that.
(In reply to Jacob Lifshay from comment #81)
> which this bug is part of. Therefore I think we can say that the
> mul_remap.py source code and the discussion on this bug qualifies as a
> demo, so we can close this bug as completed.

yeah, i'd agree with that logic.
assigning to 2021-02-052@nlnet.nl - a payment has already been made to jacob, therefore review has already been carried out.