Bug 1155 - O(n^2) multiplication REMAP mode(s)
Summary: O(n^2) multiplication REMAP mode(s)
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's second ASIC
Classification: Unclassified
Component: source code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL: https://libre-soc.org/openpower/sv/re...
Depends on:
Blocks: 1044 1151 1157
Reported: 2023-09-09 00:12 BST by Jacob Lifshay
Modified: 2024-05-22 05:45 BST (History)
4 users

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 1000
budget (EUR) for this task, excluding subtasks' budget: 1000
parent task for budget allocation: 773
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
[lkcl]
amount = 300
submitted = 2024-05-21

[jacob]
amount = 700
submitted = 2024-05-13


Attachments

Description Jacob Lifshay 2023-09-09 00:12:35 BST
Create a suite of REMAP modes suitable for triangular, rhombus, and modulo
patterns:

xxxx  xxxx    1234
 xxx   xxxx   4123
  xx    xxxx  3412
   x     xxxx 2341
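
The three patterns can be sketched in a few throwaway lines of python (function names here are invented for illustration, not part of any spec):

```python
def triangle(n):
    # first pattern: the run of x's shrinks by one each row
    return [" " * r + "x" * (n - r) for r in range(n)]

def rhombus(n):
    # second pattern: constant-width band, shifted right one place per row
    return [" " * r + "x" * n for r in range(n)]

def modulo(n):
    # third pattern: row r is the sequence 1..n rotated right by r
    return ["".join(str((c - r) % n + 1) for c in range(n)) for r in range(n)]
```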

---

* TODO: note the inner loop notification used by "svstep." in Vertical-First
  https://bugs.libre-soc.org/show_bug.cgi?id=1044#c16
* TODO: write up a design spec for the three REMAP modes (involves a new svshape3)
* TODO: write python "yield" versions (see DCT, FFT and other REMAP modes)
        with mini demos / unit tests
* TODO: confirm that the spec changes are ok (see comment #5)
* TODO: (IN A BRANCH FOR REVIEW) make the spec changes
* TODO: (in same branch): using the yielder(s) add actual REMAP unit tests
* TODO: review branch
* TODO: write PR/blog about bigmul (budget under another bug, TODO link)
* TODO: commit and close

Basic algorithm for rhombus:

typedef unsigned __int128 uint128_t; /* GCC/Clang extension */

void mul(
    uint64_t *product,
    const uint64_t *a,
    const uint64_t *b,
    size_t a_sz,
    size_t b_sz
) {
    for(size_t bi = 0; bi < b_sz; bi++) {
        product[bi] = 0;
    }
    for(size_t ai = 0; ai < a_sz; ai++) {
        uint64_t carry = 0;
        for(size_t bi = 0; bi < b_sz; bi++) {
            uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
            v += (uint128_t)carry;
            v += (uint128_t)product[ai + bi];
            carry = v >> 64;
            product[ai + bi] = (uint64_t)v;
        }
        product[ai + b_sz] = carry;
    }
}
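
A python transcription of the same rhombus algorithm (helper names invented here) makes it easy to sanity-check against python's native bigints:

```python
def mul(a, b):
    # same structure as the C version above, with 64-bit limbs;
    # product is fully zeroed up-front for simplicity
    M = 1 << 64
    product = [0] * (len(a) + len(b))
    for ai in range(len(a)):
        carry = 0
        for bi in range(len(b)):
            v = a[ai] * b[bi] + carry + product[ai + bi]
            carry = v >> 64
            product[ai + bi] = v % M
        product[ai + len(b)] = carry
    return product

def i2l(v, n):  # int -> n 64-bit limbs, little-endian
    return [(v >> (64 * i)) & ((1 << 64) - 1) for i in range(n)]

def l2i(limbs):  # limbs -> int
    return sum(v << (64 * i) for i, v in enumerate(limbs))
```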

---

instruction forms

key:

* ModS - summation with modulo
* ModM - matrix with modulo
* Sq - Square
* Tr - Triangle
* pr        permute x/y y/x
* col       column / row
* sk1       skip X
* invxy     invert x, invert y

svshape3 TODO
--------

| 0:5|6:10 |11:15  |16:20 | 21:25    26| 27:31 |  Form    |
| -- | --  | ---   | ---- | ---------- | ------| -------- |
| PO | x   | y     | z/XXX| XX/yx/XX/X | XO    |TODO-Form |

based on svshape
(svshape SVxd,SVyd,SVzd,SVrm,vf)
 but z here is only 2 bits, setting vf as well is too much... oh wait:

16-17: z
18: ModM/ModS
19: Sq/Tr
20: col
21: sk
22: pr
23: SVyx (invxy)
24: 0
25-26: si (SVSHAPE0-3 Index)

however all bits 21:25 can be just an operand "SVrm"

# 1.6.35 SVM-FORM
    |0     |6        |11      |16    |21    |25 |26    |31  |
    | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |

new fields, SVz2 and sm2 (2-bit SVzd and submode2).
SVrm can remain as-is and be described in a separate table

|16  18|19 20|21  24|
|sm2   |SVz2 |SVrm  |

SVrm:

|0               4|
|sk pr SVyx col vf|


svshape4, 2nd mode, fit in PO5
--------

| 0:5|6:8|9:10|11:15|16:20| 21 22:23 24:25 |26:27|28  |29:30|31  |  Form    |
| -- |---|----| --- | ----| -------------- | --- |--  |-----|    | -------- |
| PO |SVo|si  | y   | x   | sk /sm2 / sm   | XO  |SVyx|XO   |SVyx|SVI2-Form |

instruction operands for svshape4

    svshape4 x,y,SVo,SVyx,sk,sm2,sm,si

where

* sm2 (submode2) covers rs/xx/yy below
  (MSB0 order: sm = submode[1:2])
  (rs: reserved)
* si (SVSHAPE INDEX 0-3)
* sm (submode) mods/m sq/tr as below

|0:5   |6:11  |12:17    |18 19 20 |21  22:23|24:27 | 28:29 |30:31|mode   |
|xdimsz|ydimsz| zdimsz  |rs/nx/ny |sk1/invxy|offset| 0b00  |0b11 |ModS/Sq|
|xdimsz|ydimsz| zdimsz  |rs/pr/col|sk1/invxy|offset| 0b01  |0b11 |ModM/Sq|
|xdimsz|ydimsz| zdimsz  |rs/nx/ny |sk1/invxy|offset| 0b10  |0b11 |ModS/Tr|
|xdimsz|ydimsz| zdimsz  |rs/pr/col|sk1/invxy|offset| 0b11  |0b11 |ModM/Tr|
Comment 1 Luke Kenneth Casson Leighton 2023-09-09 00:28:57 BST
(In reply to Jacob Lifshay from comment #0)
> Create a REMAP mode

>     for(size_t ai = 0; ai < a_sz; ai++) {
>         uint64_t carry = 0;
>         for(size_t bi = 0; bi < b_sz; bi++) {
>             uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];

the basics for the bigmulremap fn become:
   for i in range(asz):
      for j in range(bsz):
          yield i, j, i+j

however additional modes then include options to
do yield i, j, ((i+j) % bsz) plus "reversing" options
(horiz, vert, h&v) and perhaps 2D mirroring as well.
that's 5 bits of mode which is quite enough.
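
A sketch of that yielder with the extra mode options wired in (parameter names are hypothetical, not the final encoding):

```python
def bigmul_indices(asz, bsz, modulo=False, rev_i=False, rev_j=False):
    # only the index patterns come from the comment above;
    # the keyword names are invented for illustration
    for i in range(asz):
        for j in range(bsz):
            ii = asz - 1 - i if rev_i else i     # "vert" reversal
            jj = bsz - 1 - j if rev_j else j     # "horiz" reversal
            d = (ii + jj) % bsz if modulo else ii + jj
            yield ii, jj, d
```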

i know that konstantinos may have something to
add about the motion-estimate algorithm he did, i
know there is a similar pattern but it is...
i//2, j, i//2+j or something like that. this may be
a bit much though.

have to be very careful as there is not much space
left in svshape.
Comment 2 Luke Kenneth Casson Leighton 2023-09-09 12:21:09 BST
space in SHAPE SPRs

https://libre-soc.org/openpower/sv/remap/


|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |0b00 |Matrix |
|xdimsz|ydimsz|SVGPR    | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
|xdimsz|mode  | zdimsz  | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
| rsvd |rsvd  |xdimsz   | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
|      |      |         |         |         |      |       |0b11 |rsvd   |

=>

| rsvd |rsvd  |xdimsz   | 00      | invxyz  |offset|submode|0b10 |Red/Sum|
| rsvd |ydimsz|xdimsz   | 01      | invxy 0 |offset|mirror |0b10 |Bigmul|

or


| rsvd |rsvd  |xdimsz   | rsvd    | invxy 0 |offset|submode|0b10 |Red/Sum|
| rsvd |ydimsz|xdimsz   | rsvd    | invxy 1 |offset|submode|0b10 |Bigmul |

have to check z of invxyz is free
Comment 3 Luke Kenneth Casson Leighton 2023-09-09 12:27:56 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=33a02e6612065f290d840e15a596dfc2177de5e5;hb=f746c31efca189fb09aec2e661dba686616a846f#l320

unused (not set). i did tell you repeatedly to enable inversion of both
x and y in parallel-reduce REMAP.

z is free so can be used to shoe-horn bigmul in it.  next step, the
instruction format
Comment 4 Luke Kenneth Casson Leighton 2023-09-09 12:38:39 BST
svshape SVxd,SVyd,SVzd,SVRM,vf
0:5	6:10	11:15	16:20	21:24	25	26:31	name
PO	SVxd	SVyd	SVzd	SVRM	vf	XO	svshape

Fields:

SVxd - SV REMAP "xdim" (X-dimension)
SVyd - SV REMAP "ydim" (Y-dimension, sometimes used for sub-mode selection)
SVzd - SV REMAP "zdim" (Z-dimension)
SVRM - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)

ok so SVyd is the de-facto selector for sub-modes: SVyd "bit 0" can
be used for selecting bigmul/prefix, under SVRM=0b0111, whereupon
bits 1-2 can be used for inversion and 2-3 for mirror...

damn damn damn 5 bits are needed: one for long-mul, one for carrysave

ah hang on i know: "/mrr" performs a full reverse, which in combination
with inverting either x or y order *back* again (single bit) you get
the full complement. i think.

so:

* bit 0 select bigmul or Prefx
* bit 1 select carrysave or longmul
* bit 2 select invert x (or y?)
* bit 3 mirror x (5 45 345 2345 12345 012345 instead of 0 01 012 ...)
* bit 4 mirror y (ditto)
Comment 5 Luke Kenneth Casson Leighton 2023-10-01 08:44:28 BST
we will need an svshape3. damn.
https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-September/005711.html
Comment 6 Luke Kenneth Casson Leighton 2023-10-06 09:49:51 BST
okaaay, there are some necessary steps here, but there are *two* branches
needed which is a royal pain in the ass (openpower-isa, spec).

what i'm proposing - bear in mind this is still PO22 - is to move
bincrflut and binlut to the "- 10 00 |0" row, leaving their current
locations free for svshape3 and 4.

that way it is possible to do a svshape3, bringing in more space for
REMAP options. which will be identical to svshape.

| NN |    |     |      |         | -  10  00 |0 | rsvd     | rsvd  |
| NN |    |     |      |         | -  11  00 |0 | rsvd     | rsvd  |
| NN | RT | RA  | RB   | RC      | nh 00  00 |1 | binlut |  VA-Form      |
| NN | RT | RA  | RB   | /BFA/   | 0  01  00 |1 | bincrflut  | VA-Form     |
| NN |    |     |      |         | 1  01  00 |1 | svindex    | SVI-Form    |
| NN | RT | RA  | RB   | mode    | L  10  00 |1 | bmask    | BM2-Form     |
| NN |    |     |      |         | 0  11  00 |1 | svshape    | SVM-Form    |
| NN |    |     |      |         | 1  11  00 |1 | svremap   | SVRM-Form   |

git diffs:

* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=261c0c4a
* https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ac538421
Comment 7 Jacob Lifshay 2023-12-22 05:19:28 GMT
I came up with a somewhat different bigint mul remap algorithm; it has a few issues though. I also rebased the branch on latest master.

commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Dec 21 21:10:11 2023 -0800

    tests/bigint/powmod: initial version of bigint multiply remap
    
    it has some issues around being able to encode scalar RS but
    vector RT, and doesn't match the scalar * vector multiplication
    pattern, but is quite compact.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=73d87b35dd8df841596fff4f5153464362882917;hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l102

python version:
    y = [0] * (a_sz + b_sz)
    ca = 0
    for i in range(a_sz * b_sz):
        # no need to clear ca between ai outer loops, since the partial
        # products can't get big enough to have a carry out, so ca will
        # always be zero when (i % b_sz == 0).
        # That said, hardware will probably want to pattern-match this to
        # remove the unnecessary dependency through ca.
        y[a_plus_b_idx[i]], t = maddedu(
            a[a_idx[i]], b[b_idx[i]], y[a_plus_b_idx[i]])
        y[a_plus_b_plus_1_idx[i]], ca = adde(
            y[a_plus_b_plus_1_idx[i]], t, ca)
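
A self-contained version of the above, with the four index lists reconstructed (the linked commit builds them; their construction below is an assumption, only their use is quoted in the comment):

```python
W = 64
MASK = (1 << W) - 1

def maddedu(a, b, c):          # returns (RT = low word, RS = high word)
    v = a * b + c
    return v & MASK, v >> W

def adde(a, b, ca):            # returns (RT = sum word, CA = carry out)
    v = a + b + ca
    return v & MASK, v >> W

def mul_remap(a, b):
    a_sz, b_sz = len(a), len(b)
    # reconstructed remap index lists (assumed shape: ai outer, bi inner)
    a_idx = [ai for ai in range(a_sz) for bi in range(b_sz)]
    b_idx = [bi for ai in range(a_sz) for bi in range(b_sz)]
    a_plus_b_idx = [ai + bi for ai in range(a_sz) for bi in range(b_sz)]
    a_plus_b_plus_1_idx = [i + 1 for i in a_plus_b_idx]
    y = [0] * (a_sz + b_sz)
    ca = 0
    for i in range(a_sz * b_sz):
        y[a_plus_b_idx[i]], t = maddedu(a[a_idx[i]], b[b_idx[i]],
                                        y[a_plus_b_idx[i]])
        y[a_plus_b_plus_1_idx[i]], ca = adde(y[a_plus_b_plus_1_idx[i]], t, ca)
    return y
```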
Comment 8 Jacob Lifshay 2023-12-22 05:24:39 GMT
(In reply to Jacob Lifshay from comment #7)
> I came up with a somewhat different bigint mul remap algorithm, it has a few
> issues though...I also rebased the branch on latest master.

so, basically, we have to decide if we want a really compact but somewhat problematic loop, or a somewhat bigger loop (algorithm in comment #0 and copied below) that's likely easier for hw to handle (since it matches the scalar * vector multiply pattern more closely):

typedef unsigned __int128 uint128_t; /* GCC/Clang extension */

void mul(
    uint64_t *product,
    const uint64_t *a,
    const uint64_t *b,
    size_t a_sz,
    size_t b_sz
) {
    for(size_t bi = 0; bi < b_sz; bi++) {
        product[bi] = 0;
    }
    for(size_t ai = 0; ai < a_sz; ai++) {
        uint64_t carry = 0;
        for(size_t bi = 0; bi < b_sz; bi++) {
            uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
            v += (uint128_t)carry;
            v += (uint128_t)product[ai + bi];
            carry = v >> 64;
            product[ai + bi] = (uint64_t)v;
        }
        product[ai + b_sz] = carry;
    }
}
Comment 9 Luke Kenneth Casson Leighton 2023-12-22 08:23:09 GMT
(In reply to Jacob Lifshay from comment #7)
> I came up with a somewhat different bigint mul remap algorithm, it has a few
> issues though...I also rebased the branch on latest master.
> 
> commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Thu Dec 21 21:10:11 2023 -0800
> 
>     tests/bigint/powmod: initial version of bigint multiply remap
>     
>     it has some issues around being able to encode scalar RS but
>     vector RT, and doesn't match the scalar * vector multiplication
>     pattern, but is quite compact.

i was expecting / anticipating use of vertical-first, this is
different / cool, how does it work?


> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/
> bigint/powmod.py;h=73d87b35dd8df841596fff4f5153464362882917;
> hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l102

please remove this:
 105     # run this file in a debugger to see all the intermediate values.

(use log statements instead)

> python version:
>     y = [0] * (a_sz + b_sz)
>     ca = 0
>     for i in range(a_sz * b_sz):
>         # no need to clear ca between ai outer loops, since the partial
>         # products can't get big enough to have a carry out, so ca will
>         # always be zero when (i % b_sz == 0).
>         # That said, hardware will probably want to pattern-match this to
>         # remove the unnecessary dependency through ca.
>         y[a_plus_b_idx[i]], t = maddedu(
>             a[a_idx[i]], b[b_idx[i]], y[a_plus_b_idx[i]])
>         y[a_plus_b_plus_1_idx[i]], ca = adde(
>             y[a_plus_b_plus_1_idx[i]], t, ca)

reduce these variable names down to fit on one line.

e.g. ab1i then add a comment on initialisation.

i'll take a closer look, as it will take some time for the impact to
sink in.

if done as a "yielder" i think i will get it immediately.
like this (which is entirely standalone executable, *no*
use of *any* external modules, this is very deliberate)

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=a15c5bf7871507b34a6e2a043e73284e60e22bec;hb=a3c1930de7990c8babcb3908ed7650e1d08eafb6#l85

that's the next code-morph step.
Comment 10 Jacob Lifshay 2023-12-22 08:36:02 GMT
(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #7)
> > I came up with a somewhat different bigint mul remap algorithm, it has a few
> > issues though...I also rebased the branch on latest master.
> > 
> > commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> > Author: Jacob Lifshay <programmerjake@gmail.com>
> > Date:   Thu Dec 21 21:10:11 2023 -0800
> > 
> >     tests/bigint/powmod: initial version of bigint multiply remap
> >     
> >     it has some issues around being able to encode scalar RS but
> >     vector RT, and doesn't match the scalar * vector multiplication
> >     pattern, but is quite compact.
> 
> i was expecting / anticipating use of vertical-first, this is
> different / cool, how does it work?

it is vertical first.

basically, I realized that, if you do it in the right sequence, you can compute a[:] * b[:] by just computing each partial product a[ai] * b[bi] and adding each one in the right place in the final result:

y = [0] * ...
for ai, bi in right_sequence():
    partial_product = a[ai] * b[bi]
    y += partial_product << (word_size * (ai + bi))

this becomes a vertical first loop:
loop:
    sv.maddedu *y, *a, *b, *y  # RS is stored in scalar t
    sv.adde *(y + 1), *(y + 1), t
    svstep.
    sv.bdnz ... loop


> if done as a "yielder" i think i will get it immediately.
> like this (which is entirely standalone executable, *no*
> use of *any* externl modules, this is very deliberate)

well, notice it builds the list of remap indexes, and then the loop that actually operates on the data just goes through those indexes in sequence, doing (combination python and sv asm syntax):
*y, t = maddedu(*a, *b, *y)
*(y + 1), ca = adde(*(y + 1), t, ca)

so the "yielder" (if you mean the code that produces remap indexes) is just the loops building the indexes, with:
some_list.append(the_index)

replaced with:
yield the_index.
Comment 11 Luke Kenneth Casson Leighton 2023-12-22 08:58:51 GMT
ok gimme an hour or so to do a code-morph, it will
also help me to understand. there's something cool
going on here
Comment 12 Luke Kenneth Casson Leighton 2023-12-22 09:50:15 GMT
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > (In reply to Jacob Lifshay from comment #7)
> > > I came up with a somewhat different bigint mul remap algorithm, it has a few
> > > issues though...I also rebased the branch on latest master.
> > > 
> > > commit a3c1930de7990c8babcb3908ed7650e1d08eafb6
> > > Author: Jacob Lifshay <programmerjake@gmail.com>
> > > Date:   Thu Dec 21 21:10:11 2023 -0800
> > > 
> > >     tests/bigint/powmod: initial version of bigint multiply remap
> > >     
> > >     it has some issues around being able to encode scalar RS but
> > >     vector RT, and doesn't match the scalar * vector multiplication
> > >     pattern, but is quite compact.
> > 
> > i was expecting / anticipating use of vertical-first, this is
> > different / cool, how does it work?
> 
> it is vertical first.

ahhh

> basically, I realized that, if you do it in the right sequence, you can
> compute a[:] * b[:] by just computing each partial product a[ai] * b[bi] and
> adding each one in the right place in the final result:
> 
> y = [0] * ...
> for ai, bi in right_sequence():
>     partial_product = a[ai] * b[bi]
>     y += partial_product << (word_size * (ai + bi))

... assuming y is a massive long "thing" of bitsize wordsize*(ai+bi)
but the subdivisions are over wordsize boundaries. i start to get it. neat.
 
> this becomes a vertical first loop:
> loop:
>     sv.maddedu *y, *a, *b, *y  # RS is stored in scalar t
>     sv.adde *(y + 1), *(y + 1), t
>     svstep.
>     sv.bdnz ... loop

briiilliant. i love it! that's the kind of laughably-short assembler
we need :)
 
> well, notice it builds the list of remap indexes, and then the loop that
> actually operates on the data just goes through those indexes in sequence,

see diff below, i split out those two roles explicitly separately,
as that's the next step in the code-morph sequence.

> doing (combination python and sv asm syntax):
> *y, t = maddedu(*a, *b, *y)
> *(y + 1), ca = adde(*(y + 1), t, ca)

love it.

> so the "yielder" (if you mean the code that produces remap indexes) is just
> the loops building the indexes, with:
> some_list.append(the_index)
> 
> replaced with:
> yield the_index.

yes. done.

   https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=aa6f10d1f2

that's how all the ISACaller REMAP yields work, but if there's
no yielder module (e.g. remapyield.py) then ISACaller can't *have* that
functionality and we can't do the tests :)

okaay next task now we have the actual algorithm python_mul_remap_yielder(a_sz, b_sz)
is to work out how it's going to fit into SVSHAPE0-3.
that process is started at comment #2
Comment 13 Jacob Lifshay 2023-12-22 11:07:54 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)
> okaay next task now we have the actual algorithm
> python_mul_remap_yielder(a_sz, b_sz)
> is to work out how it's going to fit into SVSHAPE0-3.
> that process is started at comment #2

before we get ahead of ourselves, we need to make sure that remap sequence actually works, which requires figuring out how maddedu can have scalar RS and vector RT, RA, RB, and RC:

> > doing (combination python and sv asm syntax):
> > *y, t = maddedu(*a, *b, *y)
Comment 14 Jacob Lifshay 2023-12-22 11:58:32 GMT
i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits (e.g. a flag that repurposes some bits somewhere), then we can use them like so:
000 normal matrix mode
001 x
010 y
011 x + y
100 z
101 x + z
110 y + z
111 x + y + z

this gives us all the different things matrix mode can do for each of the x,y,z axes, and lets us extract just one axis, or sum axes together. invxyz lets us subtract them too.
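
A sketch of the axis-summing selector, treating the 3-bit value as one enable bit per axis (an interpretation of the table above, not spec text):

```python
def axis_sum_index(x, y, z, mode):
    # mode is the 3-bit selector: bit 0 enables x, bit 1 y, bit 2 z
    # (mode == 0b000, "normal matrix mode", is handled elsewhere)
    total = 0
    if mode & 0b001:
        total += x
    if mode & 0b010:
        total += y
    if mode & 0b100:
        total += z
    return total
```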
Comment 15 Luke Kenneth Casson Leighton 2023-12-22 13:30:55 GMT
ok so this is the primary "yielder" function which
fascinatingly, the first two (ai, bi) are covered
by the standard matrix and svindex REMAPs, the other
two are covered by exactly the middle of the three
patterns from comment #0 but just offset by one
in the case of the 4th.

def python_mul_remap_yielder(a_sz, b_sz):
    for ai in range(a_sz):
        for bi in range(b_sz):
            yield ai, bi, ai + bi, ai + bi + 1

i'm almost tempted to suggest that the offset (+1)
be a parameter, but... hmmm... is there space to do
that (in SVSHAPE0-3 i mean)?

that's going to be the next critical task, but i feel
that we may need a few more examples here so as not
to run out of bits in SVSHAPEn, hmm hmm not sure,
tempted also to just go for it

anyway next task definitely to choose a format from comment #2

> space in SHAPE SPRs
> 
> https://libre-soc.org/openpower/sv/remap/
> 
> 
> |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |Matrix |
> | rsvd |rsvd  |xdimsz   | rsvd    | invxy 0 |offset|submode|0b10 |Red/Sum|
> | rsvd |ydimsz|xdimsz   | rsvd    | invxy 1 |offset|submode|0b10 |Bigmul |

still TODO:
 
> have to check z of invxyz is free

(and jacob you still need to implement inversion in Red/Sum, i'm
 not going to do it for you. it doesn't matter what *you* think it
 might or might not be used for: we present people with the option
 and they explore over the next 1-10+ years)

* the offset there (bits 24-27) will work out (the "+1" for example)
* rsvd bits 0:5 can be the "modulo" which if set to 0 means "no modulo"
  this will give the repeating modulo pattern
* permute (bits 18:20) will say what the order of the for-loops is
  ("for ai for bi" vs "for bi for ai")
* submode specifies triangle or rhombus
* invxy says whether ai goes 0-xdimsz or xdimsz-0 and likewise bi

it's all there i believe?
Comment 16 Luke Kenneth Casson Leighton 2023-12-22 14:12:28 GMT
oo hang on, i have my head round this now :) so i have a question:
at the cross-over point to move to the next row (the outer loop)
is it *guaranteed* that ca=0 such that it will not interfere with
the new row?

i.e. at the end of each row, which is mul-adding the most significant
word-partial-result, will the adde *alway* produce a ca=0?

(i'm catching up on other comments)
Comment 17 Luke Kenneth Casson Leighton 2023-12-22 14:30:50 GMT
(In reply to Jacob Lifshay from comment #14)
> i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits
> (e.g. a flag that repurposes some bits somewhere), then we can use them like
> so:
> 000 normal matrix mode
> 001 x
> 010 y
> 011 x + y
> 100 z
> 101 x + z
> 110 y + z
> 111 x + y + z

*deep breath*... i love it!

> this gives us all the different things matrix mode can do for each of the
> x,y,z axes, and lets us extract just one axis, or sum axes together. invxyz
> lets us subtract them too.

aaawesome. no that's a bloody good idea.

let me walk through it on SVSHAPE (from comment #2)
and also do some clarification
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0b934a4d48d

> https://libre-soc.org/openpower/sv/remap/
> 
> 
> |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |0b00 |Matrix |
> |xdimsz|ydimsz|SVGPR    | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
> |xdimsz|mode  | zdimsz  | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
> | rsvd |rsvd  |xdimsz   | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
> |      |      |         |         |         |      |       |0b11 |rsvd   |

we have mode=0b11 free!

i wasn't planning to allocate it as it's extremely precious space in what
is a seriously loaded 32-bit SPR, but the idea of adding up (and subtracting)
the three dimensions x y z is a damn good one, let's roll with it.

|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |0b11 |BigMul |

hmm hmm hmm skip doesn't make sense here, so can be used as a submode

|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|submode|0b11 |BigMul |

if a dimension is zero (remember: the encoding is 0=1, 1=2, 2=3, because
the for-loops go "for i = 0 to xdimsz INCLUSIVE") then it can mean "disabled".

submode possibilities:

* submode=0b00 - triangular, no modulo
* submode=0b01 - rhombus, no modulo
* submode=0b10 - modulo-with-the-offset-added in the X dimension
* submode=0b11 - modulo-with-the-offset-added in the Y dimension

there's not enough encoding space to do "modulo-with-offset in Z"
and i would really like to leave permute=0b110/0b111 for another
encoding (reserved) for future REMAPs.

(in the case of matrix - mode=0b00 - "Indexed" REMAP was embedded
 into mode=0b00,permute=0b110/0b111 and for symmetry as well as
 saving hardware gates i'd like the same thing to be done for
 this BigMul REMAP as well)

awesome idea jacob, to add up the values. actually really simple
to implement, as well. but let's not complicate remapyield.py,
let's take a copy of it ok?
Comment 18 Jacob Lifshay 2023-12-22 23:59:04 GMT
(In reply to Luke Kenneth Casson Leighton from comment #16)
> oo hang on, i have my head round this now :) so i have a question:
> at the cross-over point to move to the next row (the outer loop)
> is it *guaranteed* that ca=0 such that it will not interfere with
> the new row?

yes afaict. I did an exhaustive check for word sizes of 2 bits and 3-word by 3-word multiplication (code below), so this formally proves that case. I have no reason to think that doesn't extend to more words and/or bigger words, so it should work.

exhaustive check code (not pretty, but works):

>>> def maddedu(a, b, c): return (a*b+c)%4,(a*b+c)//4
... 
>>> def adde(a,b,c): return (a+b+c)%4,(a+b+c)//4
... 
>>> def mul(a,b):
...  y=[0]*(len(a)+len(b))
...  ca=0
...  for ai in range(len(a)):
...   for bi in range(len(b)):
...    y[ai+bi],t=maddedu(a[ai],b[bi],y[ai+bi])
...    y[ai+bi+1],ca=adde(y[ai+bi+1],t,ca)
...  return y
... 
>>> def i2l(v,l): return [(v >> (i * 2)) % 4 for i in range(l)]
... 
>>> def l2i(l): return sum(v << (i * 2) for i, v in enumerate(l))
... 
>>> for a in range(64):
...  for b in range(64):
...   expected = a * b
...   v = l2i(mul(i2l(a,3),i2l(b,3)))
...   assert expected == v
... 
>>>
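
The same check restated as a plain function, parameterised over word size and limb count (a restatement of the REPL session above, not a new proof):

```python
def exhaustive_check(word_bits=2, n_limbs=3):
    base = 1 << word_bits

    def maddedu(a, b, c):
        return (a * b + c) % base, (a * b + c) // base

    def adde(a, b, c):
        return (a + b + c) % base, (a + b + c) // base

    def mul(a, b):
        # note: ca deliberately NOT cleared between outer iterations,
        # which is exactly the property being checked
        y = [0] * (len(a) + len(b))
        ca = 0
        for ai in range(len(a)):
            for bi in range(len(b)):
                y[ai + bi], t = maddedu(a[ai], b[bi], y[ai + bi])
                y[ai + bi + 1], ca = adde(y[ai + bi + 1], t, ca)
        return y

    def i2l(v, l):
        return [(v >> (i * word_bits)) % base for i in range(l)]

    def l2i(l):
        return sum(v << (i * word_bits) for i, v in enumerate(l))

    top = 1 << (word_bits * n_limbs)
    for a in range(top):
        for b in range(top):
            assert l2i(mul(i2l(a, n_limbs), i2l(b, n_limbs))) == a * b
    return True
```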
Comment 19 Jacob Lifshay 2023-12-23 00:05:03 GMT
(In reply to Luke Kenneth Casson Leighton from comment #17)
> (In reply to Jacob Lifshay from comment #14)
> > i had an idea for SVSHAPE: if we can get a matrix mode with 3 spare bits
> > (e.g. a flag that repurposes some bits somewhere), then we can use them like
> > so:
> > 000 normal matrix mode
> > 001 x
> > 010 y
> > 011 x + y
> > 100 z
> > 101 x + z
> > 110 y + z
> > 111 x + y + z
> 
> *deep breath*... i love it!

we can take all the combinations that duplicate something already encodable in matrix mode and just declare them reserved, so at least 000 and 001, icr if matrix mode can produce a sequence like 000011112222333300001111... which would cover both 010 and 100 too.
Comment 20 Jacob Lifshay 2023-12-23 00:08:34 GMT
(In reply to Jacob Lifshay from comment #19)
> we can take all the combinations that duplicate something already encodable
> in matrix mode and just declare them reserved, so at least 000 and 001, icr
> if matrix mode can produce a sequence like 000011112222333300001111... which
> would cover both 010 and 100 too.

if all 4 options with zero or one axis selected are redundant, then we can save a bit and the encoding can be as follows:

00 x + y
01 x + z
10 y + z
11 x + y + z

basically we remove the bit for z and set it to 1 if x or y are set, otherwise set z to 0 and x and y to 1.
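
The 2-bit encoding can be expanded back to per-axis enable flags with a small sketch (the decode rule below is paraphrased from the comment, not from any spec):

```python
SUM_MODES = {0b00: "x+y", 0b01: "x+z", 0b10: "y+z", 0b11: "x+y+z"}

def decode_axes(code):
    # expand the 2-bit code to (x, y, z) enable flags
    x = code in (0b00, 0b01, 0b11)
    y = code in (0b00, 0b10, 0b11)
    z = code != 0b00        # z is implied whenever the code is not "x+y"
    return x, y, z
```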
Comment 21 Luke Kenneth Casson Leighton 2023-12-23 13:16:09 GMT
(In reply to Jacob Lifshay from comment #19)

> we can take all the combinations that duplicate something already encodable
> in matrix mode

ah no. not a good idea. the encoding is complex enough as it is.
the point is that this is a RISC encoding not a CISC encoding.
it's right at the critical juncture between decode and issue.

hence why i said *do not* take out the reversing on reduce/sum because
it matches with other inversion, and you actually have to have *extra gates*
to *EXCLUDE* the "i don't see why anyone would want this" logic that
caused you (without consultation) to remove the reverse-direction logic
from remap_preduce.py (and still not yet add it back, as i've reminded
you needs to be done, many times since)


> and just declare them reserved, so at least 000 and 001, icr
> if matrix mode can produce a sequence like 000011112222333300001111...

of course! that's absolutely critical. it's a degenerate case of an X by Y
matrix Schedule-walk, where Y=1. look at remapyield.py - see "skip" mode.

above sequence is produced by

    skip=0b110,x=4,y=1,VL=20 (larger VL for the dots)
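
A hypothetical simplification of that degenerate walk (this is NOT the actual remapyield.py skip-mode code, just an illustration of the sequence it produces):

```python
def degenerate_matrix_walk(x, vl):
    # each index repeats x times, wrapping modulo x: an X-by-1
    # schedule walked for VL elements
    for i in range(vl):
        yield (i // x) % x
```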
Comment 22 Luke Kenneth Casson Leighton 2023-12-23 13:26:21 GMT
(In reply to Jacob Lifshay from comment #20)

> then we can save a bit

and complexify the encoding which is just not happening at a critical juncture.
top priority lesson to observe here, jacob: think multi-issue and how long
the gate chain is on complex decode, right where this is deciding register
reservation/allocation.
Comment 23 Luke Kenneth Casson Leighton 2023-12-23 17:23:05 GMT
(In reply to Jacob Lifshay from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #16)
> > oo hang on, i have my head round this now :) so i have a question:
> > at the cross-over point to move to the next row (the outer loop)
> > is it *guaranteed* that ca=0 such that it will not interfere with
> > the new row?
> 
> yes afaict. I did an exhaustive check for word sizes of 2 bits and 3-word by
> 3-word multiplication (code below), so this formally proves that case. I
> have no reason to think that doesn't extend to more words and/or bigger
> words, so it should work.

i know that the multiply itself, even with add-64-bit, cannot exceed
the boundary of 128.

and therefore we know also that an N-mul (N=256,512,etc) cannot exceed
the boundary of N*2 bits

and therefore i believe it reasonable to assume that an Nx64 (N mod 64 == 0)
multiply would also not exceed the boundary of.. err.... N+1?

which means that the last digit (N+1) of that add is never going to carry.

therefore, logically, if NxM (N%64==0 and M%64==0) is constructed from
a sequence of M//64 adds, each of length N+1 but shifted over (long-mul
style) then the last in each of those N+1 things is never going to involve
a carry...

but there *is* a mul-and-add involved, and i recall that even a mul-and-add
cannot produce a carry.

* 0xffff * 0xffff              = 0xfffe0001
* 0xffff * 0xffff + 0xffff     = 0xffff0000
* 0xffff * 0xffff + 0xffff + 1 = 0xffff0001

so even if you have mul-and-add-*and-carry* it still never exceeds 2x the bits.

because if you do (0xffff+1) * (0xffff+1) it is a^2 + 2ab + b^2
compared to 0xffff*0xffff and lay it out as a 2D square (a on x-axis
b on y-axis) only the bottom-left quadrant is full but the other
3 quadrants (representing the ab + ab + b^2) are entirely empty.

   a*b=0x0000ffff b*b=         1
   a^2=0xfffe0001 b*a=0x0000ffff

vs

     ........   ........
   0xfffe0001   ........

* b=1 therefore b^2 = 1
* a=0xffff therefore a^2 you have 0xfffe0001

but you have to have *two* lots of ab (0xffff) to reach 0x10000*0x10000
which as long as you do not try to do mul-add with 2 x 0xffff you are
*guaranteed* never to carry-over on a madd operation.

with maddedu only doing RC=0xffff, and not even taking a CA flag,
we're good.
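
The 16-bit identities above, restated at 64-bit width as executable python assertions:

```python
# the 0xffff * 0xffff identities at 64-bit word width
W = 64
m = (1 << W) - 1                          # the all-ones word
assert m * m == ((m - 1) << W) | 1        # 0xfffe...0001 analogue
assert m * m + m == m << W                # 0xffff...0000 analogue
assert m * m + m + 1 == (m << W) | 1      # 0xffff...0001 analogue
# only by adding *two* all-ones words does the result reach the very
# top of 2*W bits, so maddedu (one addend, no CA in) cannot carry out
assert m * m + 2 * m == (1 << (2 * W)) - 1
```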
Comment 24 Luke Kenneth Casson Leighton 2023-12-23 17:40:34 GMT
commit 1ece0cc3abbea9b39df6b58e4bb07b158f556caf (HEAD -> bug_1155_bigmul_remap)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sat Dec 23 17:38:59 2023 +0000

    bug 1155: add new idea by jacob to do x+y+z mode for bigmul remap

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=1ece0cc3a

took a look, started from a copy of remapyield.py and i believe it can be
slotted in very simply with minimal disruption to remapyield.py instead.
see how it goes.
Comment 25 Luke Kenneth Casson Leighton 2023-12-23 19:13:57 GMT
(In reply to Jacob Lifshay from comment #20)

> if all 4 options with zero or one axis selected are redundant, then we can
> save a bit and the encoding can be as follows:
> 
> 00 x + y
> 01 x + z
> 10 y + z
> 11 x + y + z
> 
> basically we remove the bit for z and set it to 1 if x or y are set,
> otherwise set z to 0 and x and y to 1.

damnit damnit, just realised that actually for the *matrix* style
(2D/3D) schedule, summing is just as useful *even for triangles*,
plus actually producing *triangle* schedules (without summing)
is also useful.

whiiich... takes up the entirety of "permute=0b110/111"... arg arg...

ok so what i am currently thinking is:

* mode=0b00 is current "Matrix" mode 
* mode=0b11 is *still matrix schedule* but is "summation" (x+y+z) BUT..

... under *sub*-modes it can go into "Triangular" (Pascal's Triangle).

where mode=0b00 you would get:

0  4 8  12
1  5 9  13
2  6 10 14
3  7 11 15

instead you would get:

0  4 8  12
   5 9  13
     10 14
        15

(or maybe)

0  
1  5
2  6 10
3  7 11 15

so N*(N+1)/2 numbers instead of N*N but they are *still* in NxN form
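the first (upper-triangle) layout could be sketched in python like this
(element numbering col*N + row is my assumption from the tables above):

```python
def triangle_schedule(N):
    # upper-triangle variant shown above: for each row, only the
    # elements at column >= row survive, still numbered col*N + row
    for row in range(N):
        for col in range(row, N):
            yield col * N + row

# N=4 yields the 10 (= N*(N+1)/2) element offsets shown above
assert list(triangle_schedule(4)) == [0, 4, 8, 12, 5, 9, 13, 10, 14, 15]
```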

then within this, when x=1 or y=1 or z=1, this triggers the "triangular"
variant.

thoughts?
Comment 26 Jacob Lifshay 2023-12-24 03:38:07 GMT
(In reply to Jacob Lifshay from comment #20)
> if all 4 options with zero or one axis selected are redundant, then we can
> save a bit and the encoding can be as follows:
> 
> 00 x + y
> 01 x + z
> 10 y + z
> 11 x + y + z
> 
> basically we remove the bit for z and set it to 1 if x or y are set,
> otherwise set z to 0 and x and y to 1.

this takes 1 nor gate, 1 not gate, and 2 or gates and saves 1 encoding bit -- I think that's a good tradeoff:

 xi  yi
 |   |
 @---|-----+
 |   |     |
 |   @-----|----------+
 |   |     |          |
\-___-/    |          |
|     |    |          |
|     |    |          |
 \---/     |          |
   O       |          |
   |       |          |
   @-------|---@------|---+
   |       |   |      |   |
 \---/    \-___-/    \-___-/
  \ /     |     |    |     |
   O      |     |    |     |
   |       \---/      \---/
   |         |          |
   z         x          y
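in python terms the decode is (a sketch matching the stated gate count;
which encoding bit is which is my assumption):

```python
def decode_xyz(b1, b0):
    # b1,b0: the 2-bit encoding from the table above
    n = not (b1 or b0)  # 1 NOR gate
    z = not n           # 1 NOT gate: z is set unless encoding is 0b00
    x = bool(b0 or n)   # 1 OR gate
    y = bool(b1 or n)   # 1 OR gate
    return x, y, z

# 00 x+y, 01 x+z, 10 y+z, 11 x+y+z
assert decode_xyz(0, 0) == (True, True, False)
assert decode_xyz(0, 1) == (True, False, True)
assert decode_xyz(1, 0) == (False, True, True)
assert decode_xyz(1, 1) == (True, True, True)
```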

(In reply to Luke Kenneth Casson Leighton from comment #22)
> and complexify the encoding

i don't think it's enough to matter here since it can be expanded in parallel with other parts such as checking the PO, thereby being unlikely to cause extra latency, which is the part that matters -- total gate count is basically unaffected and is secondary anyway.

(In reply to Jacob Lifshay from comment #19)
> we can take all the combinations that duplicate

I mostly meant just the 3 add-selection bits, i agree finding all possible combinations is excessive, however I think not removing any redundancy when it's easy to save a bit is excessive in the other direction.
Comment 27 Jacob Lifshay 2023-12-24 03:54:01 GMT
(In reply to Luke Kenneth Casson Leighton from comment #25)
> thoughts?

triangles are cool, needs further thought. I'm thinking we'd want a way to do new_x = (x + y) % x_sz for diagonal matrix stuff:

e.g.:
original matrix indexes:
0 1 2 3
4 5 6 7
8 9 A B

new indexes:
0 1 2 3 -- rotated right 0
5 6 7 4 -- rotated right 1
A B 8 9 -- rotated right 2
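a minimal python sketch of that remapping, i.e. new_x = (x + y) % x_sz
applied to the numbering above:

```python
def rotate_rows(rows, cols):
    # row y is the original row with each column remapped to
    # (x + y) % cols, so element (x, y) reads index y*cols + (x+y)%cols
    return [[y * cols + (x + y) % cols for x in range(cols)]
            for y in range(rows)]

assert rotate_rows(3, 4) == [
    [0, 1, 2, 3],    # rotated 0
    [5, 6, 7, 4],    # rotated 1
    [10, 11, 8, 9],  # rotated 2 (A B 8 9 in hex)
]
```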
Comment 28 Luke Kenneth Casson Leighton 2023-12-25 00:44:48 GMT
(In reply to Jacob Lifshay from comment #27)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > thoughts?
> 
> triangles are cool, needs further thought. I'm thinking we'd want a way to
> do new_x = (x + y) % x_sz for diagonal matrix stuff:

yep. see comment #0 3rd option
except it was just going to be sum (x+y).

> new indexes:
> 0 1 2 3 -- rotated right 0
> 5 6 7 4 -- rotated right 1
> A B 8 9 -- rotated right 2

ok, *matrix* rotated, interesting, like it.

ok so there are also some AV patterns to cover, which are:

  01234567     0123456789a
  --------     -----------
0|3456789a   0|   01234567
1|3456789a   1|   01234567
2|23456789   2|  01234567
3|23456789-> 3|  01234567
4|12345678   4| 01234567
5|12345678   5| 01234567
6|01234567   6|01234567
7|01234567   7|01234567

which is a "skip=0110" mode with x=7,y=7,z=2, which gives the (twice)
repeated row. i.e. where it was *believed* that skip could be dropped,
actually it can't.
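the left-hand pattern can be reconstructed with a sketch like this
(purely illustrative, my own reconstruction of the 8x8 table):

```python
def av_pattern(rows=8, width=8):
    # each pair of rows shares the same starting offset, decreasing
    # by one per pair: rows 0-1 start at 3, rows 2-3 at 2, rows 6-7 at 0
    return [[(rows - 1 - r) // 2 + c for c in range(width)]
            for r in range(rows)]

assert av_pattern()[0] == [3, 4, 5, 6, 7, 8, 9, 10]  # 3456789a in hex
assert av_pattern()[1] == av_pattern()[0]            # the repeated row
assert av_pattern()[7] == [0, 1, 2, 3, 4, 5, 6, 7]   # 01234567
```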

hm.

|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | submode2|sk1/invxy|offset|submode|0b11 |bigmul |

5 bits for submodes. sk1 to give that repetition. invxy for mirroring.
(like Indexed Mode which has mirroring)

let's see what is possible:

* submode[0]: standard matrix (x*xdimsz + y) or bigmul (x+y)
* submode[1]: square or triangle

* submode2[0] 0b00 modulo then add offset  0b01 add offset then modulo
* submode2[1:2] triangle 0b00 x+y   0b01 y+z   0b10 z+x   0b11 RSVD
                square   0b00 x,y,z 0b01 y,x,z 0b10 z,x,y 0b11 z,y,x

i am reasonably certain there are better uses for the last 2 bits of
submode2. offset is signed. with modulo it puts bigger numbers out first
Comment 29 Luke Kenneth Casson Leighton 2023-12-25 16:14:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #28)

> offset is signed.

no hang on offset need not be signed, it just needs to be
less than the modulo. so can be unsigned.
Comment 30 Luke Kenneth Casson Leighton 2023-12-25 17:09:17 GMT
(In reply to Luke Kenneth Casson Leighton from comment #28)

idea (bear in mind modulo on everything, so -ve nums will wrap)

 * submode2[1] triangle 0b0 x                   0b1 -x
 * submode2[2] triangle 0b0 y                   0b1 -y

 * submode2[1] square 0b0 x,y                   0b1 y,x
 * submode2[2] square 0b0 modulo shift on rows, 0b1 modulo shift on cols

and the modulo is VL (not MAXVL) which can be an extra parameter.
reminder:

* submode2[0] 0b00 modulo then add offset  0b01 add offset then modulo

so modulo (0..VL-1) plus offset has to be less than MAXVL which is fine.
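as a sketch (parameter names mine):

```python
def remap_index(i, offset, VL, offset_first):
    # submode2[0] selects the order of modulo and offset:
    # 0b0: modulo then add offset; 0b1: add offset then modulo
    if offset_first:
        return (i + offset) % VL
    return (i % VL) + offset

# with VL=4, offset=2: modulo-then-offset stays in range 2..5,
# offset-then-modulo wraps back into 0..3
assert [remap_index(i, 2, 4, False) for i in range(6)] == [2, 3, 4, 5, 2, 3]
assert [remap_index(i, 2, 4, True) for i in range(6)] == [2, 3, 0, 1, 2, 3]
```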
Comment 31 Luke Kenneth Casson Leighton 2023-12-27 10:50:06 GMT
this is doing my head in :) will edit shortly

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=745c7d22


|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |0b00 |Matrix |
|xdimsz|ydimsz|SVGPR    | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
|xdimsz|mode  | zdimsz  | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
| rsvd |rsvd  |xdimsz   | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
|xdimsz|ydimsz| zdimsz  |rs/nx/ny |sk1/invxy|offset| 0b00  |0b11 |ModS/Sq|
|xdimsz|ydimsz| zdimsz  |rs/pr/col|sk1/invxy|offset| 0b01  |0b11 |ModM/Sq|
|xdimsz|ydimsz| zdimsz  |rs/nx/ny |sk1/invxy|offset| 0b10  |0b11 |ModS/Tr|
|xdimsz|ydimsz| zdimsz  |rs/pr/col|sk1/invxy|offset| 0b11  |0b11 |ModM/Tr|

key:

* ModS - summation with modulo
* ModM - matrix with modulo
* Sq - Square
* Tr - Triangle
* rs        reserved (must be zero)
* pr        permute x/y y/x
* col       column / row
* sk1       skip X
* invxy     invert x, invert y

sk1/invxy is the same as Indexed, exactly, so costs no extra gates

the priority is X/Y however Z is also possible, *and* skip allows
for repeating (see comment #28)
Comment 32 Jacob Lifshay 2023-12-27 11:04:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #31)
> this is doing my head in :) will edit shortly
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/remapyield.py;hb=745c7d22
> 
> 
> |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |Matrix |

mode should be 0b00 here...
Comment 33 Luke Kenneth Casson Leighton 2023-12-27 18:52:53 GMT
(In reply to Jacob Lifshay from comment #32)

> mode should be 0b00 here...

good catch, next time just edit it?

thoughts on the capabilities? anything more complex there is always
Indexed.
Comment 34 Jacob Lifshay 2023-12-28 03:33:02 GMT
(In reply to Luke Kenneth Casson Leighton from comment #33)
> thoughts on the capabilities?

It looks pretty good... I'm too tired at the moment for any further deep thoughts, sorry.

However, I did notice that you seem to have 2 different places to select between triangle/square -- both bits 28 and 29...that seems redundant to me:

(In reply to Luke Kenneth Casson Leighton from comment #31)
> |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
> |----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
> |xdimsz|ydimsz| zdimsz  | nx / ny |sk1/invxy|offset|trngl/0|0b11 |ModuloT|
> |xdimsz|ydimsz| zdimsz  |rs/pr/col|sk1/invxy|offset|trngl/1|0b11 |ModuloM|
Comment 35 Jacob Lifshay 2023-12-28 03:48:43 GMT
some additional thoughts on the multiply algorithm from comment #10:

we'll want to have an additional instruction afterward that clears the temporary values, since this allows the hardware to not have to compute them, which may be much more complex than just computing the product directly, especially if the hardware uses a different multiplication algorithm internally.

(In reply to Jacob Lifshay from comment #10)
> loop:
>     sv.maddedu *y, *a, *b, *y  # RS is stored in scalar t
>     sv.adde *(y + 1), *(y + 1), t
>     svstep.
>     sv.bdnz ... loop

basically, we want to clear `t` and `CA32` afterward so the hardware can pattern-match the whole thing and use whatever multiplication algorithm it likes, without having to compute the correct `t`/`CA32` values, which would basically require an extra multiplier just to get the right values. this gives hardware designers the freedom to use whatever multiplication hardware is the most efficient or fastest or etc. without being forced to produce values for `t` and `CA32` that're just ignored anyway.

we don't need to clear CA since it is known to be zero (but the hardware needs to be able to recognize that CA is zero on entry, or it could just speculate that CA is zero and if it isn't the loop can be restarted and run in scalar mode)

We can just use subfc t, t, t to clear CA, CA32 and t in one simple instruction.
Comment 36 Jacob Lifshay 2023-12-29 20:17:09 GMT
so, should I start implementing the current design? if it's not ready yet I think I'll work on the already established parts of the bigint presentation while I'm waiting.
Comment 37 Luke Kenneth Casson Leighton 2023-12-30 08:48:27 GMT
(In reply to Jacob Lifshay from comment #36)
> so, should I start implementing the current design? 

i appreciate it's holidays (i've spent 40% of the past week asleep!)

gentle but very firm reminder that you are to read and re-read the
bug comments *online*, and when you do so you will observe things
that you missed when tired, such as comment #31 *already implementing*
the modulo REMAP Schedule.

> if it's not ready yet I
> think I'll work on the already established parts of the bigint presentation
> while I'm waiting.

good idea.

the next phase here is to design svshape3 to take the largest portion
of possible *most useful* options that will fit into the limited
space.

until that's done (the assessment not the coding)...

basically this can't be rushed.

the number of modes means that svshape3 will need to be more like svindex

 810 # svindex instruction  <a name="svindex"> </a>
 811 
 812 SVI-Form
 813 
 814 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
 815 | -- | --  | ---   | ---- | ----------- | ------| -------- |
 816 | PO | SVG | rmm   | SVd  | ew/yx/mm/sk | XO    | SVI-Form |
 817 
 818 * svindex SVG,rmm,SVd,ew,SVyx,mm,sk

something like:

 814 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
 815 | -- | --  | ---   | ---- | ----------- | ------| -------- |
 816 | PO | x   | y     | z    | XX/yx/XX/sk | XO    |TODO-Form |


where the two XXs give some options, although i would be tempted
to cut down z to range 1-4 and have more bits for modes


 814 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
 815 | -- | --  | ---   | ---- | ----------- | ------| -------- |
 816 | PO | x   | y     | z/XXX| XX/yx/XX/sk | XO    |TODO-Form |

the modes included from submode and submode2 are EIGHT bits and
that's excluding the offset (!) so it may be necessary to also
design svshape4 *as well* to cover an offset-priority version
just as svshape2 is a 2D offset priority version of the 3D svshape
instruction.

daaaamn.
Comment 38 Luke Kenneth Casson Leighton 2023-12-30 13:35:25 GMT
svshape4 (daaamn combined with svshape3 that's a lot of opcode space)

modulo with offset priority

 945 | 0:5|6:9 |10|11:15  |16:20  | 21:24  | 25 | 26:31 |  Form      |
 946 | -- |----|--| ---   | ----- | ------ | -- | ------| --------   |
 947 | PO |offs|yx| XXXX  | SVd   | YYYY   | sk | XO    | SVM2-Form  |
 948 
 949 * svshape4 offs,yx,XXXX,SVd,sk,YYYY

by using MAXVL to indirectly compute one of the dimensions
it is possible although inconvenient to have quite a few bits
spare and cover all dimensions. perhaps even enough to set
z=1-4 as well... yes. maybe even some spare.

svshape3:

 814 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
 815 | -- | --  | ---   | ---- | ----------- | ------| -------- |
 816 | PO | x   | y     | z/XXX| XX/yx/XX/sk | XO    |TODO-Form |

svshape4:

 945 | 0:5|6:9 |10|11:15  |16:20  | 21:24  | 25 | 26:31 |  Form      |
 946 | -- |----|--| ---   | ----- | ------ | -- | ------| --------   |
 947 | PO |offs|yx| XXXX  | SVd   | YYYY   | sk | XO    | SVM2-Form  |

fitting as closely with svshape2 also saves decoder gates.

hmmm... really, SVd needs to move to bits 6:10 across svshape2
svshape3 and svshape4.

damn damn damn. i really don't like making changes like this, but
it is what it is.
Comment 39 Luke Kenneth Casson Leighton 2023-12-30 17:19:53 GMT
(In reply to Jacob Lifshay from comment #35)
> some additional thoughts on the multiply algorithm from comment #10:
> 
> we'll want to have an additional instruction afterward that clears the
> temporary values, since this allows the hardware to not have to compute
> them, which may be much more complex than just computing the product
> directly, especially if the hardware uses a different multiplication
> algorithm internally.

(see caveat first... ) ok so again a gentle but firm reminder: we are
effectively designing the hardware-equivalent of a Universal Software API.

how and what decisions hardware implementors make is *not our problem*
as designers of that *Universal* Software API.

illustration: if software developers start measuring SV "performance"
on a processor-by-processor basis from even the same manufacturer and
customise assembler based on differences just like they are forced to do
with SIMD ISAs then we have CATASTROPHICALLY failed, as ISA Designers,
to do our goddamn job correctly.

i therefore remind you *once again* not to make ANY assumptions at
the hardware implementation level that could impact or force a particular
decision at the SOFTWARE level.

now the caveat (which is a reminder of what i repeated before, many times)

we *do* actually have to think through, from the implementor's perspective,
if the Software API (aka SV) is actually viable. but - butbutbut - not
from just ONE implementation but ***ALL*** possible implementations.

i stopped you from putting assumptions about hardware in the past, into
a presentation, and am slightly peeved to have to remind you again,
but please *for goodness sake* under *no circumstances* are you to
write a presentation or engage with anyone on the basis of assuming,
or let someone assume, that there shall and will and will ONLY BE just
one and exclusively one hardware implementation...

... or, much much worse, that ALL hardware implementations are FORCED
to have, for example, a JIT macro-op Fusion Hardware Engine.

you are effectively right now telling Embedded Implementors, "you
MUST HAVE a Macro-Op Fusion Engine.  you MUST direct this complex
pattern of SIX INSTRUCTIONS at a dedicated back-end. you have NO CHOICE
in this matter".

yeh?

we have to be so so careful here. i know you said "if", but the
instructions you mention are more of a "Programmer's Note" that
*some* implementations *may* have optimised Macro-Op Fusion.

they will be extremely high-end ones.  spotting and matching a
sequence of six to seven instructions is pretty extreme.
Comment 40 Jacob Lifshay 2023-12-31 05:09:23 GMT
while we're designing new svshape instructions, we should also think about register arguments (GPR and/or VL) so we can set dynamic x/y/z dimensions and/or dynamic offsets (some reasonable selection of those, in particular we need some kind of dynamic offset for bigint shifts of more than 64-bit, and dynamic x dimension would be really useful for parallel prefix-sum and parallel reduction).
Comment 41 Jacob Lifshay 2023-12-31 05:23:31 GMT
(In reply to Luke Kenneth Casson Leighton from comment #39)
> we *do* actually have to think through, from the implementor's persective,
> if the Software API (aka SV) is actually viable. but - butbutbut - not
> from just ONE implementation but ***ALL*** possible implementations.

yes, which is why I think we should design it so if some implementation decides it likes some other multiplication algorithm and wants to translate, it isn't required to produce some other expensive outputs that basically no software will care about, but are practically required for correctness -- the temporaries.

if the cpu has to look 3000 instructions into the future to see that the temporary is ultimately overwritten, no reasonable cpu for a very long time will be able to do that, so will be required to compute the temporary values which are quite expensive.

from a SV consistency point of view, we can't just declare the temporaries as undefined (since there isn't a good reason why a simple loop over scalar multiplies should suddenly be allowed to give arbitrary random results),
so the next best option is to have a cheap simple instruction to clear the temporaries after -- unless we can come up with a bigint mul algorithm that naturally overwrites all temporaries (e.g. the only things changed are the product's registers), or where the values left in the temporaries are cheap to compute.

obviously if you're using a simple scalar cpu, you can omit the clear-temporaries instruction, but if you could be using a high-end cpu, imo it's worth having.
Comment 42 Jacob Lifshay 2023-12-31 05:30:13 GMT
(In reply to Luke Kenneth Casson Leighton from comment #39)
> you are effectively right now telling Embedded Implementors, "you
> MUST HAVE a Macro-Op Fusion Engine.  you MUST direct this complex
> pattern of SIX INSTRUCTIONS at a dedicated back-end. you have NO CHOICE
> in this matter".

I never said the hardware has to, I've always just said it *can*, not *must*. also, it probably doesn't have to match all the instructions, just the 4 instructions in the loop and matches on the current SV state.

actually, while thinking about this, I think we'll want an instruction that has the combined functionality of bdnz and svstep. maybe one of the branch instruction's bits can be used for that for sv.bc?
Comment 43 Luke Kenneth Casson Leighton 2023-12-31 06:58:03 GMT
(In reply to Jacob Lifshay from comment #40)
> while we're designing new svshape instructions, we should also think about
> register arguments (GPR and/or VL) so we can set dynamic x/y/z dimensions

i said no and it means no. please read the spec.

> and/or dynamic offsets (some reasonable selection of those, in particular we
> need some kind of dynamic offset for bigint shifts of more than 64-bit, and
> dynamic x dimension would be really useful for parallel prefix-sum and
> parallel reduction).

not happening, i have said this before and it is inviolate and not
to be discussed further under this bugreport as it is firmly out of
scope. they can write directly to SVSHAPEs and take the consequences
after reading the spec's warnings.
Comment 44 Luke Kenneth Casson Leighton 2023-12-31 07:08:06 GMT
(In reply to Jacob Lifshay from comment #42)

> I never said the hardware has to,

i know, i apologise, i'm a little jumpy on this topic, things
need to be spelled out explicitly (i am partly writing for future
readers here).

> I've always just said it *can*, not
> *must*. also, it probably doesn't have to match all the instructions, just
> the 4 instructions in the loop and matches on the current SV state.

4 is still far too complex. even just 2 is complex enough.
good to have the discussion but please bear in mind *all*
possible potential implementations have to be considered, not
just one.

ISA design is tedious extreme-patient work.

> actually, while thinking about this, I think we'll want an instruction that
> has the combined functionality of bdnz and svstep.

good thought/idea: already tried it. started to break RISC paradigm.
far too complex both in hardware and in terms of combined functionality.

mostly it was down to number of register hazards exceeding 3-in 2-out
if i recall, but also that branch analysis goes completely sideways.

> maybe one of the branch
> instruction's bits can be used for that for sv.bc?

there are already 512 combined options which is alarming the OPF
(with good reason) so no. the unit tests are going to be a massive
job.

good to have the discussion but can we please keep this bugreport in
scope and on topic? its budget is already quite small
Comment 45 Luke Kenneth Casson Leighton 2023-12-31 07:14:51 GMT
(In reply to Jacob Lifshay from comment #41)

> software will care about, but are practically required for correctness --
> the temporaries.

too complex to discuss and completely out of scope for this bugreport.
the focus of this bugreport, 1155, is the svshape3 and svshape4
instructions, when we are under time pressure and a limited budget.

shelve it please.

> if the cpu has to look 3000 instructions into the future to see that the
> temporary is ultimately overwritten,

if you're looking 3,000 instructions into the future then you already
have a massively-optimised CPU and are loooong into "low-hanging fruit".
5% or above you pay attention to, first. 0.33% is just not worth
bothering with vs the cost.

drop it please: mentally do the cost-benefit analysis ok?
Comment 46 Jacob Lifshay 2023-12-31 07:18:12 GMT
(In reply to Luke Kenneth Casson Leighton from comment #45)
> if you're looking 3,000 instructions into the future then you already
> have a massively-optimised CPU and are loooong into "low-hanging fruit".
> 5% or above you pay attention to, first. 0.33% is just not worth
> bothering with vs the cost.

i meant that no one's making a cpu that can look 3000 instructions ahead anytime soon, so we should consider ways to inform the cpu that those results are unused.
Comment 47 Luke Kenneth Casson Leighton 2024-01-01 17:43:19 GMT
(In reply to Jacob Lifshay from comment #46)

> i meant that no one's making a cpu that can look 3000 instructions ahead
> anytime soon, so we should consider ways to inform the cpu that those
> results are unused.

DROP it.

keep this bugreport on topic and in scope. you are making me
angry by not respecting that we have to FOCUS on meeting our
obligations and committments to NLnet on a hard immovable
deadline.

back to the topic for which this SMALL budget is intended:

swapping svindex SVd and offset is too much, and too disruptive.
it will need its own budget and to be Scheduled for an atomic
change including and especially binutils.

so this means it is free and clear to do svshape3 and 4, but
with temporary operand positions.

can you get that done as quickly as possible?
Comment 48 Luke Kenneth Casson Leighton 2024-01-01 18:12:20 GMT
(In reply to Luke Kenneth Casson Leighton from comment #47)

> can you get that done as quickly as possible?

i've started editing comment #0 to begin a rapid prototype
phase on bitpositions of fields.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=1edbb6a5ea5333e0eae3d1e9d4e0c66132c41e6c;hb=fa603a1e9f2259d86acf4e9451937a000d099289#l335
Comment 49 Luke Kenneth Casson Leighton 2024-01-03 00:14:02 GMT
edited comment #0 again to jam submode3 and 4 into a 7 bit XO:
    bits 24,26:31

when bit 24 is 0 it is svshape3
when bit 24 is 1 it is svshape4

the entire sv*remap set is a mess but that needs resolving separately.
Comment 50 Luke Kenneth Casson Leighton 2024-01-03 23:11:43 GMT
jacob so sorry, i just realised that these need to be "targetted".
svshape3 can have "modes suitable for bigint", but svshape4 has
to be able to set 1 or more of SVSHAPE0-3

the logic behind that is like the matrix mode, you can do
matrix multiply in 3 instructions (SVrm=0b0000 for that one)
and it rewrites *three* SVSHAPE SPRs in one hit, to do that.

but with svoffset you can set up much more tailored arrangements
and i had forgotten about that.

therefore the design of svshape3 and 4 need really to be
done in conjunction with the ed25519 and/or pow-mod to
get a clear picture / use-case.
Comment 51 Luke Kenneth Casson Leighton 2024-01-04 01:53:08 GMT
reservations for svshape3&4, at a 5-bit XO:

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=e980fd1c7
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=61ef110be

5-bit XO means XO is in bits 27-31, which i have edited in comment #0
for svshape4 already, TODO svshape3.

svshape4 i suggest something different from svoffset and svindex: just
specify si=0-3 for SVSHAPE0-3. no attempt to set multiple SHAPEs
like how svoffset and svindex do (it takes 6 bits to do that, also
setting up SVSTATE RA/RB/RC/RT/EA which is too much to attempt with
these complex Modulo REMAPs).
Comment 52 Jacob Lifshay 2024-01-04 07:45:33 GMT
(In reply to Luke Kenneth Casson Leighton from comment #50)
> therefore the design of svshape3 and 4 need really to be
> done in conjunction with the ed25519 and/or pow-mod to
> get a clear picture / use-case.

ok, I'll work on trying to fit them into the bigint mul and powmod algorithms soon, probably tomorrow.
Comment 53 Luke Kenneth Casson Leighton 2024-01-04 08:09:13 GMT
(In reply to Jacob Lifshay from comment #52)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > therefore the design of svshape3 and 4 need really to be
> > done in conjunction with the ed25519 and/or pow-mod to
> > get a clear picture / use-case.
> 
> ok, I'll work on trying to fit them into the bigint mul and powmod
> algorithms soon, probably tomorrow.

awesome. i recommend doing individual unit tests first
(test_caller_svp64_bigmul.py or test_caller_svp64_svshape.py)
working up to
that vertical-first loop idea, keep it to a real short
loop of say 2x2 or 3x2 64-bit otherwise it takes too long
to run. we need *small* unit tests, running for
*no more than 30 seconds total*, first. these act as "demos"
and end up with significant comments (see strncpy in ldst test
for what i am looking for)

write and use svshape4 initially as that's the one that can
have (needs) "explicit" control (targetting individual SVSHAPEs
but also including an offset).

remapyield.py does now have modulo, triangle, and rhombus capability,
you'll need to tie it in to ISACaller just like the other remaps.
set mode=0b11 rather than mode=0b00

once we know how many svshape4s are needed, plus an svremap instruction,
*then* we have enough proper working knowledge to design svshape3.

to give you some idea, this is the schedule for ed25519
https://bugs.libre-soc.org/show_bug.cgi?id=773#c3
Comment 54 Jacob Lifshay 2024-01-05 08:37:04 GMT
(In reply to Jacob Lifshay from comment #52)
> ok, I'll work on trying to fit them into the bigint mul and powmod
> algorithms soon, probably tomorrow.

I started getting ready to work on them, but ran into a bug that looks like it blocks most work due to making SVSHAPE unusable for simulations afaict:

random test case that fails (I expect there to be lots):
pytest -v src/openpower/decoder/isa/test_caller_svp64_dct.py::DCTTestCase::test_sv_remap_fpmadds_idct_outer_8

src/openpower/decoder/isa/svshape.py:147: in skip
    if self.is_indexed() or self.is_triangle():
src/openpower/decoder/isa/svshape.py:54: in is_triangle
    return self.mode == 0b11 and self.skip in [0b00, 0b01]
src/openpower/decoder/isa/svshape.py:147: in skip
    if self.is_indexed() or self.is_triangle():
E   RecursionError: maximum recursion depth exceeded while calling a Python object
!!! Recursion detected (same locals & position)

it isn't obvious if is_triangle should be accessing:
return self.fsi['skip'].asint(msb0=True)
or:
inv = self.fsi['invxyz'].asint(msb0=True)
return (inv & 0b100) >> 2
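the cycle is between the skip property and is_triangle; a minimal
standalone repro (names simplified from svshape.py):

```python
# Simplified repro of the mutual recursion: the `skip` property calls
# is_triangle(), which in turn reads `self.skip` again.
class Shape:
    mode = 0b11

    @property
    def skip(self):
        if self.is_triangle():  # calls is_triangle...
            return 0b00
        return 0b01

    def is_triangle(self):
        # ...which reads self.skip again: unbounded recursion
        return self.mode == 0b11 and self.skip in [0b00, 0b01]

try:
    Shape().skip
    raised = False
except RecursionError:
    raised = True
assert raised  # reading the raw field instead of the property breaks the cycle
```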

so I'm leaving it up to you to fix. I'll work on the algorithms next time I get a chance (probably not friday as I've been waking up too late so may not have time before sunset, so maybe on sunday?)
Comment 55 Jacob Lifshay 2024-01-07 20:02:34 GMT
(In reply to Jacob Lifshay from comment #54)
> so I'm leaving it up to you to fix. I'll work on the algorithms next time I
> get a chance (probably not friday as I've been waking up too late so may not
> have time before sunset, so maybe on sunday?)

Luke, if you don't get around to it, I'll add a commit disabling is_triangle (with a FIXME comment) so SVSTATE works enough to run the simulator. You can then revert immediately before fixing it, but please don't push a revert until you have the commits fixing it (or just push to a different branch).
Comment 56 Luke Kenneth Casson Leighton 2024-01-08 01:11:18 GMT
(In reply to Jacob Lifshay from comment #55)

> then revert immediately before fixing it, but please don't push a revert
> until you have the commits fixing it (or just push to a different branch).

not a problem, i was experimenting.
Comment 57 Jacob Lifshay 2024-01-09 08:54:51 GMT
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Jacob Lifshay from comment #55)
> 
> > then revert immediately before fixing it, but please don't push a revert
> > until you have the commits fixing it (or just push to a different branch).
> 
> not a problem, i was experimenting.

pushed:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=dd21d8be43d9aff018b366d9891e6b19f2bb0a5e

commit dd21d8be43d9aff018b366d9891e6b19f2bb0a5e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Jan 9 00:50:28 2024 -0800

    svshape.py: disable is_triangle for now since it causes RecursionError
Comment 58 Jacob Lifshay 2024-01-10 09:23:03 GMT
added WIP version of mul remap algorithm, it still needs proper SVSHAPE values filled in (I'm using mtspr for now till we have working svshape[34]) and a few other misc. things.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=7350d55815e5b99e8a3b997840c498df73af7648

commit 7350d55815e5b99e8a3b997840c498df73af7648
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Jan 10 01:16:52 2024 -0800

    openpower/test/bigint/mul_remap.py: add WIP mul_remap.py
Comment 59 Luke Kenneth Casson Leighton 2024-01-10 16:23:13 GMT
(In reply to Jacob Lifshay from comment #58)
> added WIP version of mul remap algorithm, it still needs proper SVSHAPE
> values filled in (I'm using mtspr for now 

that's a smart idea, should work fine as long as "modes"
are properly respected by remapyield.py ... they shouulld be...
you added... ah no, here, in svshape.py

    def get_iterator(self):
        log ("SVSHAPE get_iterator", self.mode, self.ydimsz, self.is_indexed())
        if self.mode == 0b00:
            iterate_fn = iterate_indices

should be "if self.mode in [0b00, 0b11]"

and that should do the trick
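
a standalone sketch of that dispatch, with stand-in names rather than the real svshape.py/remapyield.py API:

```python
def iterate_indices(dims):
    """Plain row-major iteration, standing in for remapyield's
    matrix-mode iterator."""
    xdim, ydim = dims
    for y in range(ydim):
        for x in range(xdim):
            yield y * xdim + x


def get_iterator(mode, dims):
    """Sketch of the suggested fix: mode 0b11 (the new triangle/rhombus
    encoding) reuses the same iterator entry-point as matrix mode 0b00."""
    if mode in (0b00, 0b11):
        return iterate_indices(dims)
    raise NotImplementedError("mode %s not modelled here" % bin(mode))
```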

yes, svshape3 instruction can come later, semi-designed in comment #0
see "svshape3 TODO"
Comment 60 Jacob Lifshay 2024-01-16 19:35:25 GMT
(In reply to Luke Kenneth Casson Leighton from bug #1157 comment #52)
> jacob how are you getting on with svshape3? bug #1155 comment #0

i think i ran out of time, so i'm planning on working on the presentation, i'll get back to this after. sadoon can use svindex for now.
Comment 61 Luke Kenneth Casson Leighton 2024-01-16 19:57:13 GMT
(In reply to Jacob Lifshay from comment #60)

> i think i ran out of time, so i'm planning on working on the presentation,
> i'll get back to this after. sadoon can use svindex for now.

ok. sensible decision. i *may* take care of it. if you get the PR
done first, let me know?
Comment 62 Jacob Lifshay 2024-01-16 20:13:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #61)
> ok. sensible decision. i *may* take care of it. if you get the PR
> done first, let me know?

sounds good.
Comment 63 Jacob Lifshay 2024-02-15 04:39:30 GMT
ok, I started working on the algorithm and quickly ran into the issue that we need to implement EXTRA2_MODE support in all of the simulator and assembler/disassembler and insndb... this is because sv.maddedu needs to put RS in RT + MAXVL, which isn't what happens currently: at the moment we just hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary subset of instructions.

We can't just change the simulator to hardcode EXTRA2_MODE = 0 for maddedu, since all the existing sv.maddedu operations in our tests depend on it behaving like EXTRA2_MODE = 1.

        # use SVSHAPE0 for RA, SVSHAPE1 for RB,
        # SVSHAPE2 for RC/RT, SVSHAPE3 for RS
        "svremap 0o37, 0, 1, 2, 2, 3, 0",
        # RS is scalar by using constant remap
        "sv.maddedu *4, *32, *36, *4",  # FIXME: need to set EXTRA2_MODE to 0

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/mul_remap.py;hb=9d45047d7e7bc8678cd3ea30e1e883249f1fa344
Comment 64 Luke Kenneth Casson Leighton 2024-02-15 05:03:20 GMT
(In reply to Jacob Lifshay from comment #63)

> assembler/disassembler and insndb...this is because sv.maddedu needs to put
> RS in RT + MAXVL, which isn't what happens currently, currently we just
> hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary
> subset of instructions.

rats. which... yep i was avoiding that as it's quite a lot of small
changes in lots of areas. including binutils.

> We can't just change the simulator to hardcode EXTRA2_MODE = 0 for maddedu,
> since all the existing sv.maddedu operations in our tests depend on it
> behaving like EXTRA2_MODE = 1.

yyep. sigh. 

an alternative temp workaround is to add an maddedu2. this would
be a hell of a lot less work.
Comment 65 Jacob Lifshay 2024-02-15 09:22:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #64)
> (In reply to Jacob Lifshay from comment #63)
> 
> > assembler/disassembler and insndb...this is because sv.maddedu needs to put
> > RS in RT + MAXVL, which isn't what happens currently, currently we just
> > hardcode the simulator to behave as if EXTRA2_MODE = 1 for an arbitrary
> > subset of instructions.
> 
> rats. which... yep i was avoiding that as it's quite a lot of small
> changes in lots of areas. including binutils.

oh, wait, I just thought of an easy workaround:

just set SVSHAPE3's offset to -1, which puts RS in RT - 1. this works since SVSHAPE3 would have otherwise been set to always give zero as the remapped index.
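
roughly: the final register index is the base plus the remapped index plus the offset, so with a shape that always yields 0, an offset of -1 pins every element to the register just below RT, making RS effectively scalar. a toy model of that (not the real SVSHAPE arithmetic, which has more moving parts):

```python
def effective_reg(base, element, offset):
    """Toy model of the offset trick: a 'constant' shape maps every
    element index to remapped index 0, so the final register is just
    base + offset for all elements."""
    constant_shape = 0  # the shape always yields remapped index 0
    return base + constant_shape + offset


# with RT = 4 and offset = -1, every element of RS lands at register 3
```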
Comment 66 Luke Kenneth Casson Leighton 2024-02-15 15:40:48 GMT
(In reply to Jacob Lifshay from comment #65)

> just set SVSHAPE3's offset to -1, which puts RS in RT - 1, 

i love obscure tricks like this, esp. when i don't understand them :)
Comment 67 Luke Kenneth Casson Leighton 2024-02-16 00:45:53 GMT
i made a quick start on svshape4 this evening but ran out of time,
will work on it tomorrow. basically svshape3 will be targeted
at setting up multiple shapes to do a specific (commonly-used) job
but svshape4 is more general
Comment 68 Luke Kenneth Casson Leighton 2024-02-18 18:59:52 GMT
hiya jacob just added svshape4, untested, added SVI2-Form, fields, and
the pseudocode. should be usable. different from other sv* as it
specifies (si) which SVSHAPE{si} to use. ran out of bits to do anything
sophisticated basically :)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=116a6cfd3
Comment 69 Jacob Lifshay 2024-02-19 11:23:48 GMT
(In reply to Luke Kenneth Casson Leighton from comment #68)
> hiya jacob just added svshape4, untested,

looks nice, I'll have to try it out! I'm planning on working on this some more on monday.
Comment 70 Luke Kenneth Casson Leighton 2024-02-19 12:38:57 GMT
(In reply to Jacob Lifshay from comment #69)
> (In reply to Luke Kenneth Casson Leighton from comment #68)
> > hiya jacob just added svshape4, untested,
> 
> looks nice, I'll have to try it out! 

hell of a lot simpler than the other ones, it is just copying bits
from operands to SPR, one operand says which SPR to target.
feel free to modify/bugfix it indiscriminately :)
Comment 71 Jacob Lifshay 2024-02-20 04:38:19 GMT
I added a skip_case_slow decorator that skips unless the RUN_SLOW env var is set, and applied it to the long powmod test; running the whole test suite with RUN_SLOW unset now takes 3min41s on my desktop. I made it so CI runs with RUN_SLOW set, so it should still test everything.

Motivation: https://bugs.libre-soc.org/show_bug.cgi?id=676#c37

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=4873bfd82621172db7753ed98c56e3308f7d303c

commit 4873bfd82621172db7753ed98c56e3308f7d303c (HEAD -> bug_1155_bigmul_remap, origin/bug_1155_bigmul_remap)
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:35:03 2024 -0800

    powmod: mark case_powmod_256 with skip_case_slow
    
    running full test suite with RUN_SLOW unset now takes 3min41s on
    my desktop -- substantially faster.

commit 10703bbd41ad4d1f5e527101d0a5746d7e851e7b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:34:39 2024 -0800

    .gitlab-ci.yml: always set RUN_SLOW for CI

commit adfef6ed6a57f5214eeaa55c055b4eb78d2826c2
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 20:33:16 2024 -0800

    test/common: add skip_case_slow that skips unless RUN_SLOW env var is set
Comment 72 Jacob Lifshay 2024-02-20 05:30:20 GMT
(In reply to Luke Kenneth Casson Leighton from comment #66)
> (In reply to Jacob Lifshay from comment #65)
> 
> > just set SVSHAPE3's offset to -1, which puts RS in RT - 1, 
> 
> i love obscure tricks like this, esp. when i don't understand them :)

well, perfect: I said we don't need signed offsets, but turns out we still do.

Me saying we don't need signed offsets:
https://bugs.libre-soc.org/show_bug.cgi?id=911#c5

switching to 4-bit signed offsets (instead of unsigned) should be doable, since most applications of offsets can just reverse the trick mentioned in bug #911 comment #5.

here, we *do* need signed offsets since the base register is RC and we're offsetting RS, so we can't change the base register. if we don't have signed offsets, that limits the maximum product size to 14 registers, since we have to put the RS result outside of the product and the max offset we can set is 15.

to be honest, I'm thinking it would be better to just use the obvious algorithm from comment #0 (reproduced below in case comment #0 is edited) instead of trying to cram it all into one vertical-first loop, since that VF loop has very annoying dependency chains through CA. the algorithm from comment #0 also gives us a bigint mul-add for free (just don't clear product beforehand, voila, an additional add input!), saving even more time. it also lets us use the exact same pattern (the bi loop) as a scalar-vector mul using horizontal-first sv.maddedu, likely saving CPU complexity too.

Algorithm from comment #0:
void mul(
    uint64_t *product,  // a_sz + b_sz limbs, little-endian
    const uint64_t *a,
    const uint64_t *b,
    size_t a_sz,
    size_t b_sz
) {
    // clear the limbs read by the first pass (ai == 0)
    for(size_t bi = 0; bi < b_sz; bi++) {
        product[bi] = 0;
    }
    for(size_t ai = 0; ai < a_sz; ai++) {
        uint64_t carry = 0;
        // the inner loop walks b, so it runs to b_sz
        for(size_t bi = 0; bi < b_sz; bi++) {
            // uint128_t here means unsigned __int128 (gcc/clang)
            uint128_t v = (uint128_t)a[ai] * (uint128_t)b[bi];
            v += (uint128_t)carry;
            v += (uint128_t)product[ai + bi];
            carry = v >> 64;
            product[ai + bi] = (uint64_t)v;
        }
        product[ai + b_sz] = carry;
    }
}
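
a python model of the same schoolbook loop, handy as a reference when checking a REMAP implementation against python's native bigints (limbs are little-endian, 64 bits each; names here are illustrative, not from mul_remap.py):

```python
def mul_limbs(a, b):
    """Schoolbook O(n*m) multiply over 64-bit limbs, little-endian.
    Mirrors the C algorithm above; product gets len(a)+len(b) limbs."""
    MASK = (1 << 64) - 1
    product = [0] * (len(a) + len(b))
    for ai, av in enumerate(a):
        carry = 0
        for bi, bv in enumerate(b):
            # 128-bit partial product plus carry plus accumulator
            v = av * bv + carry + product[ai + bi]
            carry = v >> 64
            product[ai + bi] = v & MASK
        product[ai + len(b)] = carry
    return product


def limbs_to_int(limbs):
    """Combine little-endian 64-bit limbs into one python int."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

note that leaving out the `product = [0] * ...` clear (accumulating into an existing buffer instead) gives exactly the mul-add variant mentioned above.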
Comment 73 Jacob Lifshay 2024-02-20 05:39:27 GMT
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0bb801c11404225735c7f25b1ecc6ea70db349e8

I'll switch to using svshape4 once we know the SVSHAPE values work...eliminating extra things that could break.

commit 0bb801c11404225735c7f25b1ecc6ea70db349e8
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 21:34:23 2024 -0800

    mul_remap: WIP -- fill in SVSHAPE[0-2], SVSHAPE3 needs work still
    
    see: https://bugs.libre-soc.org/show_bug.cgi?id=1155#c72

commit d8138c79ff5b746f14a5188f46b2e8d79ec00701
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Feb 19 21:34:03 2024 -0800

    remapyield: add some more demos
Comment 74 Luke Kenneth Casson Leighton 2024-02-20 23:35:23 GMT
(In reply to Jacob Lifshay from comment #72)

> well, perfect: I said we don't need signed offset, but turns out we still do.

joy. luckily the whole purpose of the NLnet grants is to find these
things out :) 

hmm hmm, find another bit.
https://libre-soc.org/openpower/sv/draft_opcode_tables/

i know, move svshape3 to draft/temp-PO5 then use a 4-bit XO
for svshape4, something like:

| 0:5|6:9|10  |11:15  |16:20 | 21 22:23 24:25 |26:27| 28:31 | Form   |
| -- |---|----| ---   | ---- | -------------- | --  | ------| -------- |
| PO |SVo|SVyx| y     | x    | sk /sm2 / sm   | si  | XO    |TODO-Form |

where si and SVyx have swapped places compared to what i did previously.
Comment 75 Jacob Lifshay 2024-02-21 00:11:37 GMT
(In reply to Luke Kenneth Casson Leighton from comment #74)
> (In reply to Jacob Lifshay from comment #72)
> 
> > well, perfect: I said we don't need signed offset, but turns out we still do.
> 
> joy. luckily the whole purpose of the NLnet grants is to find these
> things out :) 
> 
> hmm hmm, find another bit.

we don't need another bit for now, 4-bit signed is sufficient for this. though having more bits would definitely be useful when trying to set offset dynamically, for bigint shifts by >64-bits.
Comment 76 Luke Kenneth Casson Leighton 2024-02-21 00:19:32 GMT
(In reply to Jacob Lifshay from comment #75)

> we don't need another bit for now, 4-bit signed is sufficient for this.

there's only 3 in svshape4.

svshape4, 2nd mode:
--------

 | 0:5|6:8|9:10|
 | -- |---|----|
 | PO |SVo|si  |

it's pretty normal to get confused between the instruction(s)
and the SVSHAPEs that the instructions update/modify.
Comment 77 Jacob Lifshay 2024-02-21 00:50:20 GMT
(In reply to Luke Kenneth Casson Leighton from comment #76)
> (In reply to Jacob Lifshay from comment #75)
> 
> > we don't need another bit for now, 4-bit signed is sufficient for this.
> 
> there's only 3 in svshape4.

ok, doesn't matter for this case since it uses matrix mode to get a constant with offset -1, not rhombus, triangle, or other modes, so it'll use svshape, not svshape4.

SVSHAPE3 = SVSHAPE(0)
SVSHAPE3.order = (0, 2, 1)  # so something is non-zero
SVSHAPE3.offset = -1  # FIXME: offset is unsigned
Comment 78 Luke Kenneth Casson Leighton 2024-03-05 14:30:52 GMT
editing svshape4 *again* :) SVyx needs to be 2-bit, split
across 28 & 31, to fit into PO5 (temporary)

https://libre-soc.org/openpower/sv/draft_opcode_tables/
Comment 79 Luke Kenneth Casson Leighton 2024-03-05 14:41:43 GMT
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=9500dc9b0

annoying, WIP, svshape4 is all over the shop. moving it to PO5 and getting
the fields sorted out.
Comment 80 Luke Kenneth Casson Leighton 2024-03-07 11:30:00 GMT
holy cow this is getting complex :)

currently (remapyield.py example) this is achievable:

  triangle 5 5 5 1
  iterate_indices 5 tri True mod True offs True dims 1 5 5 mode 0b11 sm2 0b0 _skip 3 inv [0, 0, 0]
   0->0  11
   1->1   1  2->2  11
   3->2   1  4->3   1  5->4  11
   6->3   1  7->4   1  8->5   1  9->6  11
  10->4   1 11->5   1 12->6   1 13->7   1 14->8  111

but what is actually needed is

  0
  10
  210
  3210

working through it
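
a quick standalone yielder for that target pattern (row r emits r, r-1, ..., 0), kept separate from remapyield.py's machinery as a reference for what the triangle mode should produce:

```python
def tri_rev_indices(n):
    """Yield the reversed-triangle pattern wanted above: row 0 gives 0,
    row 1 gives 1 0, row 2 gives 2 1 0, and so on -- i.e. the
    0 / 10 / 210 / 3210 layout."""
    for row in range(n):
        for i in range(row, -1, -1):
            yield i
```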
Comment 81 Jacob Lifshay 2024-05-14 01:42:24 BST
After doing some research, it looks like we only promised NLnet that we'd work on demos of cryptographic algorithms, not which specific ones we'd work on or what kind of demos those would be. NLnet only has #773 on the MOU, which this bug is part of. Therefore I think we can say that the mul_remap.py source code and the discussion on this bug qualify as a demo, so we can close this bug as completed.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/mul_remap.py;h=d11ef619650882df2234a4c75d64fa17638a063b;hb=9500dc9b04f3f80bd3f0fbf1c83887f07548818c

For additional work outside of the cryptorouter grant, we can create a new bug if/when we get around to that.
Comment 82 Luke Kenneth Casson Leighton 2024-05-14 01:57:22 BST
(In reply to Jacob Lifshay from comment #81)

> which this bug is part of. Therefore I think we can say that the
> mul_remap.py source code and the discussion on this bug qualifies as a demo,
> so we can close this bug as completed.

yeah i'd agree with that logic.