1157 – Implement poly1305

Bug 1157 - Implement poly1305

Summary: Implement poly1305

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Milestones (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	2021-02-052

URL:	https://loup-vaillant.fr/tutorials/po...

Depends on:	1155
Blocks:	1158 1159
	Show dependency tree / graph

Reported:	2023-09-10 17:25 BST by Sadoon Albader
Modified:	2024-08-12 15:12 BST (History)
CC List:	4 users (show)

See Also:	1158 1159 1155 1007 965 1006 1192
NLnet milestone:	NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks:	1500
budget (EUR) for this task, excluding subtasks' budget:	1500
parent task for budget allocation:	773
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	donated={amount=1200, submitted=2024-05-21} lkcl={amount=300, submitted=2024-05-21}

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sadoon Albader 2023-09-10 17:25:13 BST

Implement poly1305

---

* TODO: copy style of chacha20 unit test 
        https://bugs.libre-soc.org/show_bug.cgi?id=965#c0
        this involves putting a python-based version of
        poly1305 in the unit test and comparing output
* TODO: replace Index REMSP with bigmul REMAP when ready
        (see bug #1155)
* TODO: (if time/budget) drop the algorithm into plain assembler
        (see bug #1007)

Comment 1 Luke Kenneth Casson Leighton 2023-09-10 18:17:23 BST

sadoon if you copy the style of chacha20 using indices (see bug #965)
you will have something that is very straightforward to drop the
bigmul REMAP in its place.

Comment 2 Sadoon Albader 2023-09-10 18:50:50 BST

https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:37:56

Comment 3 Luke Kenneth Casson Leighton 2023-09-10 21:10:02 BST

(In reply to Sadoon Albader from comment #2)
> https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:37:56

ok that's not a bad implementation, look for regular patterns.

btw this will do as the unit test just drop that into the openpower-isa
repo, make sure to preserve copyright, where it came from, and the license

https://github.com/ph4r05/py-chacha20poly1305/blob/master/chacha20poly1305/poly1305.py

hubert's is LGPLv2
https://github.com/ph4r05/py-chacha20poly1305/blob/master/LICENSE

the regular patterns you can source the operands from alternate
regs, using REMAP Index and Vertical-First just like in chacha20


so, line 176
https://github.com/floodyberry/poly1305-donna/blob/e6ad6e091d30d7f4ec2d4f978be1fcfcbce72781/poly1305-donna-64.h#L176

        c = 0 # makes the pattern regular
	h1 += c      c = (h1 >> 44); h1 &= 0xfffffffffff;
	h2 += c;     c = (h2 >> 42); h2 &= 0x3ffffffffff;
	h0 += c * 5; c = (h0 >> 44); h0 &= 0xfffffffffff;
	h1 += c;     c = (h1 >> 44); h1 &= 0xfffffffffff;
	h2 += c;     c = (h2 >> 42); h2 &= 0x3ffffffffff;
	h0 += c * 5; c = (h0 >> 44); h0 &= 0xfffffffffff;
	h1 += c; # do this manually (outside the loop)

what you can do is: 

* 2 regs with consts 0xffff and 0x3fffff then REMAP Indices
  0 1 0 0 1 0
* another 2 regs with shift amounts 44 and 42 then REMAP
  Indices 0 1 0 0 1 0 oh look those are the same, use the
  exact same SVSHAPE as above
* 2 regs with a constant 1 and 5 for a mul c*1 or c*5,
  here you will need a 2nd REMAP Index 0 0 1 0 0 1
* 3 regs h0 h1 h2 use another REMAP Index 1 2 0 1 2 0

if you are very lucky you can target the same reg for the shift
as the AND, which means you don't need to do svremap inside the
loop,you can do it outside.


whatever you do, *go slowly*. even just the above is enough
to get on with, do it *as its own unit test*.  it will be:

* 5 instructions to set up Index SVSHAPEs and loop counter,
  and setvl in vertical-first mode
* a mul, an add, a shift, an AND
* svstep to decrement the counter and bc to do the VF loop
* an add for h1+=c

12 instructions, absolute max.

(mul by 5 or 1 avoids pissing about with branches, which would
damage parallelism).

strongly suggest *only* doing this, as a first incremental step.

Comment 4 Luke Kenneth Casson Leighton 2023-09-11 17:26:24 BST

https://libre-soc.org/irclog/latest.log.html#t2023-09-10T18:44:00

konstantinos was only able to do this *after* i had created the
test_caller_svp64_chacha20.py unit test.

in ISACaller you can pre-load *and extract* all regs in python
which is far far easier.

go *slow*, maximum of 5-8 lines of assembler at a time, and bootstrap
your way up.

then once the snippets are ready in test_caller_xxxx.py you are
ready to move to pypowersim.  you will need to do a *full* function
as a ".s" file just like how konstantinos did it, starting from:

* a POWER v3.0 c function compiled to assembler with gcc -S
* dropping that into the test binary *in place* of the c file
* converting line by line to svp64 assembler, taking the fragments
  from the *python* unit test and dropping them in.

all of this is slow patient and above all *incremental* work that you
can in no way shortcut or attempt to just go "i will do everything from
cold".

Comment 5 Luke Kenneth Casson Leighton 2023-09-13 01:01:12 BST

    "This form is the most parallel possible. We could compute every
     product in parallel, then sum everything, map-reduce style.
     There’s only one problem: we need to compute r squared, cubed,
     and so on. To avoid this, we can take a hybrid approach, with two
     applications of Horner’s rule instead of just one:"

https://loup-vaillant.fr/tutorials/poly1305-design?

sadoon we have Parallel Prefix REMAP for doing these cascading
multiplies but i am not sure it s worth it here.

Comment 6 Luke Kenneth Casson Leighton 2023-09-16 10:19:04 BST

sadoon always always always cross-reference. do it immediately at the
time that you have the discussion.
https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:26:58

Comment 7 Luke Kenneth Casson Leighton 2023-09-16 10:27:10 BST

https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:39:01

what is happening here is explained in this page
https://libre-soc.org/openpower/sv/biginteger/analysis/

adde uses XER.CA as carry-in and creates XER.CA as carry-out.  therefore
if you do adde adde adde adde with successive register numbers you have
created a big-int add by chaining the carrys together on each adde.
*therefore* you can *replace the adde sequence with **ONE** sv.adde
instruction.

real simple.

the maddedu instruction works similarly by treating one of the
reg pair as a *64-bit* carry-in carry-out.

btw one absolutely critical thing to understand about Simple-V is
that the Vector Length is in **ELEMENTS NOT BITS**. this makes it
natural for humans to work with.

Comment 8 Luke Kenneth Casson Leighton 2023-09-17 12:57:08 BST

sadoon i'm going to do a quick python port of the
excellent algorithm you found (written in c). this
will make it easier for you to understand the porting
process and also keep to pure python.

Comment 9 Luke Kenneth Casson Leighton 2023-09-17 13:15:42 BST

added hubert's LGPLv2.1 pure-python implementation of poly1305
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=8d18377c0

Comment 10 Luke Kenneth Casson Leighton 2023-09-17 13:23:20 BST

https://albader.co/poly1305.sv
https://libre-soc.org/irclog/%23libre-soc.2023-09-15.log.html#t2023-09-15T17:49:42

Comment 11 Sadoon Albader 2023-09-17 15:48:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #8)
> sadoon i'm going to do a quick python port of the
> excellent algorithm you found (written in c). this
> will make it easier for you to understand the porting
> process and also keep to pure python.

Thanks for that, was planning to post my findings and chat logs ASAP but got busy with something urgent over the weekend.

Comment 12 Luke Kenneth Casson Leighton 2023-09-17 17:20:23 BST

(In reply to Sadoon Albader from comment #11)

> Thanks for that, was planning to post my findings and chat logs ASAP but got
> busy with something urgent over the weekend.

no problem. i am currently going "oink" over how MUL macro works
in poly1305-donna-64.h which is... behaving oddly

Comment 13 Sadoon Albader 2023-09-17 17:50:03 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Sadoon Albader from comment #11)
> 
> > Thanks for that, was planning to post my findings and chat logs ASAP but got
> > busy with something urgent over the weekend.
> 
> no problem. i am currently going "oink" over how MUL macro works
> in poly1305-donna-64.h which is... behaving oddly

It's quite simple really, I hope this answers your concern:

- unsigned long long is 64 bits on 64-bit systems.
- d0,d1,d2, and d are uint128_t variables.
- all this MUL macro is doing is typecasting the multiplier to 128-bit unsigned to 
ensure that the result stays in 128-bits, because in all cases in the poly1305_blocks 
function the multiplication is between two 64-bit unsigned ints which results in a 
maximum of 128-bit length ints.

FWIW if you see any odd behavior, my poly1305.sv implementation you linked to earlier 
uses pure hardware multiplication and can replicate results from poly1305-donna as long 
as they are in multiples of 16 bytes (I had not implemented "leftover" processing due
to lack of time back then). This was verified both in testbench and on two FPGAs (CPLD & FPGA if I need to be pedantic)

Here's the testbench file (also from over two years ago, sorry for the messy code)
https://albader.co/poly1305_tb.sv

I have only verified this working with ModelSim back then, and have not verified it against the final poly1305.sv (64-bits vs 8-bits) file but it should just work.

Comment 14 Luke Kenneth Casson Leighton 2023-09-17 18:27:53 BST

successful first port:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ecda34dbc

as there are carry-rollovers it is *not* guaranteed that these are
the same, because the carry-over may occur intermittently.  this will
need some random input (or just some sequential data) to check that
poly1305.py and poly1305-donna.py are equivalent, at which point it
becomes possible to start "morphing" the algorithm in the same way
as was done for chacha20: extract "schedules" around "repeated stuff"

Comment 15 Luke Kenneth Casson Leighton 2023-09-17 19:52:53 BST

(In reply to Luke Kenneth Casson Leighton from comment #3)

> so, line 176
> https://github.com/floodyberry/poly1305-donna/blob/
> e6ad6e091d30d7f4ec2d4f978be1fcfcbce72781/poly1305-donna-64.h#L176
> 
>         c = 0 # makes the pattern regular
> 	h1 += c      c = (h1 >> 44); h1 &= 0xfffffffffff;

okaaay nooow we have poly1305-donna-64bit.h converted to python,
*now* it is possible/convenient to morph towards a version that
"illustrates" how REMAP/Indexed would be dropped in, easily

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d8ae5d7fa

this is *very deliberately* slow-and-steady work, sadoon: you have to
keep the unit test working *at all times*, never allowing not even one
single micro-morph-step to produce a wrong result nor deviate along the
"i think i will just start writing some massive amounts of untested code from 
 scratch with no way to test it except after about 2 months daunting work"
path.

in that way, you know:

* where you are
* what the goal is
* you can work from both ends to meet in the middle
* and you always have the unit test to tell you if your changes still work

with that one illustration there, you should be able to do the exact same
conversion trick (look for the common pattern, "break out" the indices/consts
into lists) and once all done it will be near-ridiculously-trivial to convert
each section to SVP64 assembler, and once all fragments are converted,
it becomes just as ridiculously-trivial to stitch them all together into
a single algorithm.

Comment 16 Sadoon Albader 2023-09-17 20:17:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #3)


> okaaay nooow we have poly1305-donna-64bit.h converted to python,

Awesome! Will take a look at it tomorrow, thanks!

I always like to say that writing unit tests before the code makes the process more fun and effective.

Comment 17 Sadoon Albader 2023-09-18 16:19:11 BST

Alright just git pulled openpower-isa to test things out, I wanted to make sure that we can handle both messages larger than 16 bytes *and* messages with leftover bytes (not multiple of 16). I grabbed the known MAC from poly1305-donna with its key and message, all seem to be working flawlessly. Brilliant Luke!

Time to get to porting to SVP64, one baby step at a time.

Comment 18 Luke Kenneth Casson Leighton 2023-09-18 16:37:20 BST

(In reply to Sadoon Albader from comment #17)
> Alright just git pulled openpower-isa to test things out, I wanted to make
> sure that we can handle both messages larger than 16 bytes *and* messages
> with leftover bytes (not multiple of 16). I grabbed the known MAC from
> poly1305-donna with its key and message, all seem to be working flawlessly.
> Brilliant Luke!

awesome. did you remember to add those verifications as unit tests?
i had a bug where c was multiplied by *four* not five and so the
possibility i described in comment #14 is a very real one.

i look forward to seeing you follow the standard project development
procedures (which i also look forward to not being placed into a position
of having to remind you of again) to put the commit git diff URL into
a bugreport alongside reports such as the above, with those unit tests.

i suggest just doing a test_poly1305.py which imports from both
poly1305.py and poly1305_donna.py and chucks a whole stack of
random-length random-data at them both and compares the two.

> Time to get to porting to SVP64, one baby step at a time.

i just tried adding a python-based implementation of dsrd
which *should* be possible to use
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=49d5222a

you can see places where it could potentially be used:

            h0 += (( t0                    ) & 0xfffffffffff);
            h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
            h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

that would be *TWO* uses of dsrd, not 3 - because h2 ends up with the
remainder of t1 from the *second* use of dsrd.

the dsrd calls can be split out:

            dd0 = (( t0                    ) & 0xfffffffffff);
            dd1 = (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
            dd2 = (((t1 >> 24)             ) & 0x3ffffffffff)

            h0 += dd0;
            h1 += dd1;
            h2 += dd2 | hibit;

where (guessing here):

            dd0,dd1 = dsrd(t0, 0, 44)
            dd0,dd1 = dsrd(t0, t1, 44)

you see how that would work? it's not quite there, but close.

Comment 19 Sadoon Albader 2023-09-18 20:26:15 BST

I'm trying to put as much of my focus on the multiplication right now, here goes:


The multiplication is literally just long multiplication with a nice trick
(read https://loup-vaillant.fr/tutorials/poly1305-design to understand better.)

With normal multiplication, if we had hundreds times hundreds (because we are
using 3 64-bit integers of course), it'd go like this:


                          h2        h1        h0
                          r2        r1        r0   *
--------------------------------------------------
                        h2r0      h1r0      h0r0  |
         h2r1           h1r1      h0r1            |
h2r2     h1r2           h0r2                      |
--------------------------------------------------
                                                    +

and the result would be:

h2r2      h2r1+h1r2      h2r0+h1r1+h0r2      h1r0+h0r1      h0r0

Note that h2r2 can also be 2 digits, which makes the final number a maximum of 6
digits.
So we limit the 64-bit variables of both h[x] and r[x] to 44 bits for
elements 0 and 1 and 42 bits for element 2, to ensure that they don't overflow
in addition over 2^130, and don't exceed 128-bits in 64*64-bit multiplication.

Modular multiplication does things a little differently, to get the remainder,
we don't divide by (2^130)-5, we instead multiply by 5 * 2^(130-128), and lo and
behold, we get our remainder *if* the number is less than p (2^130-5, our
prime).

To do this, we divide the two leftmost numbers by 2^128 (essentially moving them
to the right as you will see below) and then multiply them by 5 * 2 ^ 2

Sounds familiar? It's literally sll and srl, nothing special. Now let's look at
what happens:

                          h2        h1        h0
                          r2        r1        r0   *
-------------------------------------------------------
                        h2r0      h1r0      h0r0      |
                        h1r1      h0r1      h2r1*20   |
                        h0r2      h2r2*20   h1r2*20   |
-------------------------------------------------------
                                                        +

Now if you read the C/Python implementation, it makes perfect sense why:
s1 = r1 * (5 << 2)
s2 = r2 * (5 << 2)

d0 = h0r0 + h1s2 + h2s1
d1 = h0r1 + h1r0 + h2s2
d2 = h0r2 + h1r1 + h2r0

Stitch the d's together and do carry propagation:

h0 = (d0           ) & 0xfffffffffff (44 bits)
h1 = (d1 + d0 >> 44) & 0xfffffffffff
h2 = (d2 + d1 >> 44) & 0x3ffffffffff

h0 = (h0 + (d2 >> 42)) * 5
h1 = h1 + (h0 >> 44)
h0 = h0 & 0xfffffffffff

This way, we can have an array of 3 64-bit integers h[x] where
h[0-2] = h0,h1,h2

And that's our hash for one 16-byte block.

Thoughts re: implementing in SVP64:

I still don't fully understand how SVP64 works but I'll make assumptions to
brainstorm and hopefully be pushed in the right direction:

We can use a maximum of 64-bit registers *unless* we use bigint. However,
64-bit registers are what is needed for implementing a 32-bit version of
poly1305, not 64-bits, because multiplying 44*44 (or 42*42 minimum) exceeds
64-bits but is well below 128-bits.

Given the parallel nature of SVP64, we might as well implement a 32-bit version
which limits itself to 26-bits in all variables (130 / 26 = 5) and we would
need a total of 10 registers for r[] and d[].

Since bigint basically stitches numbers together and uses 64-bit multiplication
anyways (please do correct me if I'm wrong), there's no benefit to using
64-bits as opposed to 32-bits, could be quite the opposite with the bigint
overhead (again correct me if I'm wrong)

Now with that in mind, let's assume this (it'll be more numbers but the pattern
repeats)

d0 = h0r0 + h1s2 + h2s1
d1 = h0r1 + h1r0 + h2s2
d2 = h0r2 + h1r1 + h2r0

If we load h[] and r[] to our SVSHAPEs, we can use the clear pattern of

0,0  1,2  2,1
0,1  1,0  2,2
0,2  1,1  2,0

The leftmost number is n where n = 0..2, the rightmost number is n where n =
0..2 in the first case, n = 2,0,1 in the second case (overflowing 2 to 0) and n
= 1,2,0 (again starting from 1, overflowing from 2 to 0)

We multiply all the numbers first in parallel, then add the first 3 additions
in parallel, then the second addition. Seems doable in 3 instructions after
register preparation at least for this part?

Long post I know but I wanted to get things right :)

Comment 20 Sadoon Albader 2023-09-18 20:38:22 BST

Relevant discussion: (the matrix bridge crashed at the beginning, I copy-pasted the comments before it starts)

Sadoon: https://bugs.libre-soc.org/show_bug.cgi?id=1157#c19
Sadoon: I know you and Luke will probably enjoy this
Sadoon: I disected the algorithm to find patterns that can be useful for SVP64
Sadoon: I also found that we should probably do 32-bit ints instead of 64-bit

Continues on the irclog below:

https://libre-soc.org/irclog/%23libre-soc.2023-09-18.log.html#t2023-09-18T20:28:04

Comment 21 Luke Kenneth Casson Leighton 2023-09-18 21:00:06 BST

(In reply to Sadoon Albader from comment #20)
> Relevant discussion: (the matrix bridge crashed at the beginning, I
> copy-pasted the comments before it starts)
> 
> Sadoon: https://bugs.libre-soc.org/show_bug.cgi?id=1157#c19
> Sadoon: I know you and Luke will probably enjoy this
> Sadoon: I disected the algorithm to find patterns that can be useful for
> SVP64

excellent. now read the pseudocode for svfixedarith (below)
ad the bigint test cases. see comment #18 commitdiff.

> Sadoon: I also found that we should probably do 32-bit ints instead of 64-bit

no, because of dsrd and maddedu, both of which are designed as 3-in
2-out instructions.

https://libre-soc.org/openpower/isa/svfixedarith/

this is why i converted poly135-donna-64.h not poly1305-donna-32.h

Comment 22 Luke Kenneth Casson Leighton 2023-09-18 21:05:46 BST

(In reply to Sadoon Albader from comment #19)

> If we load h[] and r[] to our SVSHAPEs, we can use the clear pattern of
> 
> 0,0  1,2  2,1
> 0,1  1,0  2,2
> 0,2  1,1  2,0

ta-daaa.  that is *exactly* the pattern that jacob is organising
for the bigint REMAP.
 
> The leftmost number is n where n = 0..2, the rightmost number is n where n =
> 0..2 in the first case, n = 2,0,1 in the second case (overflowing 2 to 0)
> and n
> = 1,2,0 (again starting from 1, overflowing from 2 to 0)

  x, (x+y)%MAXVL

> We multiply all the numbers first in parallel, then add the first 3 additions
> in parallel, then the second addition. Seems doable in 3 instructions after
> register preparation at least for this part?

absolutely.  utterly ridiculously easy, isn't it?
now write it out in python - in poly1305-donna.py - to make
absolutely sure you got it right.

but first *write that unit test i said twice already is essential*.
please do not make me repeat things more than once.

comment #14
comment #18

Comment 23 Luke Kenneth Casson Leighton 2023-09-18 21:11:17 BST

(In reply to Sadoon Albader from comment #19)

> We can use a maximum of 64-bit registers *unless* we use bigint. 

scalar instructions DESIGNED to accelerate bigint operations. 

see svfixedarith.

these are ***SCALAR** instructions.

they are NOT repeat NOT vector instructions.

the whole purpose of this grant is to design SCALAR instructions
*AND* to design a vector-looping-system that can take advantage
of those SCALAR instructions.

we found that maddedu was really damn good when auto-looped by SVP64.

but this task "actually shows and proves it".  if it *doesn't* do
that then we work out *why*, go back to the drawing board and
*design sclar instructions that do*.

this is how it works.

Comment 24 Sadoon Albader 2023-09-22 19:37:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Sadoon Albader from comment #19)
> 
>   x, (x+y)%MAXVL

Ooooh that's the one, perfect.

> but first *write that unit test i said twice already is essential*.
> please do not make me repeat things more than once.

Yup yup of course, don't worry, on it as I type this :)

Comment 25 Luke Kenneth Casson Leighton 2023-09-24 11:55:51 BST

ok the interception is added of all math primitives
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a79fe44cf

but reporting is only done on ADD and ADDLO, for now.  there's
no filtering, and no "where the heck did that add-operation come from again??"
which i will add later.

Comment 26 Luke Kenneth Casson Leighton 2023-09-24 19:13:05 BST

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0857d123

this adds some *real* basic filtering, to show only those ADD/ADDLO
operations that have a 2nd arg between 0 and 7. there's actually only
one non-zero add i have seen so far, add 5.

this is going to need a *lot* more data run through it but also some
capturing of more information such as "what part of the algorithm was
using ADD/ADDLO".  for tomorrow...

Comment 27 Sadoon Albader 2023-10-11 20:56:06 BST

This is a copy-pasta of poly1305-donna.py BUT now has a small function that runs ISACaller in such syntax:

simulator(lst, RS, RA, RB)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=6a798c6eb31f17bc2e7e4e2f2b25e51c3de39044

Obviously doesn't work with immediate-based instructions but the most important part is:

I have added ASM inlining using ISACaller in a very simple way, which means I can finally move to the SVP64 implementation while still using the same tests.

For now, only the first h0+= ... is implemented in ISACaller, I explicitly did this for testing whether this works, and luckily it only needs one instruction :D

I'll need to reconsider later whether I will use this function (likely not when things grow bigger but it's useful nonetheless) but it was a good exercise to a- revise my python skills and b- make sure I know how to call ISACaller

Comment 28 Luke Kenneth Casson Leighton 2023-10-11 21:08:55 BST

(In reply to Sadoon Albader from comment #27)
> This is a copy-pasta of poly1305-donna.py

as a general rule that is not done. it results in a hell of a
maintenance headache if a bug is found in *one* of the *copies*
of the duplicated code.

also i cannot review the diff because it isn't a diff!

i.e. you made *two* changes - two actions - and merged them 
into one commit.

1. you copied the file AND
2. you made changes to the file.

(the clue being the word "AND", this is not allowed
 under the HDL_workflow development practices)

please the simulator() function to the original file.
(and remove this one. two separate commits please)

also you missed adding yourself as a Copyright holder.

Comment 29 Sadoon Albader 2023-10-11 21:17:11 BST

(In reply to Luke Kenneth Casson Leighton from comment #28)
> please the simulator() function to the original file.
> (and remove this one. two separate commits please)
> 
> also you missed adding yourself as a Copyright holder.

My bad, I got ahead of myself from the excitement xD

Thanks for the input. I guess I'll do this then:

Add this and all the future simulations to the original and always assert (simulation result == original code result) for each simulation run to guarantee that it is always correct. This means not just the program output but the output of chunks of code of course.

Will fix this on Friday as it is almost midnight here.

Comment 30 Sadoon Albader 2023-10-13 15:19:33 BST

(In reply to Luke Kenneth Casson Leighton from comment #28)
> also you missed adding yourself as a Copyright holder.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=38cc7d7633cd96130d984faa6df68187ae333d1f;hp=b84df17c929d1f28d0e2c458e59bea26d512134f

Committed to a new branch to be safe, the original python code was written by yourself but the simulator/SVP64 stuff is being written by me, do I just add myself to the copyright in a separate line?

Again this is only for me to familiarize myself with ISACaller and PPC asm. Next step is to do all the h[x] += ... block in a single SVP64 run which is a perfect opportunity to start with SVP64 as you suggested in https://bugs.libre-soc.org/show_bug.cgi?id=1157#c3 :)

If you run this against the rand.csv generated by running the C code > rand.csv it should run perfectly with no errors, but I think there might be a tiny bug due to the way I did the asm. Not a big deal because I'm only using the current asm stuff to make porting *that* to SVP64 more streamlined and will delete the original asm once I'm done with the SVP64 one.

Comment 31 Sadoon Albader 2023-10-13 15:37:53 BST

My current brainstorming for the initial h[x]+= block:

1- store 0xfffff... (twice) and 0x3ffff... in three registers (p0,p1,p2)
2- store h0, h1, and h2 in 3 registers
3- split t0 and t1 into 3 registers:
    t0_s = t0
    t1_s = t0 >> 44 | t1 << 20
    t2_s = t1 >> 24 | hibit (this might be wrong, need to check)
4- setvl 3? 9?
5- the final run should be something like this:

   h_s[x] += t[x]_s & p[x]

Which is perfectly doable in two or a few more SVP64 lines if I'm now understanding things correctly.

Comment 32 Luke Kenneth Casson Leighton 2023-10-13 18:41:15 BST

(In reply to Sadoon Albader from comment #31)
> My current brainstorming for the initial h[x]+= block:
> 
> 1- store 0xfffff... (twice) and 0x3ffff... in three registers (p0,p1,p2)

yep

> 2- store h0, h1, and h2 in 3 registers
> 3- split t0 and t1 into 3 registers:
>     t0_s = t0
>     t1_s = t0 >> 44 | t1 << 20
>     t2_s = t1 >> 24 | hibit (this might be wrong, need to check)

hibit is "1<<40 if final else 0"

you can use sv.dsrd here with some "morphing".  notice
how 44+20 == 64 and also 24+40 (from the hibit shift)
is also == 64?

therefore if you do:

* t2 = 1
* set an array of [44,24] in RB
* point RA at t0
* point RT at t1_s

you have created t1_s with one vector instruction.

now, of course, it looks pointless because you have one
instruction to setvl=2 then sv.dsrd is another, you might
as well have done two dsrd instructions!

so try that first ok?

theeen.... ha ha you'll like this: try with *three* dsrd instructions,
but use an array [0,44,24] in RB! 

> 4- setvl 3? 9?

2, starting at t1_s and going to t2_s.  but if you do the
trick above of RB=[0,44,24] and start at t0_s then you get
the "t0_s = t0 >> 0 | something" , arrange something to contain
zero and voila, no need to do a manual copy (no addi t0_s,t0,0)


> 5- the final run should be something like this:
> 
>    h_s[x] += t[x]_s & p[x]

t_s[x] but yes
 
> Which is perfectly doable in two or a few more SVP64 lines if I'm now
> understanding things correctly.

yes it is. if you stick to Horizontal-First then annoyingly it
has to be done as:

  # first HF instruction, sv.and
  for x in range(VL)
    tmpregs[x] = t_s[x] & p[x]
  # 2nd HF instruction, sv.add
  for x in range(VL)
    h_s[x] += tmpregs[x]

which given the wasted extra vector of tmpregs is precisely why Vertical
First was invented, as that becomes:

   for x in range(VL)
     tmpreg = t_s[x] & p[x]
     h_s[x] += tmpreg

note tmpREG scalar rather than tmpREGS vector, but you need a loop
and an svstep. instruction which is more instructions, so you only
use VF if the regfile is under pressure or you just want to show off :)

Comment 33 Sadoon Albader 2023-11-17 19:52:56 GMT

(In reply to Luke Kenneth Casson Leighton from comment #3)
> what you can do is: 
> 
> * 2 regs with consts 0xffff and 0x3fffff then REMAP Indices
>   0 1 0 0 1 0
> * another 2 regs with shift amounts 44 and 42 then REMAP
>   Indices 0 1 0 0 1 0 oh look those are the same, use the
>   exact same SVSHAPE as above
> * 2 regs with a constant 1 and 5 for a mul c*1 or c*5,
>   here you will need a 2nd REMAP Index 0 0 1 0 0 1
> * 3 regs h0 h1 h2 use another REMAP Index 1 2 0 1 2 0

I've been trying to understand this for 6 hours so..
here goes:

x = [0] * 16
x[0] = 0xffff
x[1] = 0x3fffff
x[2] = 44
x[3] = 42
x[4] = 1
x[5] = 5
x[6] = h1
x[7] = h2
x[8] = h0

setvl 0, 0, 2(?), 1, 1, 1 # 2 because we have two registers?
svindex 0/2, 0, 1, 0(?), 0, 1, 0 # 0 because they are 64-bit?
svindex 2/2, 1, 1, 0, 0, 1, 0
svindex 4/2, 2, 1, 0, 0, 1, 0
setvl 0, 0, 3(?), 1, 1, 1 # 3 because we have 3 registers?
svindex 6/2, 3, 1, 0, 0, 1, 0

I would start with the remaps but I highly doubt this setvl and svindex instructions are correct. The ChaCha20 pages and code are elaborate but I need a few much simpler examples to wrap my head around this.

Comment 34 Luke Kenneth Casson Leighton 2023-11-17 23:53:45 GMT

(In reply to Sadoon Albader from comment #33)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > what you can do is: 
> > 
> > * 2 regs with consts 0xffff and 0x3fffff then REMAP Indices
> >   0 1 0 0 1 0

let's focus on this first

> I've been trying to understand this for 6 hours so..

(edit: you should have written 5h50m sooner, then!)

> here goes:
> 
> x = [0] * 16
> x[0] = 0xffff
> x[1] = 0x3fffff

start the indices for these at say r20. use 8-bit so put the
indices 0 1 0 0 1 0 in as 0x00_01_00_00_01_00 for a value of
0x000100000100 into r20.

if you set 64-bit indices then you need
* 0x000000000000000 in r20
* 0x000000000000001 in r21
* ....
* ....                 r25

which is as you might guess a total waste of 5 registers.



> x[2] = 44
> x[3] = 42
> x[4] = 1
> x[5] = 5
> x[6] = h1
> x[7] = h2
> x[8] = h0

those are the *values* not the indices (offsets) *to* those values.

> setvl 0, 0, 2(?), 1, 1, 1 # 2 because we have two registers?
> svindex 0/2, 0, 1, 0(?), 0, 1, 0

 20/2 because you point at the INDICES not the VALUES.

> # 0 because they are 64-bit?

 3 because the INDICES are 8-bit otherwise you waste 5 registers

> I would start with the remaps but I highly doubt this setvl and svindex
> instructions are correct. The ChaCha20 pages and code are elaborate but I
> need a few much simpler examples to wrap my head around this.

not going to happen :) or... it is but you need to get the REMAP
concept first. read the spec, pay attention to the pseudocode,
start of the page rhen on section 3.4
https://libre-soc.org/openpower/sv/remap/

basically ask me and i will guide you through it. please *do not*
waste your time thinking "i MUST do this on my own because of some
stubborn sense of feeling i must do it independently", this is
completely new Computer Science that nobody has ever seen before ok?

Comment 35 Luke Kenneth Casson Leighton 2023-11-18 00:02:17 GMT

sadoon read this unit test.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92

Comment 36 Sadoon Albader 2023-11-19 16:07:20 GMT

(In reply to Luke Kenneth Casson Leighton from comment #35)
> sadoon read this unit test.
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92

The first thing that caught my eye was this line:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l110

Says there that vl=10, clearly it is set to 6, is that just a typo?

EDIT:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l46

Yes indeed, looks like it was copy-pasted :P
Nevermind. I'm at that stage where I can't see the forest for the trees.

Comment 37 Luke Kenneth Casson Leighton 2023-11-19 23:06:17 GMT

(In reply to Sadoon Albader from comment #36)

> Says there that vl=10, clearly it is set to 6, is that just a typo?

ignore or fix it, i don't mind which.

> Nevermind. I'm at that stage where I can't see the forest for the trees.

no forest, no trees. it is very simple.  go through it step by step.

overview. SV Loops https://libre-soc.org/openpower/sv/overview/

    for i = 0 to VL-1:
       GPR(RT+i) = GPR(RA+i) + GPR(RB+i)

remap. dead simple. https://libre-soc.org/openpower/sv/remap/

    for i in range(VL):
       GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];

indexed. also dead simple.

    def index_remap(i):
        return GPR((SVSHAPE.SVGPR<<1)+i)
    for i in 0..VL-1:
        GPR(RT + indexed_remap(i)) = ...

*THAT IS ALL THERE IS TO IT*. full stop.

you can see it right there in the unit test expected results.

    118         for i in range(6):
    119             RA = initial_regs[0+ ----===>>>idxs[i]<<<====----]
    120             RB = initial_regs[0+i]
    121             expected_regs[i+8] = RA+RB

RA: Index-REMAPped
RB: *NOT* REMAPped.

you are expecting the VALUES to "magically" be REMAPped but have not
supplied any INDEXes.  you must SUPPLY the Indices to use for REMAP
and i recommended they start at r20. this is NOT THE SAME AS THE VALUES
which start at r0. how else can the values be REMAPped if you supply no
indices?

to repeat:

    THE INDICES HAVE NOTHING TO DO WITH THE VALUES.

conceptually the code that you wrote is attempting to do this:

    118         for i in range(6):
    119             RA = idxs[i] # values conflated as indices

Comment 38 Sadoon Albader 2023-11-21 15:36:35 GMT

(In reply to Luke Kenneth Casson Leighton from comment #35)
> sadoon read this unit test.
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92

Looking at this unit test again after toying with it and changing values here and there and I now understand much better.

I had expected the values to be filled with ASM as well, but here we're (understandably) using python to override the registers as it's much simpler that way.

The obligatory question then:

Would it be reasonable to just prepare the registers with the data we need and just do the SVP64 parts in ASM? Also by extension, setting svstate.x to the needed values might help me finish up the implementation faster. Even if we needed the actual SVP64 ASM in the end, I'd at least get to a known working state that I can translate to ASM.

I hope I'm making sense :)

Comment 39 Sadoon Albader 2023-11-21 15:58:34 GMT

(In reply to Luke Kenneth Casson Leighton from comment #37)

> you are expecting the VALUES to "magically" be REMAPped but have not
> supplied any INDEXes.  you must SUPPLY the Indices to use for REMAP
> and i recommended they start at r20. this is NOT THE SAME AS THE VALUES
> which start at r0. how else can the values be REMAPped if you supply no
> indices?

Ahh now I see my mistake here!
Still toying with the same unit test, it's clearing things up a lot.

Comment 40 Luke Kenneth Casson Leighton 2023-11-21 17:46:18 GMT

(In reply to Sadoon Albader from comment #38)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > sadoon read this unit test.
> > 
> > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> > decoder/isa/test_caller_svindex.py;h=36bb6e837;hb=191a4b0cf#l92
> 
> Looking at this unit test again after toying with it and changing values
> here and there and I now understand much better.
> 
> I had expected the values to be filled with ASM as well, but here we're
> (understandably) using python to override the registers as it's much simpler
> that way.
> 
> The obligatory question then:
> 
> Would it be reasonable to just prepare the registers with the data we need
> and just do the SVP64 parts in ASM? 

as incremental *partial* unit tests... where you do say the h0/1/2 calculation
and only have say 3 lines of assembler... absolutely yes!  if you look
at comment #3 that's *exactly what i suggest*!

> Also by extension, setting svstate.x to
> the needed values might help me finish up the implementation faster.

yes, as we just discussed here about how svindex unit test does precisely
that... yes.  but again, see comment #3, you should do *sub* tests only
that way, and *SLOWLY* build up to a larger (full, complete) functio
that doesn't do that trick
https://libre-soc.org/meetings/sync_up/sync_up_2023-11-21/


> Even if
> we needed the actual SVP64 ASM in the end, I'd at least get to a known
> working state that I can translate to ASM.

eexactly.

> I hope I'm making sense :)

indeed - you can see in comment #3 and from other tests that this is
precisely what i expected you to do as it has worked well for me for years :)

Comment 41 Sadoon Albader 2023-12-08 17:10:27 GMT

Alright, finished up the first part of blocks.

I've decided not to create a test_svp64* file just yet, as that requires rewriting the python implementation for poly1305 (class should inherit from FHDLTestCase etc and that gave me many issues). I've decided to focus more on the SVP64 part to get it working because once that works the rest can be done easily.

A simple function to call ISACaller with the given lst and initial_regs:

def simulation_svp(lst,initial_regs):

    with Program(lst, bigendian=False) as program:
        sim = run_tst(program, initial_regs)
    return sim.gpr 

And we can then use the gpr in python, basically like inline asm.

The SVP64 implementation for the first part of blocks looks like this:

                'or 11, 9, 9', # move t0 to r11
                'rldicl 20, 9,  %d, 44' %(64-44), # equivelant to srdi
                'rldicr 21, 10, 20, %d' %(63-20), # equivelant to sldi
                'or 12, 20, 21', # move result to r12, 20&21 are temps
                'rldicl 13, 10, %d, 24' %(64-24),
                'or 13, 13, 30', # hibit or'ing, might be wrong but tests pass
                'setvl 0, 0, 3, 0, 1, 1',
                'sv.and *11, *11, *0',
                'sv.add *3, *3, *11',

Which is this in python:

            h0 += t0 & 0xfffffffffff
            h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
            h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

Then we can assert to check that the values are correct

            assert (h* == final_regs[*].value), "h* and h* simulation are unequal!"


I've added a comment that the first part could use optimization but I think I'll focus on the actual multiplication and addition first.

https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=c89f618a57b299169e1bcbe4a342278247ad2ce5

Comment 42 Luke Kenneth Casson Leighton 2023-12-08 18:18:16 GMT

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c89f618a57b299169e1bcbe4a342278247ad2ce5

looks good sadoon. first sv instructions!


+                'or 11, 9, 9', # move t0 to r11
+                'rldicl 20, 9,  %d, 44' %(64-44), # equivelant to srdi
+                'rldicr 21, 10, 20, %d' %(63-20), # equivelant to sldi
+                'or 12, 20, 21', # move result to r12, 20&21 are temps
+                'rldicl 13, 10, %d, 24' %(64-24),
+                'or 13, 13, 30', # to accommodate hibit

convert those to *scalar* dsrd. comment #18.
do *not* attempt to go straight to svp64.

incremental.

replace with *scalar* first.

then go, "huh. these two dsrd instructions are using sequentially-
incrementing registers. hey i know, that means i can put "sv." in
front and use only one instruction!"

notice how i did *not* say "go straight to something complex that
you have never used before and do everything at once"?

8ncremental steps, incremental steps, incremental steps.

Comment 43 Luke Kenneth Casson Leighton 2023-12-08 18:21:58 GMT

(In reply to Sadoon Albader from comment #41)

> Which is this in python:
> 
>             h0 += t0 & 0xfffffffffff
>             h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
>             h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

see comment #18.

it is something like this in python:

r1, t0 = dsrd(t0, 0, 44)
r2, t1 = dsrd(t1, t0, 44)

h1 += r1
h2 += r2

or close to it.

Comment 44 shriya.sharma 2023-12-15 13:03:43 GMT

Found a document for poly1305. Good for understanding the algorithm.

https://cryptography.io/en/latest/hazmat/primitives/mac/poly1305/#

Comment 45 Luke Kenneth Casson Leighton 2023-12-15 22:56:55 GMT

(In reply to shriya.sharma from comment #44)
> Found a document for poly1305. Good for understanding the algorithm.
> 
> https://cryptography.io/en/latest/hazmat/primitives/mac/poly1305/#

niice. i'll link on the documentation bugreport, bug #1158

Comment 46 Sadoon Albader 2024-01-09 17:46:05 GMT

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=951586077f4bc63339a3e5e9b686dae715b87d73

Implemented first half of blocks in dsrd. Made use of setvl as well except for dsrd but will use it for dsrd as well.

Comment 47 Luke Kenneth Casson Leighton 2024-01-09 21:31:24 GMT

(In reply to Sadoon Albader from comment #46)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=951586077f4bc63339a3e5e9b686dae715b87d73
> 
> Implemented first half of blocks in dsrd.

> ha! delighted to see!
> Made use of setvl as well except
> for dsrd but will use it for dsrd as well.

+                    'dsrd 11, 1, 3, 12',
+                    'dsrd 6, 0, 4, 13',
+                    'setvl 0, 0, 2, 0, 1, 1',

yes by moving r0 r1 and r6 and r11 you should be able to do

+                    'dsrd 6, 0, 3, 12',
+                    'dsrd 7, 1, 4, 13',

which becomes

+                    'setvl 0, 0, 2, 0, 1, 1',
+                    'sv.dsrd *6, *0, *3, *12',

interested to know what the rldiclr is about? what does it do?

btw you *may* have noticed this?

             h0 += t0 & 0xfffffffffff;
             h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
             h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

can become:

             h0 += (((t-1>> 0 ) | (t0 << 0 )) & 0xfffffffffff);
             h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
             h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

(where t-1 is deliberately set to zero)

therefore you can have *three* dsrds, set vl to *three* not two,
this should reduce by one instruction at least as you can set vl=3
just the once?

also..

+                    'sv.or *11, *11, *5',

should that just be

+                    'or 13, 13, 5'

assuming h0 is r11
         h1 is r12
         h2 is r13

?

Comment 48 Luke Kenneth Casson Leighton 2024-01-09 21:33:04 GMT

p.s. good progress :)

Comment 49 Sadoon Albader 2024-01-16 16:40:23 GMT

(In reply to Luke Kenneth Casson Leighton from comment #48)
> p.s. good progress :)

Thanks :)

I'm actually focusing on the add/mul part since it's the bulk of the work, I can deal with sv.dsrd later.

I've edited the python code a bit to show the full picture:

            d0=MUL(h0,r0);d=MUL(h1,s2);d0=ADD(d0,d);d=MUL(h2,s1);d0=ADD(d0,d);
            d1=MUL(h0,r1);d=MUL(h1,r0);d1=ADD(d1,d);d=MUL(h2,s2);d1=ADD(d1,d);
            d2=MUL(h0,r2);d=MUL(h1,r1);d2=ADD(d2,d);d=MUL(h2,r0);d2=ADD(d2,d);

The dependencies here only apply to add, meaning we can do 9 multiplications first in one go, and then add twice (assuming d here is a separate GPR for each multiplication).

Now, I was about to do this like I assumed in #19, but let's take a second look at this:

We have h0,h1,h2; r0,r1,r2; s1;s2

The multiplication occurs as follows:

d0 = h0r0 + h1s2 + h2s1
d1 = h0r1 + h1r0 + h2s2
d2 = h0r2 + h1r1 + h2r0

If we arrange the registers like this:
[r2,r1,r0,s2,s1]
 1  2  3  4  5
We can do this:

setvl to 3
sv.mul *RT, *h, *3
sv.mul *RT, *h, *2
sv.mul *RT, *h, *1

Perfectly reasonable, but I don't like this. These muls can be one sv.mul if we arrange things correctly. I think we need svindex but am still trying to learn svindex and svremap.

Comment 50 Jacob Lifshay 2024-01-16 18:42:14 GMT

discussed during meeting:
https://libre-soc.org/meetings/sync_up/sync_up_2024-01-16/

Comment 51 Luke Kenneth Casson Leighton 2024-01-16 19:10:37 GMT

(In reply to Sadoon Albader from comment #49)
> (In reply to Luke Kenneth Casson Leighton from comment #48)
> > p.s. good progress :)
> 
> Thanks :)
> 
> I'm actually focusing on the add/mul part since it's the bulk of the work, I
> can deal with sv.dsrd later.
> 
> I've edited the python code a bit to show the full picture:
> 
>            
> d0=MUL(h0,r0);d=MUL(h1,s2);d0=ADD(d0,d);d=MUL(h2,s1);d0=ADD(d0,d);
>            
> d1=MUL(h0,r1);d=MUL(h1,r0);d1=ADD(d1,d);d=MUL(h2,s2);d1=ADD(d1,d);
>            
> d2=MUL(h0,r2);d=MUL(h1,r1);d2=ADD(d2,d);d=MUL(h2,r0);d2=ADD(d2,d);
> 
> The dependencies here only apply to add, meaning we can do 9 multiplications
> first in one go, and then add twice (assuming d here is a separate GPR for
> each multiplication).
> 
> Now, I was about to do this like I assumed in #19, but let's take a second
> look at this:
> 
> We have h0,h1,h2; r0,r1,r2; s1;s2
> 
> The multiplication occurs as follows:
> 
> d0 = h0r0 + h1s2 + h2s1
> d1 = h0r1 + h1r0 + h2s2
> d2 = h0r2 + h1r1 + h2r0
> 
> If we arrange the registers like this:
> [r2,r1,r0,s2,s1]
>  1  2  3  4  5
> We can do this:
> 
> setvl to 3
> sv.mul *RT, *h, *3
> sv.mul *RT, *h, *2
> sv.mul *RT, *h, *1
> 
> Perfectly reasonable, but I don't like this.

yep they are supposed to be covered by bigmul REMAP but
we really need the indices via svindex first, and then
can replicate them in bigmul-REMAP.

i *think* i have it, i did add modulo and i did add a "x+y"
mode so in theory it should be possible to produce

0  2  1
1  0  2
2  1  0 

> These muls can be one sv.mul if
> we arrange things correctly. I think we need svindex but am still trying to
> learn svindex and svremap.

remember you earlier made the mistake of confusing indices with values,
but then resolved that. looks like you need a refresher/reminder

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;hb=HEAD

pay attention to lines 167, 169 and 181. run the unit test,
use print statements if you don't follow immediately ok?

the mapping system is a little awkward you have to remember that
SVSHAPEs 0-3 are *NOT* automagically connected up to registers.

that relationship is set by *svremap* which sets up the 5 possible
registers RA RB RC RT RS with the indices 0-3.

Comment 52 Luke Kenneth Casson Leighton 2024-01-16 19:29:35 GMT

(In reply to Luke Kenneth Casson Leighton from comment #51)

> 0  2  1
> 1  0  2
> 2  1  0 

actually this is x-y modulo 3, isn't it?

that's also possible, i added the option to +/- on both x *and* y.

jacob how are you getting on with svshape3? bug #1155 comment #0

Comment 53 Sadoon Albader 2024-01-17 17:57:45 GMT

(In reply to Luke Kenneth Casson Leighton from comment #52)
> (In reply to Luke Kenneth Casson Leighton from comment #51)
> 
> > 0  2  1
> > 1  0  2
> > 2  1  0 
> 
> actually this is x-y modulo 3, isn't it?


Yup!

We discussed this during yesterday's meeting: https://libre-soc.org/meetings/sync_up/sync_up_2024-01-16/

Jacob helped out a lot, Andrey did too. So this is basically the direction. svindex + svremap.

Comment 54 Jacob Lifshay 2024-05-22 11:07:25 BST

We discovered Luke accidentally submitted an RFP that included Sadoon's portion of this task.
Sadoon confirmed via private email that he's willing to donate his portion to Luke. That way we don't have to redo the RFP.

I forwarded the details to NLnet.

Luke and I discussed this on the mailing list:
Luke saying he'll ask sadoon:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2024-May/006272.html
Our finding the problem:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2024-May/006270.html

Comment 55 Luke Kenneth Casson Leighton 2024-08-12 15:12:32 BST

assigning to 2021-02-052@nlnet.nl work has been completed any QUESTIONS
ask publicly here for complying with project trust and transparency and audit
requirements established since project inception in 2018.