Bug 1044 - SVP64 implementation of pow(x,y,z)
Summary: SVP64 implementation of pow(x,y,z)
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Duplicates: 1153
Depends on: 1155 1183 1161 1175
Blocks: 1237
Reported: 2023-03-28 16:33 BST by Luke Kenneth Casson Leighton
Modified: 2024-01-06 01:51 GMT
CC List: 4 users

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 589
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
jacob=1000 lkcl={amount=1000,submitted=2023-11-30,paid=2023-12-19}


Description Luke Kenneth Casson Leighton 2023-03-28 16:33:30 BST
a demo implementation of a modular exponentiation function
is needed as a unit test to be run by pypowersim and/or
ISACaller

primary uses are for Diffie-Hellman and RSA but there
are many more

https://github.com/TOPDapp/py-diffie-hellman/blob/master/diffiehellman/diffie_hellman.py
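
for reference, a minimal plain-python sketch of the square-and-multiply
modexp being asked for (illustrative only, not the SVP64 version):

    def powmod(x, y, z):
        # binary (square-and-multiply) modular exponentiation
        result = 1
        x %= z
        while y > 0:
            if y & 1:                      # low bit of exponent set
                result = (result * x) % z
            x = (x * x) % z                # square
            y >>= 1
        return result

    assert powmod(3, 65537, 1000003) == pow(3, 65537, 1000003)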

--

* DONE: raise bignum REMAP bugreport bug #1155 and link here, move
        the following into the bugreport:
        - cross-reference to the svshape table "12 REMAP modes"
          so i can begin advising where to put it
          https://libre-soc.org/openpower/sv/remap/#svshape
        https://bugs.libre-soc.org/show_bug.cgi?id=1155
* DONE: extremely simple version as test_caller_svp64_pow.py
* DONE: merge branch into master.
* TODO: bug #1183 comment #2, the assembler is non-optimal
        (and the whole point of the grant exercise is to
         identify opportunities *to* optimise, not just
         blithely sit back and go "oh well this sub-optimal
         bunch of ops will do, ho hum" :) but we *identified*
         an opportunity, must document it at the very least,
         as a new bugreport

see comment #78 for the test results.
Comment 1 Jacob Lifshay 2023-03-28 16:42:21 BST
this is already partially completed by my work on bigint-presentation-code (bug #942), which is currently blocked on compiler support for svp64 since luke correctly observed that the diy compiler backend in there is a major distraction. the code can be easily converted to emitting llvm ir or similar once svp64 is added to llvm.

bug #942 needs to have the subtasks of this bug split out.
Comment 2 Jacob Lifshay 2023-03-28 16:43:21 BST
rsa uses a non-prime modulus, changing title
Comment 3 Luke Kenneth Casson Leighton 2023-03-28 23:20:28 BST
(In reply to Jacob Lifshay from comment #2)
> rsa uses a non-prime modulus, changing title

ah good catch i totally missed that.

need to find algorithms now: shorter, more compact,
and fitting into as few instructions as possible.
completion time is fine as long as it is not overly long.
Comment 4 Jacob Lifshay 2023-03-29 00:48:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > rsa uses a non-prime modulus, changing title
> 
> ah good catch i totally missed that.
> 
> need to find algorithms now: shorter, more compact,
> and fitting into as few instructions as possible.
> completion time is fine as long as it is not overly long.

how long is overly long? the reason i'm using fast mul algorithms is that they are closer to O(n^(1+ε)) than O(n^2): e.g. even toom-2 (karatsuba) multiplication is O(n^1.5...), so a 4096-bit mul is very approximately 512 64-bit mul ops, whereas naive O(n^2) is about 4096 64-bit mul ops. toom-3 and higher need even fewer.
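
for illustration, the karatsuba recursion replaces four half-width
multiplies with three (rough python sketch on plain ints; names are
illustrative and the recursion threshold is arbitrary):

    def karatsuba(x, y, bits):
        # toom-2: x*y via 3 recursive half-width multiplies instead of 4
        if bits <= 64:
            return x * y  # single hardware-sized multiply
        half = bits // 2
        mask = (1 << half) - 1
        x0, x1 = x & mask, x >> half
        y0, y1 = y & mask, y >> half
        z0 = karatsuba(x0, y0, half)
        z2 = karatsuba(x1, y1, half)
        z1 = karatsuba(x0 + x1, y0 + y1, half + 1) - z0 - z2
        return z0 + (z1 << half) + (z2 << (2 * half))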

if we use an algorithm as our demo that is known to be >4x slower, even if our cpu has a 256-bit simd mul unit, it will take approximately 4x as much power for the same speed, or 4x as long for the same power without the simd unit, so everyone will see our proposal as some kind of bad joke, or think that we're just noobs who don't know what they're doing.

the approximately 4000 instruction version of fast mul is what you get when you recursively inline the functions, if we instead leave them as uninlined recursive functions the amount of code used is substantially smaller.

in any case we'll still want montgomery multiplication (which is a relatively simple algorithm that we could just code in c to call our bigint mul algorithm), which transforms modular exponentiation from needing an 8192-bit bigint divmod at each step to only needing it once at the end, saving a substantial amount of time.
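
rough python sketch of the montgomery maths (not the asm; function names
are illustrative, and pow(n, -1, R) assumes python 3.8+); mont_mul keeps
values in the montgomery domain so the per-step divmod disappears:

    def mont_setup(n, word_bits=64):
        # R is the power of 2 just above n, rounded up to whole 64-bit words
        R = 1 << (word_bits * ((n.bit_length() + word_bits - 1) // word_bits))
        n_prime = (-pow(n, -1, R)) % R  # n must be odd, true for RSA/DH moduli
        return R, n_prime

    def mont_mul(a, b, n, R, n_prime):
        # REDC: returns a*b/R mod n using only muls, adds and shifts (a, b < n)
        t = a * b
        m = (t * n_prime) & (R - 1)   # mod R (R is a power of 2)
        u = (t + m * n) // R          # exact: t + m*n is a multiple of R
        return u - n if u >= n else u

to multiply x*y mod n, first convert into the domain with x_m = (x*R) % n
and y_m = (y*R) % n, stay in the domain for the whole exponentiation, and
convert back at the end with mont_mul(result_m, 1, ...).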
Comment 5 Luke Kenneth Casson Leighton 2023-03-29 08:25:19 BST
(In reply to Jacob Lifshay from comment #4)

> how long is overly long?

greater than fits into 3-4 cache lines, absolute maximum.
the goal here is to eliminate TLB and L1 cache lookups.
this is for an embedded scenario not a supercomputing
scenario.
Comment 6 Jacob Lifshay 2023-03-30 02:45:36 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> 
> > how long is overly long?
> 
> greater than fits into 3-4 cache lines, absolute maximum.
> the goal here is to eliminate TLB and L1 cache lookups.
I assume you mean misses rather than lookups.

> this is for an embedded scenario not a supercomputing
> scenario.

ok, using O(n^2) multiplication could work since embedded cares less about speed and we can assume rsa is infrequently run so power doesn't really matter. I also already have code for O(n^2) mul in bigint-presentation-code.
Comment 7 Jacob Lifshay 2023-09-08 04:06:47 BST
*** Bug 1153 has been marked as a duplicate of this bug. ***
Comment 8 Luke Kenneth Casson Leighton 2023-09-08 11:23:33 BST
jacob i am assigning this to you, the REMAP bignum should
be set up as a sub-bug

shriya apologies i have to reassign this as the cryptoprimitives
grant is under pressure of ending, and you are finishing your
dissertation.
Comment 9 shriya.sharma 2023-09-08 18:53:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> jacob i am assigning this to you, the REMAP bignum should
> be set up as a sub-bug
> 
> shriya apologies i have to reassign this as the cryptoprimitives
> grant is under pressure of ending, and you are finishing your
> dissertation.

Nw. Thank you.
Comment 10 Jacob Lifshay 2023-09-14 07:32:50 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e044403f4c243b3248df0872a661756b6cc8a984

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:24:12 2023 -0700

    add SVP64 256x256->512-bit multiply

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ec1196cb3abedd28492be29546e52959dee1a030

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:22:33 2023 -0700

    generalize assemble() fn so other test cases can easily use it
Comment 11 Luke Kenneth Casson Leighton 2023-09-14 08:19:32 BST
(In reply to Jacob Lifshay from comment #10)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=e044403f4c243b3248df0872a661756b6cc8a984
> 
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:24:12 2023 -0700
> 
>     add SVP64 256x256->512-bit multiply
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=ec1196cb3abedd28492be29546e52959dee1a030

all of these can go in a vertical-first loop, using
predication 0b1000100010001000 and sv.addi/pm=nn a,b,c
to "pretend" there is a pair of nested loops.

  30     "sv.adde *7, *7, *21",
  31     "addi 24, 0, 0",
  32     "sv.maddedu *20, *12, 19, 24",  # final partial-product a * b[3]
  33     "addc 7, 7, 20",
  34     "sv.adde *8, *8, *21",

i.e. sv.addi with scalars does exactly the same thing as the
non-prefixed equivalent except you get predication and elwidth
overrides

the extra sv.adde can add a zero'd out source, making the first
iteration orthogonal with all other iterations and consequently
the VF loop can be applied.

and you can use Indexed REMAP on the sv.maddubrs and sv.adde.
the only pain being that you have to enable/disable svremap before each,
so i suggest using "non-persistent" svremap.

also same as Andrey, all files require a copyright notice, you
know this https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13


> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:22:33 2023 -0700
> 
>     generalize assemble() fn so other test cases can easily use it

like it.
Comment 12 Luke Kenneth Casson Leighton 2023-09-14 08:54:07 BST
sigh, would be great to investigate this in a future grant
https://digital.library.adelaide.edu.au/dspace/bitstream/2440/101502/2/02whole.pdf
Comment 13 Jacob Lifshay 2023-09-15 04:20:04 BST
I started changing the register assignments to conform to the ABI regarding volatile/nonvolatile registers, since I want the modular exponentiation algorithm to be able to store temporaries in registers while calling the multiply algorithm, however, I discovered a scalar EXTRA2 encoding bug:
https://bugs.libre-soc.org/show_bug.cgi?id=1161

I spent the rest of the day trying to figure that bug out.

(In reply to Luke Kenneth Casson Leighton from comment #11)
> all of these can go in a vertical-first loop, using
> predication 0b1000100010001000 and sv.addi/pm=nn a,b,c
> to "pretend" there is a pair of nested loops.

I'll work on figuring that out and implementing it when I start working on the REMAP bug. other than changing the registers used, the current multiply algorithm is imho good enough for the purposes of this bug.

> 
> also same as Andrey, all files require a copyright notice, you
> know this https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13

added.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=ca22d9687c91e764bb463ee776fb2f6efbf6eae9

commit ca22d9687c91e764bb463ee776fb2f6efbf6eae9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:07:18 2023 -0700

    change registers used to avoid r13-31 which are reserved/nonvolatile
    
    broken -- blocked on https://bugs.libre-soc.org/show_bug.cgi?id=1161

commit 2bfd298a3844af197d265b74c38094a5edc397b1
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:06:59 2023 -0700

    pass in stack pointer

commit e329df6cbcda7752ab68db8c2713bbc795aa03ef
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:04:55 2023 -0700

    log more register read/writes to LogKind.InstrInOuts

commit 4996bc893db92aa80cefb8cb603739b8c5b9b859
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 17:28:42 2023 -0700

    add copyright stuff
    
    [skip ci]
Comment 14 Jacob Lifshay 2023-09-27 05:51:54 BST
I started adding the divmod algorithm:
"extremely slow and simplistic shift and subtract algorithm"

I completed the algorithm, but there are some bugs I haven't fixed yet. WIP

(edit: add link)
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=8444282d4c29bbeb19e84612c5f075d078d27f3b

commit 8444282d4c29bbeb19e84612c5f075d078d27f3b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:41:58 2023 -0700

    working on adding divmod 512x256 to 256x256

commit 27a6ab7218e0347107cef53ab2badf21bcd14572
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:39:31 2023 -0700

    log writing CA[32]/OV[32] for OP_ADD
Comment 15 Jacob Lifshay 2023-09-27 09:22:26 BST
(In reply to Luke Kenneth Casson Leighton from bug #1175 comment #2)
> (In reply to Jacob Lifshay from bug #1175 comment #0)
> > since I want to use [mcrxrx].
> 
> ahh what for? can you put something about it on the mailing list
> and as usual crossref to bugreport(s)

I'm using mcrxrx to check if an unsigned bigint subtraction has a result that's less than zero (aka it underflowed). the sv.subfe leaves XER.CA set if the result is >= 0, otherwise it's cleared. mcrxrx moves CA to a CR field, so I can branch on it.

basically, it's a combined subtraction-comparison all-in-one.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l112

see the corresponding lines in the python algorithm:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l138
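
a rough python model of the same trick (illustrative only: the loop body
is what each sv.subfe element step does, and ca stands in for XER.CA):

    def bigint_sub_and_compare(a, b, word_bits=64):
        # a, b: equal-length lists of 64-bit words, least significant first
        mask = (1 << word_bits) - 1
        out = []
        ca = 1                          # start with CA set (as subfc does for the lowest word)
        for ai, bi in zip(a, b):
            t = (bi ^ mask) + ai + ca   # subfe: ~RA + RB + CA, with RA=b, RB=a
            out.append(t & mask)
            ca = t >> word_bits
        # ca == 1 here means a >= b (no underflow); mcrxrx copies CA
        # into a CR field so a branch can test it
        return out, ca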
Comment 16 Luke Kenneth Casson Leighton 2023-09-27 11:31:44 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=22e9003

jacob what i'm doing here is a mini-code-morphed-demo investigating
the viability of using bigmul REMAP. i did the exact same thing with
test_caller_svp64_chacha20.py, am helping sadoon on bug #1157 / bug #1159
and ed25519 will need the same thing: looking for the patterns and
condensing them down so that bigmul REMAP (preferably without needing
Vertical-First) can just go *BLAM*, one instruction, the whole lot.

i don't think it's going to be possible to do without Vertical-First
but one "trick" of Vertical-First is that there is "end of inner loop"
notification when using "svstep."

by checking the *correct bit* of CR0 you can do branch-conditional
code that, say, performs that "extra add-with-carry".
it will be quite fascinating to see if that trick works.
Comment 17 Luke Kenneth Casson Leighton 2023-09-27 20:06:17 BST
(In reply to Jacob Lifshay from comment #15)

> basically, it's a combined subtraction-comparison all-in-one.

love it.

this is the key work:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=85abafd4

this is what we are aiming for - first with svindex and then using
that to drive the development of bigmul REMAP and then come back and
replace Indexed REMAP.

loop-unrolling in SVP64 assembler is a "no-no". if it's "needed" then
it means something's seriously wrong and we need to activate "redesign
SVP64" procedures to make sure that the loop-unrolling is definitely,
definitely *not* needed.
Comment 18 Luke Kenneth Casson Leighton 2023-09-27 20:19:04 BST
got rid of the addc - use adde but set CA to zero going in.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d612cd562f

inside the loop is now:

li t[4], 0
sv.maddedu
XER.CA=0
sv.adde

if done as vertical-first it is possible to detect when the
inner loop completes, and do branch-conditional CA=0/sv.adde
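
(a rough python model of what that sequence does per outer-loop
iteration; the register mapping is illustrative, not exact:)

    def mul_row_accumulate(acc, a, b_j, j, word_bits=64):
        # one row of partial products: acc[j:] += a * b_j
        # assumes acc has at least j + len(a) + 1 words
        mask = (1 << word_bits) - 1
        t = 0                            # li t[4], 0
        row = []
        for ai in a:                     # sv.maddedu: lo -> row, hi chains into t
            prod = ai * b_j + t
            row.append(prod & mask)
            t = prod >> word_bits
        row.append(t)
        ca = 0                           # XER.CA = 0
        for i, ri in enumerate(row):     # sv.adde: add the row into the accumulator
            s = acc[j + i] + ri + ca
            acc[j + i] = s & mask
            ca = s >> word_bits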
Comment 19 Jacob Lifshay 2023-09-27 20:54:05 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> this is the key work:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=85abafd4
> 
> this is what we are aiming for - first with svindex and then using
> that to drive the development of bigmul REMAP and then come back and
> replace Indexed REMAP.

that's all part of bug #1155 which has a separate budget that you should be using instead of this one...for this bug we need bare-minimum working mul, all improvements toward the goal of mul REMAP should be under bug #1155
Comment 20 Jacob Lifshay 2023-09-27 20:59:35 BST
(In reply to Jacob Lifshay from comment #19)
> that's all part of bug #1155 which has a separate budget that you should be
> using instead of this one...for this bug we need bare-minimum working mul,
> all improvements toward the goal of mul REMAP should be under bug #1155

nvm, bug #1155's budget is too small for that.
Comment 21 Luke Kenneth Casson Leighton 2023-09-27 22:11:40 BST
(In reply to Jacob Lifshay from comment #19)

> using instead of this one...for this bug we need bare-minimum working mul,

with Indexed. loop-unrolling is prohibited.
Comment 22 Jacob Lifshay 2023-09-28 04:08:32 BST
I got divmod working!

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=78421a7d10dc21670c4f8359a4f005649470305f

commit 78421a7d10dc21670c4f8359a4f005649470305f
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:51:35 2023 -0700

    fix divmod

commit 33db8d3430b5608fc56473668a09746d04d941f9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:50:47 2023 -0700

    in divmod algorithm log regexes that match against expected register values

commit 018562039235cc223464c6e6754e0b85b2ab5e7f
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:48:50 2023 -0700

    test python_divmod_algorithm

commit 6a0680af7318c73b9da2d7cc4995ad1792d1f3ed
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:45:34 2023 -0700

    format code

commit 571acee1aa6859b708b9568538327e8e16b5f891
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 27 19:25:57 2023 -0700

    log asmop to LogKind.InstrInOuts too since only printing `.long 0xFOOBAR` isn't very useful
Comment 23 Luke Kenneth Casson Leighton 2023-09-28 07:57:52 BST
(In reply to Jacob Lifshay from comment #22)
> I got divmod working!

oleeee! :)
Comment 24 Jacob Lifshay 2023-10-06 04:25:39 BST
I wrote the powmod assembly, and got a python equivalent working, however, because divmod is really slow, running one powmod invocation is a few hundred thousand simulated instructions (powmod calls divmod 256-512 times, divmod loops 256 times, each inner loop is ~20 ops).

Therefore, I think I should implement Knuth's algorithm D instead of the shift & subtract divmod algorithm we currently have, since I expect it to run 20-50x faster.

I pushed to the programmerjake/powmod branch.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=f8467f43e8e04f84280ac5586f92222916df0676

commit f8467f43e8e04f84280ac5586f92222916df0676
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Oct 5 19:57:29 2023 -0700

    add WIP powmod_256 -- asm test is currently disabled since divmod is too slow

commit f56b16a029c1e41b323b99f9514e04a42bba9c3a
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Oct 5 19:53:45 2023 -0700

    fix assemble to properly look for whole symbols to replace
    
    previously if there were labels foo and foobar, it would partially
    replace foobar giving 0x<addr>bar, which is wrong.
    
    also optimized to use dict instead of linear search for label names
Comment 25 Jacob Lifshay 2023-10-06 04:27:53 BST
(In reply to Jacob Lifshay from comment #24)
> I wrote the powmod assembly, and got a python equivalent working, however,
> because divmod is really slow, running one powmod invocation is a few
> hundred thousand simulated instructions (powmod calls divmod 256-512 times,
> divmod loops 256 times, each inner loop is ~20 ops).

actually, picking the average number of divmod calls of 384, it comes out to
~2 million operations!
384 * 256 * 20 = 1966080
Comment 26 Jacob Lifshay 2023-10-06 04:31:00 BST
(In reply to Jacob Lifshay from comment #25)
> actually, picking the average number of divmod calls of 384, it comes out to
> ~2 million operations!
> 384 * 256 * 20 = 1966080

for a sense of scale, i'd guess that if we were to use all the optimization techniques and fastest algorithms (montgomery multiplication, maybe karatsuba), it could be done in under ~20k operations.
Comment 27 Luke Kenneth Casson Leighton 2023-10-06 08:38:29 BST
(In reply to Jacob Lifshay from comment #24)
> I wrote the powmod assembly, and got a python equivalent working

hooraay! that's really good. can estimate the number of assembler
instructions now

> however,
> because divmod is really slow, running one powmod invocation is a few
> hundred thousand simulated instructions (powmod calls divmod 256-512 times,
> divmod loops 256 times, each inner loop is ~20 ops).

but there is at least a first version, unit tests can be done.

> Therefore, I think I should implement Knuth's algorithm D instead of the
> shift & subtract divmod algorithm we currently have, since I expect it to
> run 20-50x faster.

yes i concur. nice thing is, that is a fixed (small) task.

can i ask you a favour: make it "adaptable", i.e. parameterisable
as to which input, output and temp regs are used, as well as their
dimensions (defaulting to the ones you are using)?

or, if that is a lot of work, just specify the size (bitlength)
of X and Y?

the reason i ask is that it becomes possible to crank down the
sizes to run in sane times.

we do *not* want people running tests to find that it takes 300 minutes
to complete. if we put this in front of a customer for a demo
(which is actually going to happen) it is ok to say "this is a 128-bit
powmod, it completes quickly but you get the idea".

192 bits is.... 3 64-bit GPRs which is more than enough to show
carry-rollover. am tempted to suggest 128 but it is only 2 GPRs,
i feel happier with 3 (like in the bigint shift/mul/div tests i wrote)

> I pushed to the programmerjake/powmod branch.

irony: this is leaf-node work so a branch is unnecessary unless you
prefer it.
Comment 28 Luke Kenneth Casson Leighton 2023-10-06 08:45:44 BST
btw don't chuck away DIVMOD_512x256_TO_256x256_ASM
because it is useful for illustrative purposes.
i will be doing a customer demo 11th november.
Comment 29 Jacob Lifshay 2023-10-06 08:49:07 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> btw don't chuck away DIVMOD_512x256_TO_256x256_ASM
> because it is useful for illustrative purposes.
> i will be doing a customer demo 11th november.

ok, i'll just rename it then, probably to
DIVMOD_SHIFT_SUB_512x256_TO_256x256_ASM
Comment 30 Jacob Lifshay 2023-10-06 08:53:23 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> > however,
> > because divmod is really slow, running one powmod invocation is a few
> > hundred thousand simulated instructions (powmod calls divmod 256-512 times,
> > divmod loops 256 times, each inner loop is ~20 ops).
> 
> but there is at least a first version, unit tests can be done.

not really, extrapolating from what i recall of divmod's timing, that's like 30hr for one call of the powmod algorithm on one of the fastest modern cpus.
Comment 31 Jacob Lifshay 2023-10-06 09:12:03 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> > Therefore, I think I should implement Knuth's algorithm D instead of the
> > shift & subtract divmod algorithm we currently have, since I expect it to
> > run 20-50x faster.
> 
> yes i concur. nice thing is, that is a fixed (small) task.

sounds good!
> 
> can i ask you a favour: make it "adaptable", i.e. parameterisable
> as to which input, output and temp regs are used, as well as their
> dimensions (defaulting to the ones you are using)?

afaict that needs a register allocator, which is the whole rabbit hole i went down for several months.
> 
> or, if that is a lot of work, just specify the size (bitlength)
> of X and Y?
> 
> the reason i ask is that it becomes possible to crank down the
> sizes to run in sane times.

the crazy times are because shift-sub-based divmod is O(num_bits) and powmod is O(divmod_time * num_bits) which is 256^2 == 65536 -- huge. knuth's algo-D is O(num_words^2) which is 4^2 == 16, combined with powmod gives 256 * 4^2 == 4096, which is doable.
> 
> we do *not* want people running tests to find that it is 300 minutes
> to completion,

with knuth's algo-D, i expect each 256-bit divmod to be <100 operations runtime, so powmod should be testable with a few min runtime.

> when if we put this in front of a customer for a demo
> (which is actually going to happen) it is ok to say "this is a 128 bit
> powmod, it completes quick but you get the idea".
> 
> 192 bits is.... 3 64-bit GPRs which is more than enough to show
> carry-rollover. am tempted to suggest 128 but it is only 2 GPRs,
> i feel happier with 3 (like in the bigint shift/mul/div tests i wrote)

we really need > 128-bit, since algo-D has some special cases that aren't ever hit with just two words and maybe not three. 256-bit isn't very big.
Comment 32 Jacob Lifshay 2023-10-07 01:12:50 BST
started implementing Knuth's Algorithm D:

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=3094512b39f8e9e0e798b483a15f21fc152eb3de

commit 3094512b39f8e9e0e798b483a15f21fc152eb3de
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Oct 6 17:08:35 2023 -0700

    add WIP knuth algorithm D python implementation

commit 4dbfd28bf6d4779f6e7582693f71ad276de12251
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Oct 6 15:21:07 2023 -0700

    rename divmod algorithm -> divmod_shift_sub in prep for adding divmod based on Knuth's Algorithm D
Comment 33 Luke Kenneth Casson Leighton 2023-10-07 04:40:14 BST
(In reply to Jacob Lifshay from comment #32)

>     add WIP knuth algorithm D python implementation

nice. clear.

   362    if qhat * vn[n - 2] > (rhat << word_size) + un[j + n - 2]:

can use 1<<r3 on that, where r3=j, temporarily, rather than Indexed
REMAP. but really, Indexed REMAP is a necessity here.
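
(for context, that test sits inside the standard algorithm-D qhat
estimate-and-correct step; a rough python rendering, with word_size=64,
names matching the file, and n >= 2 assumed:)

    radix = 1 << word_size
    qhat, rhat = divmod((un[j + n] << word_size) + un[j + n - 1], vn[n - 1])
    while qhat >= radix or qhat * vn[n - 2] > (rhat << word_size) + un[j + n - 2]:
        qhat -= 1
        rhat += vn[n - 1]
        if rhat >= radix:
            break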


   380             # subfe
   381             t += ~product[i] + un[j + i]

just sv.subfe! but you need Indexed REMAP on un[j+i]
and can replace it later with Bigmul (rhombic) REMAP.

it is quite ridiculous how small this algorithm will
end up being, esp. as divmod2du was pre-designed to give
exactly that overflow-condition value

     352         if un[j + n] >= vn[n - 1]:
     353             # division overflows word


ultimately though this *is* going to need Vertical-First,
with the "svstep." notification, as the loops j and i are
nested.
Comment 34 Jacob Lifshay 2023-10-09 05:54:04 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=728b09c7f2785d0405a0cf5bc9c91460cd268f3b

commit 728b09c7f2785d0405a0cf5bc9c91460cd268f3b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:46:57 2023 -0700

    finish writing python_divmod_knuth_algorithm_d

commit 30915c1feeb515be74e62cf8b6f93ab7f5fb43aa
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:46:14 2023 -0700

    fix generating invalid divmod tests

commit f2fb0dd37712411459b540edd97005423e3f7162
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 8 21:43:00 2023 -0700

    speed up divmod shift-sub tests by removing most test cases
Comment 35 Jacob Lifshay 2023-10-09 06:08:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #32)
> 
> >     add WIP knuth algorithm D python implementation
> 
> nice. clear.
> 
>    362    if qhat * vn[n - 2] > (rhat << word_size) + un[j + n - 2]:
> 
> can use 1<<r3 on that, where r3=j, temporarily, rather than Indexed
> REMAP. but really, Indexed REMAP is a necessity here.

actually, I think using svoffset is a much better idea. (we need a dynamic instruction to set svoffset for dynamic bigint shifts anyway, bug #973)
> 
> 
>    380             # subfe
>    381             t += ~product[i] + un[j + i]
> 
> just sv.subfe!

yup!

> but you need Indexed REMAP on un[j+i]
> and can replace it later with Bigmul (rhombic) REMAP.
> 
> ultimately though this *is* going to need Vertical-First,
> with the "svstep." notification, as the loops j and i are
> nested.

I disagree: this algorithm isn't very suitable for vertical-first, because it has operations at both one and two loop-nesting levels:

e.g.:
for j in ...:
    qhat = ...  # scalar here
    for i in ...:  # vector here
        mul/add/sub
    for i in ...:  # more vector here with different offsets
        mul/add/sub


this makes it not really work with rhombic REMAP.

another reason vertical-first isn't very suitable: several sections are necessary for correctness but extremely rarely run (for random inputs on the order of 1 in 2^64), such as the qhat overflow and add-back sections.

Therefore I'm planning on writing it to use horizontal-first with svoffset.
Comment 36 Luke Kenneth Casson Leighton 2023-10-09 08:31:17 BST
(In reply to Jacob Lifshay from comment #35)

> I disagree,

that's because you're not familiar with SVP64.
https://bugs.libre-soc.org/show_bug.cgi?id=973#c5

> this algorithm isn't very suitable for vertical-first, because
> it has both operations in one and two loop nesting levels:

> this makes it not really work with rhombic REMAP.

one trick is to save SVSTATE in a temporary SPR then put it back
and carry on.

> Therefore I'm planning on writing it to use horizontal-first with svoffset.

don't do that please. it will just need a rewrite to use vertical-first
after you have spent the time doing HF.

the purpose of the exercise here is not "get it done".
the purpose of the entire grant is to be *efficient* in the
ISA, to prove that it is efficient, and if it is not then
*modify* the ISA to *make* it efficient.

if you have not listened to the primary designer of the ISA
who knows all of the tricks that make algoriths efficient,
thus meeting the criteria of the grant, instead doing something
that is *not* efficient, and therefore does not meet the criteria
of the grant, then how can i justify paying out the RFP?

this is as much a learning exercise for you, jacob. you are still
very unfamiliar with using REMAP. hell, everyone is as it is literally
a new computer science concept.
Comment 37 Luke Kenneth Casson Leighton 2023-10-09 08:50:36 BST
ok a correction there.  i *believe* that 1<<r3 and/or REMAP on
a Scalar with VF mode would give us the level of efficiency.
as this is your task it is down to you to attempt to use them.

if they *don't work* then we can carry out a re-assessment,
and design some new (future) SVP64 features.

even if we run out of time to actually complete the algorithm
the conditions of the grant have still been met *because we did
the assessment*.

but just going straight to something inefficient (such as the
loop-unrolled mul256 algorithm you wrote, although it gets us
one incremental step ahead), this is *not* satisfying the conditions
of the grant.
Comment 38 Jacob Lifshay 2023-10-09 08:59:17 BST
(In reply to Luke Kenneth Casson Leighton from comment #36)
> the purpose of the exercise here is not "get it done".
> the purpose of the entire grant is to be *efficient* in the
> ISA, to prove that it is efficient, and if it is not then
> *modify* the ISA to *make* it efficient.

well, vertical-first is probably not beneficial because the algorithm needs to run a bunch of operations at different rates (1-8 mul/sub for every div, because the mul/sub are nested inside another loop and the div is not). svp64 works well when all of the operations can be done at the same rate, and works less well when the rates wildly differ, because you need to jump through predication hoops to get it to work. scalar powerisa instructions are quite efficient for the operations that aren't uniform and aren't the highest-rate. if you try to cram it all into a vertical-first loop, it may be possible but it won't be pretty and probably won't even be faster than scalar ops and horizontal-first loops, partly because of all the complex setup code and partly because standard OoO cpus can already handle those scalar and hf-svp64 data-flow patterns quite efficiently.
> 
> if you have not listened to the primary designer of the ISA
> who knows all of the tricks that make algorithms efficient,

convoluted and squished into one vertical-first loop with lots of remap isn't the same thing as efficiency; being vertical-first may make it *less* efficient, even if we add new remap modes.

How about this:
we write *both* a horizontal-first (since I find that straightforward and relatively obvious) and later attempt a vertical-first version? we can then compare them and see whether all the extra complexity is beneficial.
Comment 39 Jacob Lifshay 2023-10-09 09:02:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> but just going straight to something inefficient (such as the
> loop-unrolled mul256 algorithm you wrote, although it gets us
> one incremental step ahead), this is *not* satisfying the conditions
> of the grant.

which definition of efficiency are you using?
Comment 40 Jacob Lifshay 2023-10-09 09:59:29 BST
(In reply to Jacob Lifshay from comment #38)
> How about this:
> we write *both* a horizontal-first (since I find that straightforward and
> relatively obvious) and later attempt a vertical-first version? we can then
> compare them and see whether all the extra complexity is beneficial.

another reason to start out with horizontal-first is that powmod is currently sitting effectively untestable until we have *any* fast enough divmod algorithm. I am quite confident I can write a horizontal-first divmod algorithm in a relatively short amount of time, I am not very confident I can write a vertical-first divmod at all and would probably need to go by way of writing a horizontal-first algorithm and then transforming it into a vertical-first algorithm anyway.
Comment 41 Luke Kenneth Casson Leighton 2023-10-09 10:18:01 BST
also bear in mind, if you have s[i+1] = s[i] then as long as
you have 64 bit regs (no elwidth) you can do sv.whatever *r1, *r0, ...

that works with 1<<r3 as well so sv.whatever/dm=1<<r3 *r1, *r0

and works with REMAP just as well.
Comment 42 Luke Kenneth Casson Leighton 2023-10-09 10:22:43 BST
(In reply to Jacob Lifshay from comment #39)
> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > but just going straight to something inefficient (such as the
> > loop-unrolled mul256 algorithm you wrote, although it gets us
> > one incremental step ahead), this is *not* satisfying the conditions
> > of the grant.
> 
> which definition of efficiency are you using?

the one that meets customer requirements which i repeated many times:
top priority on code size. number of regs second.

it is down to the hardware to merge VF and HF elements into
 "issue batches".  which is here repeatedly everyone including
you keeps assuming VF is incapable of doing that "thrrefore it
musy be inefficient performance wise". this assumption is
100% false.

(In reply to Jacob Lifshay from comment #40)

> algorithm in a relatively short amount of time, I am not very confident I
> can write a vertical-first divmod at all and would probably need to go by
> way of writing a horizontal-first algorithm and then transforming it into a
> vertical-first algorithm anyway.

i don't have a problem with that at all. good call, go for it.
Comment 43 Luke Kenneth Casson Leighton 2023-10-09 13:19:04 BST
btw just be aware, on the 3-in 2-out instructions with implicit RS,
the pairs (hi lo) are put into separate blocks.
RT (lo half) is consecutive, RS (hi half) starts at RT+MAXVL.

can i suggest not worrying about how many regs are used, first.
there are... enough :)
Comment 44 Jacob Lifshay 2023-10-10 05:26:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jacob Lifshay from comment #39)
> > (In reply to Luke Kenneth Casson Leighton from comment #37)
> > > but just going straight to something inefficient (such as the
> > > loop-unrolled mul256 algorithm you wrote, although it gets us
> > > one incremental step ahead), this is *not* satisfying the conditions
> > > of the grant.
> > 
> > which definition of efficiency are you using?
> 
> the one that meets customer requirements which i repeated many times:
> top priority on code size. number of regs second.

ok.
> 
> it is down to the hardware to merge VF and HF elements into
>  "issue batches".  which is here repeatedly everyone including
> you keeps assuming VF is incapable of doing that "thrrefore it
> musy be inefficient performance wise".

I was basing my efficiency claims on both:
* the complexity I expect will be required to get a vertical-first divmod to work at all. I fully expect it to take *more* (and more complex) instructions than the horizontal-first version, because afaict it doesn't cleanly map to VF mode. this is bad for both code size and power and probably performance.
* it will most likely require lots of dynamic predicates (more than just 1<<r3) with *large* amounts of bits that are zeros, this inherently is rather inefficient from a performance perspective, because I'm assuming either:
  * the predicate will have to be handed to the decode pipe
    before the predicated operations can be issued. this is
    bad for performance because you're forced to stall the
    entire fetch/decode pipe for several cycles while waiting
    for the predicate to be computed.
  * the predicate is not known at decode/issue time, so the
    full set of element operations are issued, potentially
    blocking issue queues, only to later find out that most
    of them were wasted. this is bad for both power and
    performance.
    the predicate not being known at issue time also means
    that propagating results to registers and/or any following
    instructions is also blocked for any instructions that
    use twin-predication, since the cpu needs to wait until
    it knows which registers to write to.
Comment 45 Jacob Lifshay 2023-10-10 05:34:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #42)
> i don't have a problrm with that at all. good call, go for it.

Thanks!

I did add a system so the user can adjust which registers and vector sizes are chosen, as you requested in comment #27 -- it doesn't need a register allocator, since the *user* is the register allocator :) all it does is some cursory checks to ensure the passed-in registers aren't completely borked (e.g. where an assignment is omitted by having both the source and dest be the same register, I assert that they actually are the same register).

I added an early WIP assembly version, as well as added register assignments to the python version for logging.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=d7c8086452d26a17a6643cb7858fa1cfa51416b9

commit d7c8086452d26a17a6643cb7858fa1cfa51416b9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 21:05:17 2023 -0700

    add WIP Knuth's algorithm D assembly

commit f1f40a302cd40dfd953cd71e438ed9f9321562a5
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:38:30 2023 -0700

    divmod: assign registers to variables

commit 49ab8a331bce33def6cd21c637845adfe6c38b95
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:35:40 2023 -0700

    adapt divmod algorithm for putting variables in registers

commit b134f188788c1dcb5614244cb4f6fdceb645c543
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 20:30:43 2023 -0700

    add more features for _DivModRegsRegexLogger

commit b7be514750e59adbaf3fe543d2977d81f16fd322
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:25:43 2023 -0700

    finish moving Knuth algorithm D into a class

commit 592e2bfeb196ed06a92f476e936e6322ef91a945
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:21:40 2023 -0700

    merely indent function
    
    [skip ci]

commit 015c71a871760955a9e9588720ef7ad8b3cd258d
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:19:25 2023 -0700

    start adding DivModKnuthAlgorithmD class

commit 136e6f26ae6296d781e2a5d4ca4342a14cff2158
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 9 18:18:28 2023 -0700

    format code
Comment 46 Luke Kenneth Casson Leighton 2023-10-10 08:00:09 BST
(In reply to Jacob Lifshay from comment #44)
> (In reply to Luke Kenneth Casson Leighton from comment #42)
> > (In reply to Jacob Lifshay from comment #39)
> > > (In reply to Luke Kenneth Casson Leighton from comment #37)
> > > > but just going straight to something inefficient (such as the
> > > > loop-unrolled mul256 algorithm you wrote, although it gets us
> > > > one incremental step ahead), this is *not* satisfying the conditions
> > > > of the grant.
> > > 
> > > which definition of efficiency are you using?
> > 
> > the one that meets customer requirements which i repeated many times:
> > top priority on code size. number of regs second.
> 
> ok.
> > 
> > it is down to the hardware to merge VF and HF elements into
> >  "issue batches".  which is here repeatedly everyone including
> > you keeps assuming VF is incapable of doing that "thrrefore it
> > musy be inefficient performance wise".
> 
> I was basing my efficiency claims on both:
> * the complexity I expect will be required to get a vertical-first divmod to
> work at all. I fully expect it to take *more* (and more complex)
> instructions than the horizontal-first version, because afaict it doesn't
> cleanly map to VF mode. this is bad for both code size and power and
> probably performance.
> * it will most likely require lots of dynamic predicates (more than just
> 1<<r3) with *large* amounts of bits that are zeros,

again: please please please, patiently i repeat: please
*stop* making assumptions about what the hardware is capable of.

the assumption that you are making is one that a *SIMD* architecture
has because the ISA directly connects to the back-end hardware.

this is *not true* with Simple-V.

> this inherently is
> rather inefficient from a performance perspective, because I'm assuming
> either:
>   * the predicate will have to be handed to the decode pipe
>     before the predicated operations can be issued. this is
>     bad for performance

which is *not the focus of this research*

>    because you're forced to stall the
>     entire fetch/decode pipe for several cycles while waiting
>     for the predicate to be computed.

*only on naive implementations*

>   * the predicate is not known at decode/issue time, so the
>     full set of element operations are issued, potentially
>     blocking issue queues,

*only on naive implementations*


>     the predicate not being known at issue time also means
>     that propagating results to registers and/or any following
>     instructions is also blocked for any instructions that
>     use twin-predication, since the cpu needs to wait until
>     it knows which registers to write to.

only on underresourced implementations.

it basically is not your problem "as an assembler writer" to
worry about what hardware does, because there is no direct connection
(unlike in a SIMD ISA).

this is something it seems that literally everyone who has ever used
SIMD has to unlearn, including you.

please completely and utterly forget, stop talking about, stop 2nd
guessing and stop trying to assess, any kind of hardware performance
*completely*, and just trust the hardware, and focus exclusively on
getting program size down as compact as possible, ok?
Comment 47 Jacob Lifshay 2023-10-10 08:45:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #46)
> the assumption that you are making is one that a *SIMD* architecture
> has because the ISA directly connects to the back-end hardware.

no, i was thinking of vertical-first mode, not SIMD. I do know how it works, having read your explanations for years and having read about how Mitch Alsup's cpu design works.

> it basically is not your problem "as an assembler writer" to
> worry about what hardware does, because there is no direct connection
> (unlike in a SIMD ISA).

it *is* my problem, because:
* the code I write directly determines code size, which is what we're optimizing for here.
* the data-flow graph of the code I write directly determines minimum latency even if the cpu has effectively infinite resources. you can't escape Amdahl's law. not even if you have a monster cpu with all the fanciest SV features.
Comment 48 Jacob Lifshay 2023-10-10 08:53:31 BST
I'm dropping the conversation about performance, it's a distraction and less arguing means less stress for luke.

For code size, we can wait and see.
Comment 49 Luke Kenneth Casson Leighton 2023-10-10 09:14:52 BST
(In reply to Jacob Lifshay from comment #47)
> (In reply to Luke Kenneth Casson Leighton from comment #46)
> > the assumption that you are making is one that a *SIMD* architecture
> > has because the ISA directly connects to the back-end hardware.
> 
> no, i was thinking of vertical-first mode, not SIMD. I do know how it works,
> having read your explanations for years and having read about how Mitch
> Alsup's cpu design works.

... but you then go on to make assumptions about the hardware
implementation, which is completely inappropriate to do given
that there will be *multiple* implementations, of varying
capability

in amongst those, the *only* thing that is 100% guaranteed
is that "smaller code size results in lower power consumption".

therefore you *must* stop limiting your thinking based on the
*assumption* that *we* will implement *only one* piece of
hardware, and that that one piece of hardware will have
"some limitation or other on performance if feature X of SVP64
is utilised".

the entire purpose of the entire grants every single one is to
prove the *ISA* not the implementations *of* the ISA.



> * the data-flow graph of the code I write directly determines minimum
> latency even if the cpu has effectively infinite resources. you can't escape
> Amdahl's law. not even if you have a monster cpu with all the fanciest SV
> features.

again: *not your damn problem*!!! :)

that is the *hardware implementor's* problem to deal with, not
yours, right now.

i asked you not to second-guess the hardware implementation(s)
and you did exactly that only 5 minutes later!

(In reply to Jacob Lifshay from comment #48)
> I'm dropping the conversation about performance, it's a distraction 

yes it is. 100%.

we are proving - and demonstrating - the *ISA* here,
not implementations *of* the ISA, ok? we simply cannot
guess what *future* implementors will come up with that
solve every single one of every single "objection related
to performance". therefore we *must* drop it, stone-cold,
as "100% not our problem".

exactly like the Khronos Group does not specify performance in the
Certification of a given implementation.  only "do the tests pass,
yes, then you got your Compliance Certificate".
Comment 50 djac 2023-10-10 09:33:05 BST
As a hardware guy  :-)

I am very conscious that, just as in the software world there are "ways of doing things" that a hardware person won't understand or appreciate, there are hardware tricks that will run rings round the software, and cannot be guessed at.

Beyond a cursory look at what the hardware approach might be, it is not efficient to second-guess the hardware implementation until RED has a hardware team which can interact on these ideas.

Don't lose anything clever but, as I have said before, save it for future thinking.
Comment 51 Luke Kenneth Casson Leighton 2023-10-10 10:37:29 BST
(In reply to djac from comment #50)

> Beyond a cursory look at what the hardware approach might be it is not
> efficient to second guess the hardware implementation until RED has a
> hardware team which can interact on these ideas.

not just inefficient but actually damaging because they will not
have any examples to demonstrate why *their* misunderstandings of
the full features are wrong (yes, the hardware team will have to learn
SVP64 "on-the-hoof").

we *need* examples that push the hardware implementors out of their
SIMD-based assumptions.

> Dont loose anything clever but as I have said before save them for when
> future thinking.

yes, this is part of an ISA Designer's responsibility to think
through *all* possible implementations (exactly like how the
Khronos Group Members must do).

as the person directly responsible for SVP64 and having been on it
100% full-time i had to do this (partly in my head, partly written).
it will be years if not decades before all "hardware tricks" emerge.

a SIMD ISA on the other hand, the only real "tricks" that they can
apply are:

1. multi-issue (both ARM NEON and IBM POWER9/10 do this)

2. Cray-conceptualised vector chaining *on elements in the SIMD registers*
   (the only recent company to publicly reveal they did this
   was AMD, with their AVX512 implementation.)

we have way more options open than just those two, thanks to the
clean "abstracted" API, but i say "we" when i mean *all* future
implementors of SVP64 (not just RED, in the distant future).
Comment 52 Jacob Lifshay 2023-10-11 06:15:01 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=5f0468b5f432804b8e6d201f2e38bdd52fdce612

commit 5f0468b5f432804b8e6d201f2e38bdd52fdce612
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:07:17 2023 -0700

    WIP divmod: implemented division by single word

commit b43960a84efcb0386c07281961d439167ae8a3e7
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:05:49 2023 -0700

    support ignoring integer registers for ExpectedState

commit 147e028182d01f9e3f698cdffef7d1442c8302ff
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Oct 10 22:03:38 2023 -0700

    convert assigned values to SVSHAPE when writing to SVSHAPE[0-3] SPRs
    
    this makes mtspr SVSHAPE0, reg properly maintain ISACaller invariants
Comment 53 Jacob Lifshay 2023-10-11 06:27:10 BST
one annoying thing I encountered while implementing DivModKnuthAlgorithmD is setting up a REMAP to reverse element order -- to implement the simple python loop where denom_size = len(v):

n = denom_size

# get non-zero length of divisor
while n > 0 and v[n - 1] == 0:
    n -= 1

I had to do all of:

setvl 0, 0, {denom_size}, 0, 1, 1
addis 0, 0, {svshape_high}
ori 0, 0, {svshape_low}
mtspr {SVSHAPE0}, 0 # mtspr SVSHAPE0, 0
svremap 0o01, 0, 0, 0, 0, 0, 0  # enable SVSHAPE0 for RA
sv.cmpi/ff=ne *0, 1, *{v}, 0
setvl {n_scalar}, 0, 1, 0, 0, 0 # getvl {n_scalar}
subfic {n_scalar}, {n_scalar}, {denom_size}

if we had reverse gear for data-dependent fail-first, it would be 3 instructions:
setvl 0, 0, {denom_size}, 0, 1, 1
sv.cmpi/ff=ne/mrr *0, 1, *{v}, 0
setvl {n_scalar}, 0, 1, 0, 0, 0 # getvl {n_scalar}

(I also had to fix the simulator for mtspr SVSHAPE*, which was not keeping the SVSHAPE* SPRs as SVSHAPE instances; I changed SPR.__setitem__ to convert to SVSHAPE)
Comment 54 Luke Kenneth Casson Leighton 2023-10-11 06:33:17 BST
(In reply to Jacob Lifshay from comment #53)
> one annoying thing I encountered while implementing DivModKnuthAlgorithmD is
> setting up a REMAP to reverse element order

use /mrr.
Comment 55 Luke Kenneth Casson Leighton 2023-10-11 06:35:29 BST
(In reply to Jacob Lifshay from comment #53)
subfic {n_scalar}, {n_scalar}, {denom_size}
> 
> if we had reverse gear for data-dependent fail-first, it would be 3
> instructions:
> setvl 0, 0, {denom_size}, 0, 1, 1
> sv.cmpi/ff=ne/mrr *0, 1, *{v}, 0
> setvl {n_scalar}, 0, 1, 0, 0, 0 # getvl {n_scalar}

sorry, too early in the morning. still blinking. let me check the spec.
this is exactly the kind of thing that these research grants are
here to find.

> (I also had to fix the simulator for mtspr SVSHAPE*, which was not keeping
> the SVSHAPE* SPRs as SVSHAPE instances, I changed SPR.__setitem__ to convert
> to SVSHAPE)

awesome.
Comment 56 Luke Kenneth Casson Leighton 2023-10-11 06:44:48 BST
https://libre-soc.org/openpower/sv/cr_ops/

## Format

SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:

|6 | 7 |19:20|21 | 22:23   |  description     |
|--|---|-----|---|---------|------------------|
|/ | / |0  0 |RG | dz  sz  | simple mode                      |
|/ | / |1  0 |RG | dz  sz  | scalar reduce mode (mapreduce) |
|zz|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode      |
|/ |SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

nope, you're out of luck. not enough bits. that's annoying.
in *theory* it could be done... oh hang on! look, bit 6 is free!
it won't be perfect, ddffirst on CR-Field ops (3) would not be possible
but on CR-*bit* ops (5) such as cmpi not a problem.

okaay bugreport time, done as bug #1183
it'll need a TODO list involving
* spec
* powerdecoder
* insndb
* ISACaller unit tests
* binutils
* binutils unit tests

a lot of work nowadays.
Comment 57 Jacob Lifshay 2023-10-12 04:54:21 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=aff7be153e28891f6873b89802d9af8fa5e6907c

commit aff7be153e28891f6873b89802d9af8fa5e6907c
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Oct 11 20:29:51 2023 -0700

    WIP getting asm version of knuth's algorithm d working
Comment 58 Jacob Lifshay 2023-10-13 23:15:57 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=8d6113697dfce5ca482dc7cdd83d631992332a6b

commit 8d6113697dfce5ca482dc7cdd83d631992332a6b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Oct 13 15:13:15 2023 -0700

    WIP divmod: finished writing out asm knuth's algorithm d, still buggy
Comment 59 Jacob Lifshay 2023-10-16 04:41:39 BST
I ran into an issue: I'm trying to use:
svremap 0o10, 0, 0, 0, 0, 0, 0  # enable SVSHAPE0 for RT
# q[j] = qhat
sv.or 4, 12, 12  # note all of RT, RA, and RB are scalar

I expected the REMAP in SVSHAPE0 to apply to RT, but it seems like it doesn't. I couldn't find documentation one way or the other... (I saw documentation that says REMAP doesn't apply to scalar instructions, but I think that just means instructions without a SV-prefix)

I think REMAP should apply to scalar operands when there's a sv-prefix (but not sv-single-prefix). This seems useful to me so you can e.g. extract the 2nd element in a byte vector by using SVSHAPE.offset or other times you'd want to use a REMAP-ped index.

For now, I'll use sv.or/m=1<<r3 *4, 12, 12
but that workaround forces me to put the index in r3 which means I can't keep anything else there for that instruction.
Comment 60 Jacob Lifshay 2023-10-16 05:35:30 BST
I got the assembly version of Knuth's algorithm D to work!

Next, I plan on getting the powmod to work, and then optimizing the divmod somewhat, since it probably has a few redundant instructions. After that, I think I'll work on the multiply remap, and then, if we still have enough time, attempt the vertical-first Knuth's Algorithm D (I don't expect to have enough time to finish it due to lack of budget).

running one divmod using the simulator takes 4.81s (I'd randomly guess 30% is just for startup e.g. insndb parsing) on my computer when calculating:
n = 0x5b_16ef_a3ca_8698_fa0f_0509_d80d_6a81_0ea7_baf2_fe8d_f34a_de28_f881_e13c_b9d3_7007_2241_a954_4276_d9c1
d = 0x7_c54b_a969_8413_d194_f6a7_154d_2d7d_fe3a_6783_4859_c671_0374
q, r = divmod(n, d)
q == 0xb_b8e2_9e3f_8a67_25bd_c45b_f2a4_2f7a_15b4
r == 0x5_8e2f_6500_9eb0_8011_4645_c75a_e598_dd10_9ee6_3d1f_846f_e831

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=4dcec3a630e4e082447230445ef31e24245cb6aa

commit 4dcec3a630e4e082447230445ef31e24245cb6aa
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 21:17:09 2023 -0700

    add comments telling people to keep the asm and python versions in sync

commit 88e79ac8d5758f225703750b6c2a3bd6130b6673
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 21:06:02 2023 -0700

    divmod: asm version of Knuth's algorithm D works!

commit 2f2a215a11d9bbb84671253d3340dcf2654382e2
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 20:59:36 2023 -0700

    add labels to DIVMOD REGEX log

commit b3dc024833e16218399b958360243bbedfe4a51e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Sun Oct 15 20:54:12 2023 -0700

    put DIVMOD REGEX under new LogKind: LogKind.OutputMatching
Comment 61 Jacob Lifshay 2023-10-17 06:48:29 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=685380aa5d008dbd64d22a79332f282e9a849dca

commit 685380aa5d008dbd64d22a79332f282e9a849dca
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 22:22:50 2023 -0700

    powmod asm tests pass!

commit c4783587012052e5a80a29f5efc4a7a27c051943
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 16:09:07 2023 -0700

    switch powmod to using new divmod implementation -- test not enabled yet

commit 34eb127cf2889c6f2c7713e782a69cd644824b95
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 16:08:44 2023 -0700

    call log, not print

commit 030ba8e67010e9f814edbc67a15190ca913e5ff0
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Oct 16 15:39:17 2023 -0700

    make parser work for pypy
    
    pypy's ast classes' constructors seem to only work if all arguments or
    no arguments are passed, not only some of them.
Comment 62 Luke Kenneth Casson Leighton 2023-10-17 08:49:30 BST
(In reply to Jacob Lifshay from comment #59)
> I ran into an issue: I'm trying to use:
> svremap 0o10, 0, 0, 0, 0, 0, 0  # enable SVSHAPE0 for RT
> # q[j] = qhat
> sv.or 4, 12, 12  # note all of RT, RA, and RB are scalar
> 
> I expected the REMAP in SVSHAPE0 to apply to RT, but it seems like it
> doesn't.

no, it doesn't. this is deliberate.

> I think REMAP should apply to scalar operands when there's a sv-prefix (but
> not sv-single-prefix).

no.

> This seems useful to me so you can e.g. extract the
> 2nd element in a byte vector by using SVSHAPE.offset or other times you'd
> want to use a REMAP-ped index.
> 
> For now, I'll use sv.or/m=1<<r3 *4, 12, 12

yes that's how to index a single element.

> but that workaround forces me to put the index in r3 which means I can't
> keep anything else there for that instruction.

indeed.
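
(pulling the workaround together for later readers -- register numbers
exactly as in the quoted snippet, with setvl and the rest of the loop
assumed:)

    # q[j] = qhat: qhat is scalar in r12, q is the vector starting at r4,
    # and the element index j must live in r3 because of the 1<<r3 predicate
    sv.or/m=1<<r3 *4, 12, 12   # only element j of the destination is written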
Comment 63 Jacob Lifshay 2023-11-30 06:21:13 GMT
(In reply to Jacob Lifshay from comment #61)
> https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;
> h=685380aa5d008dbd64d22a79332f282e9a849dca
> 
> commit 685380aa5d008dbd64d22a79332f282e9a849dca
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Mon Oct 16 22:22:50 2023 -0700
> 
>     powmod asm tests pass!

so, now that this is all working, I think we can push all improvements of the bigint mul algorithm into the bigint mul remap bug #1155 and declare this bug complete and merge the branch.

Also, I think we should assign some money to luke for the review and feedback (and also because he needs money).

Luke, what do you think?
Comment 64 Luke Kenneth Casson Leighton 2023-11-30 06:59:38 GMT
(In reply to Jacob Lifshay from comment #63)
> (In reply to Jacob Lifshay from comment #61)

> >     powmod asm tests pass!
> 
> so, now that this is all working, I think we can push all improvements of
> the bigint mul algorithm into the bigint mul remap bug #1155 

ok.

> and declare
> this bug complete and merge the branch.

ok

> Also, I think we should assign some money to luke for the review and
> feedback (and also because he needs money).
> 
> Luke, what do you think?

good idea. the bigmul remap is quite straightforward; there is a
step-by-step procedure to follow. let's get RFPs in and move to
bug #1155, and examine the assembler here for patterns.

ed25519 also has patterns but they are carry-save ones (modulo wrap
on the index sums, (x+y)%N rather than just x+y)
Comment 65 Jacob Lifshay 2023-11-30 07:07:23 GMT
(In reply to Luke Kenneth Casson Leighton from comment #64)
> > and declare
> > this bug complete and merge the branch.
> 
> ok

don't rush ahead and declare everything complete before I merge...currently running tests on rebased branch.
Comment 66 Luke Kenneth Casson Leighton 2023-11-30 07:15:28 GMT
NLnet asked last week to link to CI test logs for review.
do you have them?
not the full log, just the unit test.
not "upload 100 MB of crap as an attachment, overwhelming the VM",
just "where is the CI test".
Comment 67 Luke Kenneth Casson Leighton 2023-11-30 07:17:32 GMT
(In reply to Jacob Lifshay from comment #65)
ok
> 
> don't rush ahead and declare everything complete before I merge...currently
> running tests on rebased branch.

ack, leave it to you. can you cross-reference the link to the CI log, just the one
that has the powmod test; if it's not separate it may need a line number.
Comment 68 Jacob Lifshay 2023-11-30 07:19:40 GMT
(In reply to Luke Kenneth Casson Leighton from comment #66)
> NLnet asked ast week to link to CI test logs for review.
> do you have them?
> not the full log just the unit test.

I don't have logs for just one unit test; CI runs everything and doesn't split stuff out like that (though it does spit out error messages that are delimited in the log).

> not "upload 100 MB of crap as attachment overwhelming the VM"
> just, "where s the CI test"

I can link to debian salsa, which is nicer to look at, but the logs will go away eventually.
Alternatively, I can link to the commit in the git repo of logs I'm keeping (aren't you glad I did that...) on my build server, but it doesn't have handy html pages; it's just a git repo you can clone.
Comment 69 Jacob Lifshay 2023-11-30 07:24:47 GMT
(In reply to Jacob Lifshay from comment #65)
> don't rush ahead and declare everything complete before I merge...currently
> running tests on rebased branch.

well, git majorly messed up the rebase, so I tried git merge which is also messed up...investigating...
Comment 70 Jacob Lifshay 2023-11-30 08:21:36 GMT
(In reply to Jacob Lifshay from comment #69)
> (In reply to Jacob Lifshay from comment #65)
> > don't rush ahead and declare everything complete before I merge...currently
> > running tests on rebased branch.
> 
> well, git majorly messed up the rebase, so I tried git merge which is also
> messed up...investigating...

ok, fixed it afaict; it was the LogKind -> LogType rename getting in there and messing stuff up.

running tests is taking forever (after 8 min the full powmod test is only about 1/4 complete), so I pushed the newly-rebased branch to programmerjake/powmod and will wait for CI while I do other stuff. I also manually pushed to the mirror so I don't have to wait 24hr(!).

https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704
Comment 71 Luke Kenneth Casson Leighton 2023-11-30 09:28:11 GMT
(In reply to Jacob Lifshay from comment #70)
> (In reply to Jacob Lifshay from comment #69)
> > (In reply to Jacob Lifshay from comment #65)
> > > don't rush ahead and declare everything complete before I merge...currently
> > > running tests on rebased branch.
> > 
> > well, git majorly messed up the rebase, so I tried git merge which is also
> > messed up...investigating...
> 
> ok, fixed it afaict, it was the LogKind -> LogType rename getting in there
> to mess stuff up.

(i always rebase regularly to avoid that)

> running tests is taking forever (8min only 1/4 completes the full powmod
> test), 

it'll need moving to "long" tests.

> so I pushed the newly-rebased branch to programmerjake/powmod and
> will wait for CI while I do other stuff. I also manually pushed to the
> mirror so I don't have to wait 24hr(!).
> 
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704

hmm hmm hmm we will need something more focussed in future, running
just the one unit test, not several, so that the log is targeted at
a given bugreport.

otherwise we have to provide a reproducible build script *for every bug
for every RFP* (!!)

ok i am seeing "passed" on 116 searches for "powmod" which in my 
(not-having-been-writing-the-tests) mind seems excessive but i know
you do very comprehensive and thorough tests so am perfectly
happy with them.
Comment 72 Jacob Lifshay 2023-11-30 09:33:36 GMT
(edit: trim context)
(In reply to Luke Kenneth Casson Leighton from comment #71)
> it'll need moving to "long" tests.

yeah, I remember it taking much less time before I rebased...

> ok i am seeing "passed" on 116 searches for "powmod" which in my 
> (not-having-been-writing-the-tests) mind seems excessive but i know
> you do very comprehensive and thorough tests so am perfectly
> happy with them.

well, it mentions each test twice, once when it starts and once when it finishes. also, most of those tests are actually testing sub-pieces of powmod, e.g. I have a pile of tests for divmod to ensure I actually hit all the corner cases (otherwise how would I know if it's correct? it's too complex for me to confidently say it's bug-free without them). the few tests testing the full powmod algorithm are the ones that take 30min.
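
(for context, those sub-piece checks have roughly this shape -- a
hypothetical sketch, with a made-up `bigint_divmod` helper standing in
for the real test classes, compared against Python's own divmod on
corner-case operands:)

    CORNER_CASES = [
        (0, 1),                            # zero dividend
        (1, 1),                            # equal operands
        ((1 << 256) - 1, (1 << 64) - 1),   # all-ones operands
        ((1 << 256) - 1, 1 << 192),        # divisor with only its top word set
    ]

    def check_divmod(bigint_divmod):
        for n, d in CORNER_CASES:
            assert bigint_divmod(n, d) == divmod(n, d)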
Comment 73 Luke Kenneth Casson Leighton 2023-11-30 09:37:57 GMT
(we also cannot send them a link where, when they open it, it literally
overwhelms their machine, taking MINUTES to load due to MASSIVE javascript
processing time in addition to a long download time). some... refining
to do, here :)
(one more task for a future grant, sigh)
Comment 74 Luke Kenneth Casson Leighton 2023-11-30 09:39:50 GMT
(In reply to Jacob Lifshay from comment #72)

> well, it mentions each test twice, once when it starts it and once when it
> finishes it. also, most of those tests are actually testing sub-pieces of
> powmod,

aaawwesoome, this is *exactly* the technique i'd have followed, but also
what i want shriya and sadoon to learn. otherwise it is "overwhelm"
because assembler is just that hard.
Comment 75 Luke Kenneth Casson Leighton 2023-11-30 09:40:24 GMT

    
Comment 76 Jacob Lifshay 2023-11-30 22:59:53 GMT
(In reply to Jacob Lifshay from comment #70)
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4983704

it still didn't finish after running 6hr! (mostly due to pytest's terrible scheduling decisions: it likes to run a lot of tests serially on one core towards the end, and powmod just happens to be at the end with some very long tests)

I made some changes that should speed up CI a bit; unfortunately, profiling shows about 70% of the runtime is taken up by nmigen simulation, which is relatively slow.

I think the biggest change that will probably speed up CI is renaming the test file so pytest schedules it earlier (allowing more parallelism); I'll try that next.

commit 22c2fd82ddb9f4383eaf5bc867779d6b6845fcd4
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:55:08 2023 -0800

    pytest: try to improve scheduling

commit 795cd9cba2882b0dd80c4fb93c984501e9f30cf4
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:54:05 2023 -0800

    test_caller_svp64_powmod: rename to test_aaa_caller_svp64_powmod so pytest tries to run it earlier

commit 0280a1ead801545c10b72829092dd1dc65aaf054
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:21:36 2023 -0800

    speed up CI by disabling logging and VCD generation

commit 489c777a2f0abf9fee2899cc1ceb59aec0a97d22
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Nov 30 14:20:54 2023 -0800

    test/runner: allow disabling VCD using SIM_NO_VCD=1 env var
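
(the SIM_NO_VCD commit boils down to an env-var gate along these lines --
a sketch with hypothetical names, not the actual test/runner diff:)

    import os

    def want_vcd():
        # VCD generation stays on unless SIM_NO_VCD=1 is set in the environment
        return os.environ.get("SIM_NO_VCD", "0") != "1"

    # usage in a runner: only pass a vcd filename to the simulator if wanted
    #   vcd_name = "test.vcd" if want_vcd() else None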
Comment 77 Jacob Lifshay 2023-11-30 23:21:05 GMT
tests currently running on:
https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4986961

looks like pytest decided to schedule all the long powmod tests in parallel this time! (TestPowMod43-45 iirc)

I'll merge once they pass.
Comment 78 Jacob Lifshay 2023-12-01 03:46:53 GMT
(In reply to Jacob Lifshay from comment #77)
> tests currently running on:
> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4986961
> 
> looks like pytest decided to schedule all the long powmod tests in parallel
> this time! (TestPowMod43-45 iirc)
> 
> I'll merge once they pass.

they passed and I merged to master. the 7 test failures are unrelated and expected.

to see them:
> git clone "https://build.libre-soc.programmerjake.tk/build-archive.git" build-archive
> cd build-archive
> git checkout c1ecc0f936a992ca86497e0cb40c41689b609a61
> less -R pipelines/608110/job-4986961-log.txt

type /svp64_powmod and press enter; it will show you all the powmod tests, and pressing n goes to the next occurrence.
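
or, to pull out just the powmod lines instead of paging through the whole log (same file as above):
> grep -n svp64_powmod pipelines/608110/job-4986961-log.txt | less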
Comment 79 Luke Kenneth Casson Leighton 2023-12-09 09:20:28 GMT
rats, forgot about bug #1183: the locations where the assembler is
sub-optimal (5 instructions instead of 1) need to be cross-referenced,
and a bugreport raised if needed.  the important thing here is to document
what was discovered. implementation can come later (a different bug,
or even a different grant), but documenting and capturing the advances
is *really* important.
Comment 80 Jacob Lifshay 2023-12-12 02:44:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #79)
> rats forgot about bug #1183, the locations where the assembler is
> sub-optimal (5 instructions instead of 1)

ok, i can't recall all the locations, but I think the motivating examples are the spots where Knuth's Algorithm D needs inputs with the most-significant word being non-zero, so we have to find the number of leading zero words in our fixed-length inputs, so that we know how many words to tell Algorithm D are in the dynamically-sized inputs (basically, we just make Algorithm D ignore all the high-order zero words and act as if the inputs are really shorter than they are).

so, count-leading-zeros, except we're counting zero words, not bits.
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=7fc794685bebb1f3c2451c64da041a0e81143e29;hb=d3c7a0ddc169877e8daf38a347091c366e85624b#l717
and
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=7fc794685bebb1f3c2451c64da041a0e81143e29;hb=d3c7a0ddc169877e8daf38a347091c366e85624b#l734

I'll link bug #1183 to this comment, so I think that's all we need.
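
(in Python the step being described is just this -- a sketch, not the
SVP64 assembly at the links above:)

    def leading_zero_words(words):
        # count the zero words at the most-significant end of a
        # fixed-length little-endian word array
        n = 0
        for w in reversed(words):
            if w != 0:
                break
            n += 1
        return n

    # Algorithm D is then told the effective (shorter) length:
    #   effective_len = len(words) - leading_zero_words(words)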
Comment 81 Luke Kenneth Casson Leighton 2023-12-12 05:24:10 GMT
(In reply to Jacob Lifshay from comment #80)

> ok, icr all the locations, but I think the motivating examples are the spots
> where Knuth's Algorithm D needs inputs with the most-significant word being
> non-zero, so we have to find 

"have to"... remember what i said about not doing performance
optimisations and keeping the assembler brutally short?

especially if it makes a mess. but also, finding out that this was
a shortcoming *after the bug is closed* is quite a serious infraction
of the milestone's purpose and could be flagged by an Auditor!

can you please immediately document ALL such findings as an
absolute top priority, as it is literally the entire purpose
of this entire grant (2021-02A-051), which you should know full well,
as i have repeated it to you at least eight times in three years!

sorry to have to tell you off on that but i am a little taken aback
Comment 82 Jacob Lifshay 2023-12-12 05:29:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #81)
> (In reply to Jacob Lifshay from comment #80)
> 
> > ok, icr all the locations, but I think the motivating examples are the spots
> > where Knuth's Algorithm D needs inputs with the most-significant word being
> > non-zero, so we have to find 
> 
> "have to"... remember what i said about not doing performance
> optimisations and keeping the assembler brutally short?

it isn't a performance optimization, it is required for correctness of Knuth's Algorithm D.
Comment 83 Luke Kenneth Casson Leighton 2023-12-12 08:11:10 GMT
(In reply to Jacob Lifshay from comment #82)

> it isn't a performance optimization, it is required for correctness of
> Knuth's Algorithm D.

ahh okaay. it's been a while, but i *think* i know where you mean. it should
work now; this unit test is confirmed working:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dd_ffirst.py;h=c371830cf#l70

still, documenting these (inspired) changes needed to SVP64 is
critical; sorry for not making it really, really explicitly clear as a
task on the TODO list.
Comment 84 Jacob Lifshay 2023-12-19 03:28:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #83)
> still documenting these (inspired) changes needed to SVP64 is
> critical, sorry for not making it really really explicitly clear as a
> task on the TODO list.

the change is documented in the comment I linked to from https://bugs.libre-soc.org/show_bug.cgi?id=1183#c21

I also opened bug #1237 to track the needed change to the divmod assembly (not part of this bug).

that's everything that i recall, so I'm closing this.
Comment 85 Luke Kenneth Casson Leighton 2023-12-19 08:12:00 GMT
(In reply to Jacob Lifshay from comment #84)

> the change is documented in the comment I linked to from
> https://bugs.libre-soc.org/show_bug.cgi?id=1183#c21

i meant: it goes on a *wiki* page, in an explicit
summary report. i *think* i have it implemented, but seeing
it being used, and the implementation linking back to the
feature, is incredibly important.

> I also opened bug #1237 to track the needed change to the divmod assembly
> (not part of this bug).
> 
> that's everything that i recall, so I'm closing this.

good stuff.