Bug 1044 - SVP64 implementation of pow(x,y,z)
Summary: SVP64 implementation of pow(x,y,z)
Status: IN_PROGRESS
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Jacob Lifshay
URL:
: 1153 (view as bug list)
Depends on: 1155 1161 1175
Blocks:
  Show dependency treegraph
 
Reported: 2023-03-28 16:33 BST by Luke Kenneth Casson Leighton
Modified: 2023-09-27 11:31 BST (History)
4 users (show)

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 589
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
jacob=2000


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2023-03-28 16:33:30 BST
a demo implementation of a modular exponentiation function
is needed as a unit test to be run by pypowersim and/or
ISACaller

primary uses are for Diffie-Hellman and RSA but there
are many more

https://github.com/TOPDapp/py-diffie-hellman/blob/master/diffiehellman/diffie_hellman.py

--

* TODO: raise bignum REMAP bugreport (jacob) and link here, move
        the following into the bugreport:
        - cross-reference to the svshape table "12 REMAP modes"
          so i can begin advising where to put it
          https://libre-soc.org/openpower/sv/remap/#svshape
        https://bugs.libre-soc.org/show_bug.cgi?id=1155
* WIP: extremely simple version as test_caller_svp64_pow.py
Comment 1 Jacob Lifshay 2023-03-28 16:42:21 BST
this is already partially completed by my work on bigint-presentation-code (bug #942), which is currently blocked on compiler support for svp64 since luke correctly observed that the diy compiler backend in there is a major distraction. the code can be easily converted to emitting llvm ir or similar once svp64 is added to llvm.

bug #942 needs to have the subtasks of this bug split out.
Comment 2 Jacob Lifshay 2023-03-28 16:43:21 BST
rsa uses a non-prime modulus, changing title
Comment 3 Luke Kenneth Casson Leighton 2023-03-28 23:20:28 BST
(In reply to Jacob Lifshay from comment #2)
> rsa uses a non-prime modulus, changing title

ah good catch i totally missed that.

need to find algorithms now, shorter more compact
abd fitting into as few instructions as possible
without being overly long completion time is fine.
Comment 4 Jacob Lifshay 2023-03-29 00:48:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > rsa uses a non-prime modulus, changing title
> 
> ah good catch i totally missed that.
> 
> need to find algorithms now, shorter more compact
> abd fitting into as few instructions as possible
> without being overly long completion time is fine.

how long is overly long? the reason i'm using fast mul algorithms is because they are closer to O(n^(1+ε)) rather than O(n^2), e.g. even toom-2 (karatsuba) multiplication is O(n^1.5...) so on 4096-bit mul is very approximately 512 64-bit mul ops, whereas naive O(n^2) is about 4096 64-bit mul ops. toom-3 and higher are even less than that.

if we use an algorithm as our demo that is known to be >4x slower, even if our cpu has a 256-bit simd mul unit, it will take approximately 4x as much power for the same speed or 4x as long for the same power without the simd unit, so everyone will see our proposal as some kind of bad joke, or that we're just noobs who don't know what their doing.

the approximately 4000 instruction version of fast mul is what you get when you recursively inline the functions, if we instead leave them as uninlined recursive functions the amount of code used is substantially smaller.

in any case we'll still want montgomery multiplication (which is a relatively simple algorithm that we could just code in c to call our bigint mul algorithm), which transforms modular exponentiation from needing a 8192-bit bigint divmod at each step to only needing it once at the end, saving a substantial amount of time.
Comment 5 Luke Kenneth Casson Leighton 2023-03-29 08:25:19 BST
(In reply to Jacob Lifshay from comment #4)

> how long is overly long?

greater than fits into 3-4 cache lines, absolute maximum.
the goal here is to eliminate TLB and L1 cache lookups.
this is for an embedded scenario not a supercomputing
scenario.
Comment 6 Jacob Lifshay 2023-03-30 02:45:36 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> 
> > how long is overly long?
> 
> greater than fits into 3-4 cache lines, absolute maximum.
> the goal here is to eliminate TLB and L1 cache lookups.
I assume you mean misses rather than lookups.

> this is for an embedded scenario not a supercomputing
> scenario.

ok, using O(n^2) multiplication could work since embedded cares less about speed and we can assume rsa is infrequently run so power doesn't really matter. I also already have code for O(n^2) mul in bigint-presentation-code.
Comment 7 Jacob Lifshay 2023-09-08 04:06:47 BST
*** Bug 1153 has been marked as a duplicate of this bug. ***
Comment 8 Luke Kenneth Casson Leighton 2023-09-08 11:23:33 BST
jacob i am assigning this to you, the REMAP bignum should
be set up as a sub-bug

shriya apologies i have to reassign this as the cryptoprimitives
grant is under pressure of ending, and you are finishing your
dissertation.
Comment 9 shriya.sharma 2023-09-08 18:53:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> jacob i am assigning this to you, the REMAP bignum should
> be set up as a sub-bug
> 
> shriya apologies i have to reassign this as the cryptoprimitives
> grant is under pressure of ending, and you are finishing your
> dissertation.

Nw. Thankyou.
Comment 10 Jacob Lifshay 2023-09-14 07:32:50 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e044403f4c243b3248df0872a661756b6cc8a984

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:24:12 2023 -0700

    add SVP64 256x256->512-bit multiply

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ec1196cb3abedd28492be29546e52959dee1a030

Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Sep 13 23:22:33 2023 -0700

    generalize assemble() fn so other test cases can easily use it
Comment 11 Luke Kenneth Casson Leighton 2023-09-14 08:19:32 BST
(In reply to Jacob Lifshay from comment #10)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=e044403f4c243b3248df0872a661756b6cc8a984
> 
> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:24:12 2023 -0700
> 
>     add SVP64 256x256->512-bit multiply
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=ec1196cb3abedd28492be29546e52959dee1a030

all of these can go in a vertical-first loop, using
predication 0b1000100010001000 and sv.addi/pm=nn a,b,c
to "pretend" there is a pair of nested loops.

  30     "sv.adde *7, *7, *21",
  31     "addi 24, 0, 0",
  32     "sv.maddedu *20, *12, 19, 24",  # final partial-product a * b[3]
  33     "addc 7, 7, 20",
  34     "sv.adde *8, *8, *21",

i.e. sv.addi with scalars does exactly the same thing as the
non-prefixed equivalent except you get predication and elwidth
overrides

the extra sv.adde can add a zero'd out source, making the first
iteration orthogonal with all other iterations and consequently
the VF loop can be applied.

and you can use Indexed REMAP on the sv.maddubrs and sv.adde.
only pain being, you have to enable/disable svremap before each
so i suggest using "non-persistent" svremap 

also same as Andrey, all files require a copyright notice, you
know this https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13


> Author: Jacob Lifshay <programmerjake@gmail.com>
> Date:   Wed Sep 13 23:22:33 2023 -0700
> 
>     generalize assemble() fn so other test cases can easily use it

like it.
Comment 12 Luke Kenneth Casson Leighton 2023-09-14 08:54:07 BST
sigh, would be great to investigate this in a future grant
https://www.google.com/url?q=https://digital.library.adelaide.edu.au/dspace/bitstream/2440/101502/2/02whole.pdf
Comment 13 Jacob Lifshay 2023-09-15 04:20:04 BST
I started changing the register assignments to conform to the ABI regarding volatile/nonvolatile registers, since I want the modular exponentiation algorithm to be able to store temporaries in registers while calling the multiply algorithm, however, I discovered a scalar EXTRA2 encoding bug:
https://bugs.libre-soc.org/show_bug.cgi?id=1161

I spent the rest of the day trying to figure that bug out.

(In reply to Luke Kenneth Casson Leighton from comment #11)
> all of these can go in a vertical-first loop, using
> predication 0b1000100010001000 and sv.addi/pm=nn a,b,c
> to "pretend" there is a pair of nested loops.

I'll work on figuring that out and implementing it when I start working on the REMAP bug, other than changing the registers used, the current multiply algorithm is imho good enough for the purposes of this bug.

> 
> also same as Andrey, all files require a copyright notice, you
> know this https://bugs.libre-soc.org/show_bug.cgi?id=1126#c13

added.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=ca22d9687c91e764bb463ee776fb2f6efbf6eae9

commit ca22d9687c91e764bb463ee776fb2f6efbf6eae9
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:07:18 2023 -0700

    change registers used to avoid r13-31 which are reserved/nonvolatile
    
    broken -- blocked on https://bugs.libre-soc.org/show_bug.cgi?id=1161

commit 2bfd298a3844af197d265b74c38094a5edc397b1
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:06:59 2023 -0700

    pass in stack pointer

commit e329df6cbcda7752ab68db8c2713bbc795aa03ef
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 20:04:55 2023 -0700

    log more register read/writes to LogKind.InstrInOuts

commit 4996bc893db92aa80cefb8cb603739b8c5b9b859
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Sep 14 17:28:42 2023 -0700

    add copyright stuff
    
    [skip ci]
Comment 14 Jacob Lifshay 2023-09-27 05:51:54 BST
I started adding the divmod algorithm:
"extremely slow and simplistic shift and subtract algorithm"

I completed the algorithm, but there are some bugs I haven't fixed yet. WIP

(edit: add link)
https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=8444282d4c29bbeb19e84612c5f075d078d27f3b

commit 8444282d4c29bbeb19e84612c5f075d078d27f3b
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:41:58 2023 -0700

    working on adding divmod 512x256 to 256x256

commit 27a6ab7218e0347107cef53ab2badf21bcd14572
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Tue Sep 26 21:39:31 2023 -0700

    log writing CA[32]/OV[32] for OP_ADD
Comment 15 Jacob Lifshay 2023-09-27 09:22:26 BST
(In reply to Luke Kenneth Casson Leighton from bug #1175 comment #2)
> (In reply to Jacob Lifshay from bug #1175 comment #0)
> > since I want to use [mcrxrx].
> 
> ahh what for? can you put something about it on the mailing list
> and as usual crossref to bugreport(s)

I'm using mcrxrx to check if an unsigned bigint subtraction has a result that's less than zero (aka. it underflowed) .. the sv.subfe leaves XER.CA set if the result is >= 0, otherwise it's cleared. mcrxrx moves CA to a CR field, so I can branch on it.

basically, it's a combined subtraction-comparison all-in-one.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l112

see the corresponding lines in the python algorithm:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/powmod.py;h=c0d5e3161cf0421f2c829aa9ae7b36fe837004dc;hb=8444282d4c29bbeb19e84612c5f075d078d27f3b#l138
Comment 16 Luke Kenneth Casson Leighton 2023-09-27 11:31:44 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=22e9003

jacob what i'm doing here is a mini-code-morphed-demo investigating
the viability of using bigmul REMAP. i did the exact same thing with
test_caller_svp64_chacha20.py, am helping sadoon on bug #1157 / bug #1159
and ed25519 will need the same thing: looking for the patterns and
condensing them down so that bigmul REMAP (preferably without needing
Vertical-First) can just go *BLAM*, one instruction, the whole lot.

i don't think it's going to be possible to do without Vertical-First
but one "trick" of Vertical-First is that there is "end of inner loop"
notification when using "svstep."

by checking the *correct bit* of CR0 you can do branch-conditional
code that, say, performs that "extra add-with-carry"
this will be quite fascinating to see if that trick works.