Bug 942 - next things to work on -- bigint rsa mul algorithm for bigint presentation
Summary: next things to work on -- bigint rsa mul algorithm for bigint presentation
Status: IN_PROGRESS
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Conferences
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on:
Blocks:
 
Reported: 2022-10-04 18:46 BST by Jacob Lifshay
Modified: 2023-09-08 03:08 BST
3 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2022-10-04 18:46:11 BST
See also:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-January/005475.html
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/register.20allocating.20the.20stack/near/319529633

Original comment:
posting here because libre-soc-dev is currently broken.

Now that the nlnet deadline is effectively over (no time left for new work), I'd like to work on my bigint presentation:

If for some reason my presentation is rejected, this would just get published online somewhere instead.

I want to build a practical 2048-bit modular exponentiation function for RSA -- the subtasks should fit under https://bugs.libre-soc.org/show_bug.cgi?id=773 and probably other gigabit router subtasks too.
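As a rough illustration of the kernel being proposed (not the SVP64 implementation, and not code from the presentation repo), a Montgomery-multiplication-based square-and-multiply modular exponentiation in plain Python might look like this; all names and parameter choices here are illustrative:

```python
def mont_params(n):
    """Pick R = 2^k > n; n must be odd, as RSA moduli always are."""
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    return r, k, n_prime

def mont_mul(a, b, n, r, k, n_prime):
    """Return a*b*R^-1 mod n, for a, b already in Montgomery form (< n)."""
    t = a * b
    m = ((t & (r - 1)) * n_prime) & (r - 1)   # reduce mod R via masking
    u = (t + m * n) >> k                      # t + m*n is divisible by R
    return u - n if u >= n else u

def mont_pow(base, exp, n):
    """Square-and-multiply exponentiation using Montgomery multiplication."""
    r, k, n_prime = mont_params(n)
    x = (base * r) % n          # convert base to Montgomery form
    acc = r % n                 # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x, n, r, k, n_prime)
        x = mont_mul(x, x, n, r, k, n_prime)
        exp >>= 1
    return mont_mul(acc, 1, n, r, k, n_prime)  # convert back out
```

The point of the Montgomery form is that the inner loop needs only multiplies, masks, and shifts, with no division by the modulus, which is what makes it a good fit for a SIMD wide-mul unit.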

It would use Montgomery multiplication and a Toom-Cook multiplication-algorithm generator that we can instantiate with different parameters to find the most efficient variant. Toom-Cook covers all of: (Toom-1) traditional O(n^2) long multiplication, (Toom-2) O(n^1.58) Karatsuba multiplication, and higher-order algorithms such as Toom-3 at O(n^1.46).
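For reference, the Toom-2 (Karatsuba) case mentioned above can be sketched in a few lines of plain Python; the `limb_bits` base-case threshold is an illustrative stand-in for whatever hardware multiply width the generator would target:

```python
def karatsuba(x, y, limb_bits=64):
    """Toom-2: 3 half-size multiplies instead of 4, giving O(n^1.58)."""
    if x.bit_length() <= limb_bits or y.bit_length() <= limb_bits:
        return x * y  # base case: a single hardware-sized multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x0, x1 = x & mask, x >> half   # split x = x1*2^half + x0
    y0, y1 = y & mask, y >> half
    lo = karatsuba(x0, y0, limb_bits)
    hi = karatsuba(x1, y1, limb_bits)
    # the one "extra" multiply that replaces the two cross terms
    mid = karatsuba(x0 + x1, y0 + y1, limb_bits) - lo - hi
    return lo + (mid << half) + (hi << (2 * half))
```

Higher Toom-N variants follow the same evaluate/pointwise-multiply/interpolate pattern, which is where the interpolation-matrix inversion discussed later in this bug comes in.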

It would be run on a simplistic model CPU with one 256-bit SIMD add unit, one 256-bit wide-mul unit, and scalar units (the model would be adjustable). I would create a simulator for that; it would cover ALU pipeline timing and dependencies but not much else. The simulator would be designed to output timing info such that Blender can generate an animation of those ALUs executing the algorithm. It would probably be written in Python and use our existing ISACaller simulator to generate execution traces for it to run on.

I'm estimating this will take 3-4 weeks of work; it would probably be the most significant demo so far of SVP64 actually providing a significant speed-up (ISACaller doesn't count since it doesn't simulate actual CPU timing, and soc.git isn't far enough along to simulate multi-issue and operation merging for SIMD backends).

This would cover the second half of the bigint presentation; the first half would explain concepts.

What do you think?
Comment 1 Luke Kenneth Casson Leighton 2022-10-04 19:02:47 BST
(In reply to Jacob Lifshay from comment #0)

> It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> would create a simulator for that, the simulator would cover ALU pipeline
> timing and dependencies but not much else.

everything but this is a great idea.  there's no point spending time
on doing hardware-cycle-accurate pipelining when cavatools is specifically
designed to do exactly that, and already has the infrastructure in place.
any time spent on developing any hardware-cycle-accurate simulations is
therefore both time and money completely wasted.

also as i have repeated many times timing is *not* part of the bitmanip
NLnet proposal because the financial cost to NLnet would be an order of
magnitude larger and put the entire project at risk to even attempt to
tackle timing.

     *this is not up for discussion*.

please do not *under any circumstances* spend *any* time attempting to
show timing or any level of performance other than reduction in instruction
count and reduction in complexity at this very early stage.

the entire focus at the moment is on power reduction and complexity reduction.
*NOT* in ANY WAY on TIMING.  at all.

the algorithm implementations, great idea.
Comment 2 Jacob Lifshay 2022-10-04 19:52:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
> 
> > It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> > 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> > would create a simulator for that, the simulator would cover ALU pipeline
> > timing and dependencies but not much else.
> 
> everything but this is a great idea.  there's no point spending time
> on doing hardware-cycle-accurate pipelining when cavatools is specifically
> designed to do exactly that, and already has the infrastructure in place.
> any time spent on developing any hardware-cycle-accurate simulations is
> therefore both time and money completely wasted.

the point is to have pretty animations of how data moves in the cpu so people understand how the wide-mul merging works (which requires something to construct those animations), not to have specific timing for exact performance comparisons.

the sw to construct those animations is the simplistic SIMD ALU simulator -- imho spending a week writing that is waay better than creating animations by hand that only cover 1 specific run of 1 specific algorithm and is likely to have mistakes because it was done by hand.
Comment 3 Jacob Lifshay 2022-10-04 20:03:47 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > (In reply to Jacob Lifshay from comment #0)
> > 
> > > It would be run on a simplistic model cpu with 1 256-bit simd add unit and 1
> > > 256-bit wide-mul unit (the model would be adjustable) and scalar units, I
> > > would create a simulator for that, the simulator would cover ALU pipeline
> > > timing and dependencies but not much else.
> > 
> > everything but this is a great idea.  there's no point spending time
> > on doing hardware-cycle-accurate pipelining when cavatools is specifically
> > designed to do exactly that, and already has the infrastructure in place.
> > any time spent on developing any hardware-cycle-accurate simulations is
> > therefore both time and money completely wasted.
> 
> the point is to have pretty animations of how data moves in the cpu so
> people understand how the wide-mul merging works (which requires something
> to construct those animations), not to have specific timing for exact
> performance comparisons.
> 
> the sw to construct those animations is the simplistic SIMD ALU simulator --
> imho spending a week writing that is waay better than creating animations by
> hand that only cover 1 specific run of 1 specific algorithm and is likely to
> have mistakes because it was done by hand.

It is not a waste of time or money because it has a completely different purpose and deadline than cavatools: it generates animations rather than trying to quickly simulate a whole program, and it is simplistic, ignoring cache latency, cache hits, branch mispredictions, scalar-op latency, etc. cavatools is much more detailed and complete, won't be ready any time soon, and won't produce Blender animations, so it is unsuitable for a demo for the bigint presentation.
Comment 4 Luke Kenneth Casson Leighton 2022-10-04 21:36:15 BST
(In reply to Jacob Lifshay from comment #2)

> the point is to have pretty animations of how data moves in the cpu so
> people understand how the wide-mul merging works (which requires something
> to construct those animations), not to have specific timing for exact
> performance comparisons.

ah ok then that's different - you might find this approach useful rather
than attempting to use (or create a dependency on) a massive software
suite named "blender"

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/dct_butterfly_svg.py;hb=HEAD
Comment 5 Jacob Lifshay 2022-10-05 03:56:15 BST
Luke, can you create a bigint-presentation-code.git repo for the code for this? it would include the animation code and the toom-cook multiplication algorithm generator since that needs matrix inversion
Comment 6 Luke Kenneth Casson Leighton 2022-10-05 18:47:14 BST
(In reply to Jacob Lifshay from comment #5)
> Luke, can you create a bigint-presentation-code.git repo for the code for
> this? it would include the animation code and the toom-cook multiplication
> algorithm generator since that needs matrix inversion

yep done.

matrix inversion, do you mean mirror/transpose? xy inversion
options on sv.remap allow for dimension "mirroring" but unfortunately
there are not enough bits spare to enable them in actual instructions,
unless we go 64-bit (EXT001 NOT svp64) i.e. a "psvshape" instruction.

temporarily you can blat manual values into the SVSHAPE SPRs,
or you can use fmv/mrr with svshape2 enabling the inversion of
the other dimension.  mrr will run in reverse, but the yx field
will invert the Y direction.  have to try it and see.
Comment 7 Jacob Lifshay 2022-10-05 19:19:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > Luke, can you create a bigint-presentation-code.git repo for the code for
> > this? it would include the animation code and the toom-cook multiplication
> > algorithm generator since that needs matrix inversion
> 
> yep done.

thx
> 
> matrix inversion, do you mean mirror/transpose?

no, i mean like the matrix equivalent of 1 / x.
https://en.wikipedia.org/wiki/Invertible_matrix
it happens as part of generating the assembly program for toom-cook so using svp64 isn't helpful there.
Comment 8 Luke Kenneth Casson Leighton 2022-10-05 20:12:21 BST
(In reply to Jacob Lifshay from comment #7)

> no, i mean like the matrix equivalent of 1 / x.
> https://en.wikipedia.org/wiki/Invertible_matrix
> it happens as part of generating the assembly program for toom-cook so using
> svp64 isn't helpful there.

hey i remember doing those, urr 35 years ago so uhn not so helpful :)
i have vague recollections about calculating them 

https://stackoverflow.com/questions/32114054/matrix-inversion-without-numpy#51214047

that looks pretty reasonable, still using matrix remap schedules,
including the triple loop

                for k in range(2*n_Columns):
                    Matrix[j][k] = Matrix[j][k] - ratio * Matrix[i][k]

with matrix remap you *don't* have to set the same target regs as is
set in the example, you can use svremap to arbitrarily target any
operation and any regs *in* that operation.

the use of predication would be hairraising though.
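The quoted inner loop is the row-elimination step of Gauss-Jordan. Since (per the following comments) this runs at generation time in plain Python, a self-contained sketch over exact rationals, so pivot scaling introduces no rounding error, might look like this; the function name and structure are illustrative, not from the repo:

```python
from fractions import Fraction

def invert(matrix):
    """Gauss-Jordan inversion of a square matrix over exact rationals."""
    n = len(matrix)
    # build the augmented matrix [M | I]
    a = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(matrix)]
    for i in range(n):
        # find a nonzero pivot (Toom-Cook evaluation matrices are invertible)
        p = next(r for r in range(i, n) if a[r][i] != 0)
        a[i], a[p] = a[p], a[i]
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]   # scale pivot row to 1
        for j in range(n):
            if j != i and a[j][i] != 0:
                ratio = a[j][i]
                # same inner loop as the quoted snippet, over 2n columns
                a[j] = [a[j][k] - ratio * a[i][k] for k in range(2 * n)]
    return [row[n:] for row in a]   # right half is now M^-1
```

Using `fractions.Fraction` rather than floats matters because Toom-Cook interpolation involves small divisors (2, 3, 6, ...) and the generated coefficients must be exact.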
Comment 9 Jacob Lifshay 2022-10-05 20:46:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #7)
> 
> > no, i mean like the matrix equivalent of 1 / x.
> > https://en.wikipedia.org/wiki/Invertible_matrix
> > it happens as part of generating the assembly program for toom-cook so using
> > svp64 isn't helpful there.
> 
> hey i remember doing those, urr 35 years ago so uhn not so helpful :)
> i have vague recollections about calculating them 
> 
> https://stackoverflow.com/questions/32114054/matrix-inversion-without-
> numpy#51214047
> 
> that looks pretty reasonable, still using matrix remap schedules,

except that the matrix ops will be executed while creating the assembly program, not as part of the assembly program. this is essentially compile-time computation, so svp64 matrix ops aren't needed, a plain python function will do best.
Comment 10 Luke Kenneth Casson Leighton 2022-10-05 21:16:06 BST
(In reply to Jacob Lifshay from comment #9)

> except that the matrix ops will be executed while creating the assembly
> program, not as part of the assembly program. this is essentially
> compile-time computation, so svp64 matrix ops aren't needed, a plain python
> function will do best.

aw doh :)
Comment 11 Jacob Lifshay 2022-10-06 06:14:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > Luke, can you create a bigint-presentation-code.git repo for the code for
> > this? it would include the animation code and the toom-cook multiplication
> > algorithm generator since that needs matrix inversion
> 
> yep done.

Can you set it as public? The GitLab mirroring scripts (needed for GitLab CI) try to clone via HTTPS, but it failed:
git clone https://git.libre-soc.org/git/bigint-presentation-code.git
Cloning into 'bigint-presentation-code'...
fatal: repository 'https://git.libre-soc.org/git/bigint-presentation-code.git/' not found
Comment 12 Jacob Lifshay 2022-10-06 06:20:48 BST
commit 05c03eb9551619d9a61874cd7639020e49a1999f (HEAD -> master, origin/master)
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Oct 5 22:03:33 2022 -0700

    add Matrix class

I added a Matrix class for matrices of rational numbers; it includes matrix multiply, inverse, and most of the other common ops.
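The actual class lives in the (not yet public) repo; a minimal sketch of the core idea, exact rational entries via `fractions.Fraction`, purely for illustration:

```python
from fractions import Fraction

class Matrix:
    """Minimal rational-matrix sketch; not the repo's actual class."""
    def __init__(self, rows):
        self.rows = [[Fraction(v) for v in r] for r in rows]

    def __matmul__(self, other):
        # standard row-by-column product; zip(*rows) iterates columns
        return Matrix([[sum(a * b for a, b in zip(row, col))
                        for col in zip(*other.rows)] for row in self.rows])

    def __eq__(self, other):
        return self.rows == other.rows

    @staticmethod
    def identity(n):
        return Matrix([[Fraction(int(i == j)) for j in range(n)]
                       for i in range(n)])
```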

Unfortunately no one can see the code because the repo isn't yet public...as mentioned in the previous comment.

Later, I'll get around to splitting this bug up into sub-bugs that can be assigned funding.