***** PLEASE NOTE: a review/audit (bug #1171) showed a "seeming" discrepancy where in fact it is just an artefact of this RFP *pre-dating* the automated NLnet RFP system. the record is CORRECT, and is in NLnet's system as "2023-06-20 EUR 700 t-1 Jacob Lifshay" *****

Discuss and decide on what cryptographic or other primitives are the most beneficial, simple, or otherwise well-suited for us to implement, since there are too many in existence for us to implement them all.

* chacha20+poly1305 specification RFC referenced by wireguard:
  https://www.rfc-editor.org/rfc/rfc7539
* chacha20 reference implementations:
  http://cr.yp.to/streamciphers/timings/estreambench/submissions/salsa20/chacha8/ref/chacha.c
  https://github.com/pts/chacha20/blob/master/chacha20_python3.py
  https://elixir.bootlin.com/linux/latest/source/arch/x86/crypto/chacha-avx2-x86_64.S
  https://hg.mozilla.org/projects/nss/file/764124fddaa2a908ffa8222732d31829b71c357f/lib/freebl/chacha20-ppc64le.S
  https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;hb=HEAD
* ec25519 looks very straightforward: fixed-size (VL), meaning straight scalar-vector is a single instruction (add/sub is vector-vector).
* RSA TODO
* twofish (if chosen?)
  https://www.cartotype.com/assets/downloads/twofish/
  https://www.oryx-embedded.com/doc/twofish_8c_source.html#l00335
https://www.oryx-embedded.com/doc/curve25519_8c_source.html

now that we have bigint add, sub, shift, mul (and soon sv.divmod), and only need QTY4 64-bit registers, ecc25519 looks stunningly simple to do.
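to illustrate just how simple: a hedged Python sketch (not the project's code; the names fe_mul, P and MASK are invented here) of the curve25519 field multiply, showing that it needs only the bigint mul, shift, add and sub listed above.

```python
# Hypothetical sketch (names invented, not from the repo): curve25519
# field multiplication mod p = 2^255 - 19 using only bigint mul, shift,
# add, sub -- exactly the primitives now available in SVP64 bigint.

P = (1 << 255) - 19
MASK = (1 << 255) - 1

def fe_mul(a, b):
    t = a * b                          # bigint mul (sv.maddedu-style chain)
    t = (t & MASK) + 19 * (t >> 255)   # fold high half: 2^255 == 19 (mod p)
    t = (t & MASK) + 19 * (t >> 255)   # second fold absorbs the carry
    if t >= P:                         # one final conditional subtract
        t -= P
    return t
```

the reduction is just "shift, small mul, add" because 2^255 is congruent to 19 mod p, which is exactly the shifting-and-add style of remainder mentioned later in this thread for poly1305.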
the other obvious one (jacob already working on it) is RSA. that just leaves one other candidate algorithm to pick, and my feeling is it should be an uncommon, non-hardware-optimised one that does not need either elwidth overrides or REMAP, as those would make both features a hard dependency on TestIssuer. something needing some of the xperm instructions, grevluti, and other bitmanip would be good.
https://cryptobook.nakov.com/symmetric-key-ciphers/popular-symmetric-algorithms

looks like chacha is commonly used, although it is almost laughably simple,
https://en.m.wikipedia.org/wiki/Salsa20
comprising add, shift and xor. i'm not sure it's a worthwhile demo.

this looks much more interesting as it has bit-swapping:
https://en.m.wikipedia.org/wiki/Serpent_(cipher)
https://www.oryx-embedded.com/doc/twofish_8c_source.html

twofish would also be really good: it uses GF(2^8) with a prime polynomial, reduction, and several other things. the only issue is that in TestIssuer elwidth overrides would be needed. it is still worthwhile as a candidate for running in the Simulator.
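for concreteness, a sketch of the kind of GF(2^8) arithmetic involved (gf_mul is a name invented here; the generic shift-and-xor reduction is standard, and 0x169 = x^8 + x^6 + x^5 + x^3 + 1 is the polynomial cited for twofish's MDS matrix field):

```python
# Generic GF(2^8) multiply by shift-and-xor reduction (illustrative
# sketch; gf_mul is not from any referenced codebase). The default
# modulus 0x169 is the polynomial cited for twofish's MDS field.

def gf_mul(a, b, poly=0x169):
    r = 0
    while b:
        if b & 1:
            r ^= a      # "add" (xor) the current partial product
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly   # reduce back into 8 bits mod the field polynomial
    return r
```

the conditional-xor reduction step is precisely the sort of bit-level operation that makes this a better stress-test of bitmanip instructions than pure add/rotate ciphers.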
imho we should implement chacha20-poly1305 -- a very commonly used AEAD, used by Wireguard and ssh and tls and more. imho we should implement the wireguard variant.
https://www.rfc-editor.org/rfc/rfc7539

poly1305 is quite simple:

   clamp(r): r &= 0x0ffffffc0ffffffc0ffffffc0fffffff
   poly1305_mac(msg, key):
      r = le_bytes_to_num(key[0..15])
      clamp(r)
      s = le_num(key[16..31])
      a = 0  # accumulator
      p = (1<<130)-5
      for i=1 upto ceil(msg length in bytes / 16)
         n = le_bytes_to_num(msg[((i-1)*16)..(i*16)] | [0x01])
         a += n
         a = (r * a) % p
      end
      a += s
      return num_to_16_le_bytes(a)
   end

the remainder op can be done using shifting and add and sub iirc, so isn't super slow
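the pseudocode above goes straight into Python (a sketch following RFC 7539 section 2.5; the helper behaviour mirrors the RFC's own pseudocode rather than any particular library):

```python
# poly1305 MAC, directly from the RFC 7539 pseudocode above (sketch).

def clamp(r):
    return r & 0x0ffffffc0ffffffc0ffffffc0fffffff

def poly1305_mac(msg, key):
    r = clamp(int.from_bytes(key[0:16], "little"))
    s = int.from_bytes(key[16:32], "little")
    a = 0                       # the accumulator
    p = (1 << 130) - 5
    for i in range(0, len(msg), 16):
        # append the 0x01 byte to each (up to) 16-byte block
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        a = (r * (a + n)) % p
    a += s
    return (a & ((1 << 128) - 1)).to_bytes(16, "little")
```

this can be checked against the test vector in RFC 7539 section 2.5.2. note the hot loop is one 130ish-bit add, one 130x124-bit mul and one reduction per 16 bytes: all bigint territory.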
(In reply to Jacob Lifshay from comment #5)
> imho we should implement chacha20-poly1305 -- a very commonly used AEAD,
> used by Wireguard and ssh and tls and more. imho we should implement the
> wireguard variant.

it's so simple that there's no point.

> the remainder op can be done using shifting and add and sub iirc,

the entire purpose of this Grant is to show *new* instructions and vector concepts that are entirely general-purpose, and how they reduce instruction count. as these algorithms have been specifically designed using *already existing* general-purpose instructions there is zero benefit to spending grant money or time on them.

by total contrast the bigint ones are an astonishing resounding success: the emergence of sv.adde as a vector-vector add, the 64-bit carry on divmod2deu and madded? so simple and effective.

serpent (or twofish? the one with the Feistel Network), again by contrast, uses an add-with-shift (a + b<<1) which was intended to be added for general-purpose address computations. turns out it saves one instruction. looking for things like that is what this grant is about. there are none - at all - in chacha - that i can see. exactly as you found, it is all about rotate, add, subtract.

btw general-purpose is incredibly important, to avoid classification under most Govts' "Weapons Export" Laws.
correction: i just found something interesting / intriguing:
https://www.oryx-embedded.com/doc/chacha_8c_source.html

   //ChaCha runs 8, 12 or 20 rounds, alternating between column rounds and
   //diagonal rounds
   for(i = 0; i < context->nr; i += 2)
   {
      //The column rounds apply the quarter-round function to the four
      //columns, from left to right
      CHACHA_QUARTER_ROUND(w[0], w[4], w[8], w[12]);
      CHACHA_QUARTER_ROUND(w[1], w[5], w[9], w[13]);
      CHACHA_QUARTER_ROUND(w[2], w[6], w[10], w[14]);
      CHACHA_QUARTER_ROUND(w[3], w[7], w[11], w[15]);

      //The diagonal rounds apply the quarter-round function to the top-left,
      //bottom-right diagonal, followed by the pattern shifted one place to
      //the right, for three more quarter-rounds
      CHACHA_QUARTER_ROUND(w[0], w[5], w[10], w[15]);
      CHACHA_QUARTER_ROUND(w[1], w[6], w[11], w[12]);
      CHACHA_QUARTER_ROUND(w[2], w[7], w[8], w[13]);
      CHACHA_QUARTER_ROUND(w[3], w[4], w[9], w[14]);
   }

that's a rotated matrix multiply that *might* fit with a combination of Matrix REMAP and Indexed REMAP. it would almost certainly have to be Vertical-First Mode as there is a dependency chain on the QUARTERROUND. whilst it would be tempting to make a double copy of w[] it would be much more interesting to see what REMAP can do, here.

   //ChaCha quarter-round function
   #define CHACHA_QUARTER_ROUND(a, b, c, d) \
   { \
      a += b; \
      d ^= a; \
      d = ROL32(d, 16); \
      c += d; \
      b ^= c; \
      b = ROL32(b, 12); \
      a += b; \
      d ^= a; \
      d = ROL32(d, 8); \
      c += d; \
      b ^= c; \
      b = ROL32(b, 7); \
   }

so there are interdependencies created between those, but pipelineable. however they are all in the exact same order, meaning that some REMAP Index offsets (QTY 32 per operand) could be used (because 2 blocks on w[]) with elwidth=8 on Indexed REMAP: that is QTY4 64-bit regs to store 32 8-bit Indices. *BUT*... 5 are needed (4 is a b c d above) because of the magic ROL32 constants. or wait, no, it is Vertical-First: they can be selected using scalar. that's doable. the entire inner loop crashes down to around...
8 instructions which fits into a single Cache-Line. holy cow. yeah that's worth exploring.
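for reference while exploring the REMAP schedules, the double round above in plain Python (a sketch; rol32/quarter_round/double_round are names chosen here, not from the repo, and it can be checked against the quarter-round test vector in RFC 7539 section 2.1.1):

```python
# The column/diagonal double round above, in plain Python (sketch).

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def quarter_round(w, a, b, c, d):
    w[a] = (w[a] + w[b]) & 0xffffffff; w[d] = rol32(w[d] ^ w[a], 16)
    w[c] = (w[c] + w[d]) & 0xffffffff; w[b] = rol32(w[b] ^ w[c], 12)
    w[a] = (w[a] + w[b]) & 0xffffffff; w[d] = rol32(w[d] ^ w[a], 8)
    w[c] = (w[c] + w[d]) & 0xffffffff; w[b] = rol32(w[b] ^ w[c], 7)

def double_round(w):
    for i in range(4):              # column rounds, left to right
        quarter_round(w, i, i + 4, i + 8, i + 12)
    quarter_round(w, 0, 5, 10, 15)  # diagonal rounds
    quarter_round(w, 1, 6, 11, 12)
    quarter_round(w, 2, 7, 8, 13)
    quarter_round(w, 3, 4, 9, 14)
```

the index patterns in double_round are exactly what the REMAP Index tables would have to encode.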
   d = ROL32(d, 16); \
   b = ROL32(b, 12); \
   d = ROL32(d, 8); \
   b = ROL32(b, 7); \

   x = 0x100c0807  # magic constant
   x = ROL32(x, 8) # in loop

couldn't be easier. no need to even ANDi that because the other ROL32s will cut off the top bits automatically. also it is a lot less regs for REMAP Indices, because there are 8 QUARTER rounds: that is one 64-bit reg in elwidth=8, times four, is *only* QTY 4of 64-bit regs.

ah, and there are not 4 separate applications a b c d, there is just "the 16TH_ROUND part" i.e. they all meet this pattern:

   r0 += r1; \
   r2 ^= r0; \
   r2 = ROL32(r2, r3);

that's doable with 4 Indices! holy cow. this entire thing will collapse down to 5 instructions inside a loop: one sv.add, one sv.XOR, one sv.ROL32, svstep, and bc in CTR mode. that's almost obscene and laughable.
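one plausible reading of the magic-constant trick, sketched in Python (an assumption about the intended extraction order, not taken from any repo code): the four shift amounts 16, 12, 8, 7 live one per byte in 0x100c0807, and rotating by 8 at the top of each step brings the next amount into the low byte, where rlwnm's shift field (low 5 bits) would pick it up.

```python
# Sketch of the packed shift-constant idea: 0x100c0807 holds 16, 12, 8, 7
# one per byte; ROL32 by 8 each step exposes the next amount in the low
# byte. rlwnm only reads the low 5 bits, so no ANDI is needed.

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

x = 0x100c0807
shifts = []
for _ in range(4):
    x = rol32(x, 8)          # "in loop": rotate before each use
    shifts.append(x & 0x1f)  # low 5 bits, as rlwnm's shift would see
# shifts comes out as [16, 12, 8, 7] -- the quarter-round rotate amounts
```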
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > imho we should implement chacha20-poly1305 -- a very commonly used AEAD,
> > used by Wireguard and ssh and tls and more. imho we should implement the
> > wireguard variant.
>
> it's so simple that there's no point.

there *is* a point, because demonstrating that svp64+biginteger can make chacha20+poly1305 run much faster is highly significant, because that is a big part of the processing required for wireguard and other protocols. picking random example numbers, it would be hugely significant if you could encrypt/decrypt packets at 2GiB/s rather than 500MiB/s per core.

> there are none - at all - in chacha - that i can see. exactly as you found,
> it is all about rotate, add, subtract.

nearly everything i described in comment #5 was talking about poly1305, not chacha. those are all 320- and 192-bit bigint ops (rounded up to the nearest multiple of 64 bits) that svp64 + bigint definitely accelerates.
(In reply to Jacob Lifshay from comment #9)
> there *is* a point,

it's always a good idea to read ahead, all messages, in full. i already worked that out :)

> run much faster

please understand and accept that the purpose of the exercise, under this Grant, is *not* processor speed, it is instruction count reduction and thus efficiency and power reduction. speed is an arbitrary factor based on a direct near-linear relationship with how much back-end silicon is thrown at the problem, whereas power consumption is not.

> is very highly significant

anyone may throw more silicon down and claim "it's significant". every datasheet for hard macros also contains power consumption figures and these are what is much more critical. SV being completely abstract and an architecturally independent ISA, the only thing we can possibly claim right across the board of all possible microarchitectures is that power consumption is reduced as a direct result of the reduction in instruction count.

> nearly everything i described in comment #5 was talking about poly1305, not
> chacha. those are all 320 and 192-bit bigint ops (rounded up to nearest
> multiple of 64-bits) that svp64 + bigint definitely accelerates.

ok. i needed to know that. i have (had) no idea what poly1305 is, and it would be helpful to have crossreferences (comment 0) to reference code, and so on, yeh?
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #9)
>
> > there *is* a point,
>
> it's always a good idea to read ahead, all messages, in full.
> i already worked that out :)

i'd read all messages, my point was different than the one you arrived at...

> > run much faster
>
> please understand and accept that the purpose of the exercise,
> under this Grant, is *not* processor speed, it is instruction
> count reduction and thus efficiency and power reduction.

i assumed that you would infer I meant using instruction count and probable implementation strategy as a proxy for speed.

> speed is an arbitrary factor based on a direct near-linear
> relationship

it's linear only at the low end, it gets very non-linear very quickly, e.g. going from 4-wide to 8-wide issue doesn't give a 2x speedup, it's closer to somewhere around 1.2x. also, serial dependencies usually prevent wider silicon from giving much speedup, if at all, so that's a very good reason to work on reducing instruction count and/or switch to instructions known to be easily implementable with much higher performance, such as sv.maddedu rather than a complex loop of adde/maddld/maddhdu instructions.

> with how much back-end silicon is thrown at the
> problem, whereas power consumption is not.
>
> > is very highly significant
>
> anyone may throw more silicon down and claim "it's significant".
> every datasheet for hard macros also contains power consumption
> figures and these are what is much more critical.
> SV being completely abstract and an architecturally independent
> ISA the only thing we can possibly claim right across the board

then don't claim right across the board. maybe claim that with the appropriate microarchitecture (128x64->192-bit wide-mul for sv.maddedu rather than u64x2 simd mul, or even more if working with 256-bit simd) it gets a huge speedup, much more than spending the same silicon on additional scalar performance (you'd definitely hit diminishing returns by that point, you'd only have enough additional for maybe 1 alu where you likely already have several). maybe claim that it doesn't have terrible performance issues for smaller microarchitectures if svp64 is implemented in hardware (e.g. excluding trap & emulate).
(In reply to Luke Kenneth Casson Leighton from comment #10) > and it would be helpful to have crossreferences (comment 0) to reference > code, and so on, yeh? added
(In reply to Jacob Lifshay from comment #11)
> i assumed that you would infer I meant using instruction count and probable
> implementation strategy as a proxy for speed.

no, as it is industry-standard practice to differentiate explicitly between performance (usually Dhrystone MIPS) and performance/watt.

> > SV being completely abstract and an architecturally independent
> > ISA the only thing we can possibly claim right across the board
>
> then don't claim right across the board,

i had to do the analysis for the Due Diligence, so i know that it is.

> have several). maybe claim that it doesn't have terrible performance issues
> for smaller microarchitectures if svp64 is implemented in hardware (e.g.
> excluding trap & emulate)

these still reduce instruction count, and performance is the least concern, which is precisely why i made it clear that it is imperative *not* to focus on performance. also the implementor is free and clear to implement precisely and exactly the hardware desired for the specific (sole) application for which that silicon is exclusively dedicated. if they fail to add the required instructions and damage performance as a direct result, that is their problem to fix.

right across *all* of these options, reduction in instruction count is the sole constant that reduces power consumption (except where implementors fail to do a decent job, which is, again, out of scope).

summary: performance claims, it is imperative that we make none WHATSOEVER. the only claim we can reasonably make is "performance/watt reduction", and we must steer well clear of stating what that is, because it is too implementation-specific. we have to *have* an implementation in order to measure it, and we are about 2 years away from that.

so please: *do not* make performance claims, do not make proposals containing performance claims, do not make presentations showing performance claims, do not publicly state anywhere in any discussions that performance is claimed to be greater.
you will be required to provide proof, and get us into trouble for being unable to back up the claim, or, worse, commit us to needing to provide proof of that claim, when we have zero funding to be able to do so.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> r0 += r1; \
> r2 ^= r0; \
> r2 = ROL32(r2, r3);

these are (using reg nums to indicate which REMAP source is needed):

   add   r0, r0, r1          # RT, RA, RB
   xor   r2, r2, r0          # RT, RA, RB
   rlwnm r2, r2, r3, 0, 31   # rlwnm RA,RS,RB,MB,ME (Rc=0)

however they are all overlapping regs with the exception of RS, meaning a sequence of svremaps is needed before each operation. slightly annoying but doable. example: RT and RA are REMAP0 in the add, but RT must be REMAP2 in the xor. by turning things around they can line up, reducing the number of svremaps:

   svremap RT=0,RA=1,RB=0
   add   r0, r1, r0          # RT, RA, RB
   svremap RT=2,RA=2,RB=0    # RB stays = 0
   xor   r2, r2, r0          # RT, RA, RB
   svremap RS=2,RA=2,RB=3    # RA stays = 2
   rlwnm r2, r2, r3, 0, 31   # rlwnm RA,RS,RB,MB,ME (Rc=0)

conclusion: hmmm, likely just going to have to go with one svremap in front of every instruction, oh well.
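to sanity-check the index tables before writing the svremap sequences, the whole double round can be modelled in plain Python as exactly three "instructions" driven by index lists (a sketch of the idea only, not real SVP64 semantics; all names here are invented):

```python
# Plain-Python model of the add/xor/rlwnm triplet driven by index tables
# (a sketch of the Indexed-REMAP idea, not actual SVP64 behaviour).

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def indexed_rounds(w, qrounds):
    # expand each quarter-round (a, b, c, d) into four (r0, r1, r2, sh)
    # triplets, all matching: r0 += r1; r2 ^= r0; r2 = ROL32(r2, sh)
    idx = []
    for a, b, c, d in qrounds:
        idx += [(a, b, d, 16), (c, d, b, 12), (a, b, d, 8), (c, d, b, 7)]
    for t0, t1, t2, sh in idx:
        w[t0] = (w[t0] + w[t1]) & 0xffffffff  # the sv.add, via indices
        w[t2] = rol32(w[t2] ^ w[t0], sh)      # the sv.xor + sv.rlwnm

def double_round(w):
    indexed_rounds(w, [(0, 4, 8, 12), (1, 5, 9, 13),    # column rounds
                       (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12),  # diagonal rounds
                       (2, 7, 8, 13), (3, 4, 9, 14)])
```

the three statements in the inner loop correspond to the one sv.add, one sv.xor and one sv.rlwnm, with the idx list playing the role of the REMAP Index registers.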
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=f4d15884e57d22c0c585f0fcbbb018b90baa2d9c;hb=4f7218a8e3579cbb96fe254495033765f4d772ab#l536 principle for chacha20 round works great. 2nd loop using CTR should work perfectly.
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;hb=HEAD 2nd CTR-based loop works perfectly.