***** PLEASE NOTE: a review/audit (bug #1171) showed a "seeming" discrepancy where in fact it is just an artefact of this RFP *pre-dating* the automated NLnet RFP system. the record is CORRECT, and is in NLnet's system as "2023-06-20 EUR 700 t-1 Jacob Lifshay" *****

Discuss and decide on what cryptographic or other primitives are the most beneficial, simple, or otherwise well-suited for us to implement, since there are too many in existence for us to implement them all.

* chacha20+poly1305 specification RFC referenced by wireguard:
  https://www.rfc-editor.org/rfc/rfc7539
* chacha20 reference implementations:
  http://cr.yp.to/streamciphers/timings/estreambench/submissions/salsa20/chacha8/ref/chacha.c
  https://github.com/pts/chacha20/blob/master/chacha20_python3.py
  https://elixir.bootlin.com/linux/latest/source/arch/x86/crypto/chacha-avx2-x86_64.S
  https://hg.mozilla.org/projects/nss/file/764124fddaa2a908ffa8222732d31829b71c357f/lib/freebl/chacha20-ppc64le.S
  https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;hb=HEAD
* ec25519 looks very straightforward: fixed-size (VL), meaning straight scalar-vector is a single instruction (add/sub is vector-vector).
* RSA TODO
* twofish (if chosen?)
  https://www.cartotype.com/assets/downloads/twofish/
  https://www.oryx-embedded.com/doc/twofish_8c_source.html#l00335
https://www.oryx-embedded.com/doc/curve25519_8c_source.html

now that we have bigint add, sub, shift, mul (and soon sv.divmod), and only need QTY4 64-bit registers, ecc25519 looks stunningly simple to do.
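to illustrate just how simple: a hedged Python sketch (not the project's code; the names fe_mul, P and MASK are invented here) of the curve25519 field multiply, showing that it needs only the bigint mul, shift, add and sub listed above.

```python
# Hypothetical sketch (names invented, not from the repo): curve25519
# field multiplication mod p = 2^255 - 19 using only bigint mul, shift,
# add, sub -- exactly the primitives now available in SVP64 bigint.

P = (1 << 255) - 19
MASK = (1 << 255) - 1

def fe_mul(a, b):
    t = a * b                          # bigint mul (sv.maddedu-style chain)
    t = (t & MASK) + 19 * (t >> 255)   # fold high half: 2^255 == 19 (mod p)
    t = (t & MASK) + 19 * (t >> 255)   # second fold absorbs the carry
    if t >= P:                         # one final conditional subtract
        t -= P
    return t
```

the reduction is just "shift, small mul, add" because 2^255 is congruent to 19 mod p, which is exactly the shifting-and-add style of remainder mentioned later in this thread for poly1305.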
the other obvious one (jacob already working on it) is RSA. that just leaves one other candidate algorithm to pick, and my feeling is it should be an uncommon, non-hardware-optimised one that does not need either elwidth overrides or REMAP, as those would make both features a hard dependency on TestIssuer. something needing some of the xperm instructions, grevluti, and other bitmanip would be good.
https://cryptobook.nakov.com/symmetric-key-ciphers/popular-symmetric-algorithms

looks like chacha is commonly used, although it is almost laughably simple,
https://en.m.wikipedia.org/wiki/Salsa20
comprising add, shift and xor. i'm not sure it's a worthwhile demo.

this looks much more interesting as it has bit-swapping:
https://en.m.wikipedia.org/wiki/Serpent_(cipher)
https://www.oryx-embedded.com/doc/twofish_8c_source.html

twofish would also be really good: it uses GF(2^8) with a prime polynomial, reduction, and several other things. the only issue is that in TestIssuer elwidth overrides would be needed. it is still worthwhile as a candidate for running in the Simulator.
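for concreteness, a sketch of the kind of GF(2^8) arithmetic involved (gf_mul is a name invented here; the generic shift-and-xor reduction is standard, and 0x169 = x^8 + x^6 + x^5 + x^3 + 1 is the polynomial cited for twofish's MDS matrix field):

```python
# Generic GF(2^8) multiply by shift-and-xor reduction (illustrative
# sketch; gf_mul is not from any referenced codebase). The default
# modulus 0x169 is the polynomial cited for twofish's MDS field.

def gf_mul(a, b, poly=0x169):
    r = 0
    while b:
        if b & 1:
            r ^= a      # "add" (xor) the current partial product
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly   # reduce back into 8 bits mod the field polynomial
    return r
```

the conditional-xor reduction step is precisely the sort of bit-level operation that makes this a better stress-test of bitmanip instructions than pure add/rotate ciphers.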
imho we should implement chacha20-poly1305 -- a very commonly used AEAD, used by Wireguard and ssh and tls and more. imho we should implement the wireguard variant.
https://www.rfc-editor.org/rfc/rfc7539

poly1305 is quite simple:

   clamp(r): r &= 0x0ffffffc0ffffffc0ffffffc0fffffff
   poly1305_mac(msg, key):
      r = le_bytes_to_num(key[0..15])
      clamp(r)
      s = le_num(key[16..31])
      a = 0  # accumulator
      p = (1<<130)-5
      for i=1 upto ceil(msg length in bytes / 16)
         n = le_bytes_to_num(msg[((i-1)*16)..(i*16)] | [0x01])
         a += n
         a = (r * a) % p
      end
      a += s
      return num_to_16_le_bytes(a)
   end

the remainder op can be done using shifting and add and sub iirc, so isn't super slow
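the pseudocode above goes straight into Python (a sketch following RFC 7539 section 2.5; the helper behaviour mirrors the RFC's own pseudocode rather than any particular library):

```python
# poly1305 MAC, directly from the RFC 7539 pseudocode above (sketch).

def clamp(r):
    return r & 0x0ffffffc0ffffffc0ffffffc0fffffff

def poly1305_mac(msg, key):
    r = clamp(int.from_bytes(key[0:16], "little"))
    s = int.from_bytes(key[16:32], "little")
    a = 0                       # the accumulator
    p = (1 << 130) - 5
    for i in range(0, len(msg), 16):
        # append the 0x01 byte to each (up to) 16-byte block
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        a = (r * (a + n)) % p
    a += s
    return (a & ((1 << 128) - 1)).to_bytes(16, "little")
```

this can be checked against the test vector in RFC 7539 section 2.5.2. note the hot loop is one 130ish-bit add, one 130x124-bit mul and one reduction per 16 bytes: all bigint territory.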
(In reply to Jacob Lifshay from comment #5)
> imho we should implement chacha20-poly1305 -- a very commonly used AEAD,
> used by Wireguard and ssh and tls and more. imho we should implement the
> wireguard variant.

it's so simple that there's no point.

> the remainder op can be done using shifting and add and sub iirc,

the entire purpose of this Grant is to show *new* instructions and vector concepts that are entirely general-purpose, and how they reduce instruction count. as these algorithms have been specifically designed using *already existing* general-purpose instructions there is zero benefit to spending grant money or time on them.

by total contrast the bigint ones are an astonishing resounding success: the emergence of sv.adde as a vector-vector add, the 64-bit carry on divmod2deu and madded? so simple and effective.

serpent (or twofish? the one with the Feistel Network), again by contrast, uses an add-with-shift (a + b<<1) which was intended to be added for general-purpose address computations. turns out it saves one instruction. looking for things like that is what this grant is about. there are none - at all - in chacha - that i can see. exactly as you found, it is all about rotate, add, subtract.

btw general-purpose is incredibly important, to avoid classification under most Govts' "Weapons Export" Laws.
correction: i just found something interesting / intriguing:
https://www.oryx-embedded.com/doc/chacha_8c_source.html

   //ChaCha runs 8, 12 or 20 rounds, alternating between column rounds and
   //diagonal rounds
   for(i = 0; i < context->nr; i += 2)
   {
      //The column rounds apply the quarter-round function to the four
      //columns, from left to right
      CHACHA_QUARTER_ROUND(w[0], w[4], w[8], w[12]);
      CHACHA_QUARTER_ROUND(w[1], w[5], w[9], w[13]);
      CHACHA_QUARTER_ROUND(w[2], w[6], w[10], w[14]);
      CHACHA_QUARTER_ROUND(w[3], w[7], w[11], w[15]);

      //The diagonal rounds apply the quarter-round function to the top-left,
      //bottom-right diagonal, followed by the pattern shifted one place to
      //the right, for three more quarter-rounds
      CHACHA_QUARTER_ROUND(w[0], w[5], w[10], w[15]);
      CHACHA_QUARTER_ROUND(w[1], w[6], w[11], w[12]);
      CHACHA_QUARTER_ROUND(w[2], w[7], w[8], w[13]);
      CHACHA_QUARTER_ROUND(w[3], w[4], w[9], w[14]);
   }

that's a rotated matrix multiply that *might* fit with a combination of Matrix REMAP and Indexed REMAP. it would almost certainly have to be Vertical-First Mode as there is a dependency chain on the QUARTERROUND. whilst it would be tempting to make a double copy of w[] it would be much more interesting to see what REMAP can do, here.

   //ChaCha quarter-round function
   #define CHACHA_QUARTER_ROUND(a, b, c, d) \
   { \
      a += b; \
      d ^= a; \
      d = ROL32(d, 16); \
      c += d; \
      b ^= c; \
      b = ROL32(b, 12); \
      a += b; \
      d ^= a; \
      d = ROL32(d, 8); \
      c += d; \
      b ^= c; \
      b = ROL32(b, 7); \
   }

so there are interdependencies created between those, but pipelineable. however they are all in the exact same order, meaning that some REMAP Index offsets (QTY 32 per operand) could be used (because 2 blocks on w[]) with elwidth=8 on Indexed REMAP: that is QTY4 64-bit regs to store 32 8-bit Indices. *BUT*... 5 are needed (4 is a b c d above) because of the magic ROL32 constants. or wait, no, it is Vertical-First: they can be selected using scalar. that's doable. the entire inner loop crashes down to around...
8 instructions which fits into a single Cache-Line. holy cow. yeah that's worth exploring.
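for reference while exploring the REMAP schedules, the double round above in plain Python (a sketch; rol32/quarter_round/double_round are names chosen here, not from the repo, and it can be checked against the quarter-round test vector in RFC 7539 section 2.1.1):

```python
# The column/diagonal double round above, in plain Python (sketch).

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def quarter_round(w, a, b, c, d):
    w[a] = (w[a] + w[b]) & 0xffffffff; w[d] = rol32(w[d] ^ w[a], 16)
    w[c] = (w[c] + w[d]) & 0xffffffff; w[b] = rol32(w[b] ^ w[c], 12)
    w[a] = (w[a] + w[b]) & 0xffffffff; w[d] = rol32(w[d] ^ w[a], 8)
    w[c] = (w[c] + w[d]) & 0xffffffff; w[b] = rol32(w[b] ^ w[c], 7)

def double_round(w):
    for i in range(4):              # column rounds, left to right
        quarter_round(w, i, i + 4, i + 8, i + 12)
    quarter_round(w, 0, 5, 10, 15)  # diagonal rounds
    quarter_round(w, 1, 6, 11, 12)
    quarter_round(w, 2, 7, 8, 13)
    quarter_round(w, 3, 4, 9, 14)
```

the index patterns in double_round are exactly what the REMAP Index tables would have to encode.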
   d = ROL32(d, 16); \
   b = ROL32(b, 12); \
   d = ROL32(d, 8); \
   b = ROL32(b, 7); \

   x = 0x100c0807  # magic constant
   x = ROL32(x, 8) # in loop

couldn't be easier. no need to even ANDi that because the other ROL32s will cut off the top bits automatically. also it is a lot less regs for REMAP Indices, because there are 8 QUARTER rounds: that is one 64-bit reg in elwidth=8, times four, is *only* QTY 4of 64-bit regs.

ah, and there are not 4 separate applications a b c d, there is just "the 16TH_ROUND part" i.e. they all meet this pattern:

   r0 += r1; \
   r2 ^= r0; \
   r2 = ROL32(r2, r3);

that's doable with 4 Indices! holy cow. this entire thing will collapse down to 5 instructions inside a loop: one sv.add, one sv.XOR, one sv.ROL32, svstep, and bc in CTR mode. that's almost obscene and laughable.
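one plausible reading of the magic-constant trick, sketched in Python (an assumption about the intended extraction order, not taken from any repo code): the four shift amounts 16, 12, 8, 7 live one per byte in 0x100c0807, and rotating by 8 at the top of each step brings the next amount into the low byte, where rlwnm's shift field (low 5 bits) would pick it up.

```python
# Sketch of the packed shift-constant idea: 0x100c0807 holds 16, 12, 8, 7
# one per byte; ROL32 by 8 each step exposes the next amount in the low
# byte. rlwnm only reads the low 5 bits, so no ANDI is needed.

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

x = 0x100c0807
shifts = []
for _ in range(4):
    x = rol32(x, 8)          # "in loop": rotate before each use
    shifts.append(x & 0x1f)  # low 5 bits, as rlwnm's shift would see
# shifts comes out as [16, 12, 8, 7] -- the quarter-round rotate amounts
```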
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > imho we should implement chacha20-poly1305 -- a very commonly used AEAD,
> > used by Wireguard and ssh and tls and more. imho we should implement the
> > wireguard variant.
>
> it's so simple that there's no point.

there *is* a point, because demonstrating that svp64+biginteger can make chacha20+poly1305 run much faster is highly significant, because that is a big part of the processing required for wireguard and other protocols. picking random example numbers, it would be hugely significant if you could encrypt/decrypt packets at 2GiB/s rather than 500MiB/s per core.

> there are none - at all - in chacha - that i can see. exactly as you found,
> it is all about rotate, add, subtract.

nearly everything i described in comment #5 was talking about poly1305, not chacha. those are all 320- and 192-bit bigint ops (rounded up to the nearest multiple of 64 bits) that svp64 + bigint definitely accelerates.
(In reply to Jacob Lifshay from comment #9)
> there *is* a point,

it's always a good idea to read ahead, all messages, in full. i already worked that out :)

> run much faster

please understand and accept that the purpose of the exercise, under this Grant, is *not* processor speed, it is instruction count reduction and thus efficiency and power reduction. speed is an arbitrary factor based on a direct near-linear relationship with how much back-end silicon is thrown at the problem, whereas power consumption is not.

> is very highly significant

anyone may throw more silicon down and claim "it's significant". every datasheet for hard macros also contains power consumption figures and these are what is much more critical. SV being completely abstract and an architecturally independent ISA, the only thing we can possibly claim right across the board of all possible microarchitectures is that power consumption is reduced as a direct result of the reduction in instruction count.

> nearly everything i described in comment #5 was talking about poly1305, not
> chacha. those are all 320 and 192-bit bigint ops (rounded up to nearest
> multiple of 64-bits) that svp64 + bigint definitely accelerates.

ok. i needed to know that. i have (had) no idea what poly1305 is, and it would be helpful to have crossreferences (comment 0) to reference code, and so on, yeh?
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #9)
>
> > there *is* a point,
>
> it's always a good idea to read ahead, all messages, in full.
> i already worked that out :)

i'd read all messages, my point was different than the one you arrived at...

> > run much faster
>
> please understand and accept that the purpose of the exercise,
> under this Grant, is *not* processor speed, it is instruction
> count reduction and thus efficiency and power reduction.

i assumed that you would infer I meant using instruction count and probable implementation strategy as a proxy for speed.

> speed is an arbitrary factor based on a direct near-linear
> relationship

it's linear only at the low end, it gets very non-linear very quickly, e.g. going from 4-wide to 8-wide issue doesn't give a 2x speedup, it's closer to somewhere around 1.2x. also, serial dependencies usually prevent wider silicon from giving much speedup, if at all, so that's a very good reason to work on reducing instruction count and/or switch to instructions known to be easily implementable with much higher performance, such as sv.maddedu rather than a complex loop of adde/maddld/maddhdu instructions.

> with how much back-end silicon is thrown at the
> problem, whereas power consumption is not.
>
> > is very highly significant
>
> anyone may throw more silicon down and claim "it's significant".
> every datasheet for hard macros also contains power consumption
> figures and these are what is much more critical.
> SV being completely abstract and an architecturally independent
> ISA the only thing we can possibly claim right across the board

then don't claim right across the board. maybe claim that with the appropriate microarchitecture (128x64->192-bit wide-mul for sv.maddedu rather than u64x2 simd mul, or even more if working with 256-bit simd) it gets a huge speedup, much more than spending the same silicon on additional scalar performance (you'd definitely hit diminishing returns by that point, you'd only have enough additional for maybe 1 alu where you likely already have several). maybe claim that it doesn't have terrible performance issues for smaller microarchitectures if svp64 is implemented in hardware (e.g. excluding trap & emulate).
(In reply to Luke Kenneth Casson Leighton from comment #10) > and it would be helpful to have crossreferences (comment 0) to reference > code, and so on, yeh? added
(In reply to Jacob Lifshay from comment #11)
> i assumed that you would infer I meant using instruction count and probable
> implementation strategy as a proxy for speed.

no, as it is industry-standard practice to differentiate explicitly between performance (usually Dhrystone MIPS) and performance/watt.

> > SV being completely abstract and an architecturally independent
> > ISA the only thing we can possibly claim right across the board
>
> then don't claim right across the board,

i had to do the analysis for the Due Diligence, so i know that it is.

> have several). maybe claim that it doesn't have terrible performance issues
> for smaller microarchitectures if svp64 is implemented in hardware (e.g.
> excluding trap & emulate)

these still reduce instruction count, and performance is the least concern, which is precisely why i made it clear that it is imperative *not* to focus on performance. also the implementor is free and clear to implement precisely and exactly the hardware desired for the specific (sole) application for which that silicon is exclusively dedicated. if they fail to add the required instructions and damage performance as a direct result, that is their problem to fix.

right across *all* of these options, reduction in instruction count is the sole constant that reduces power consumption (except where implementors fail to do a decent job, which is, again, out of scope).

summary: performance claims, it is imperative that we make none WHATSOEVER. the only claim we can reasonably make is "performance/watt reduction", and we must steer well clear of stating what that is, because it is too implementation-specific. we have to *have* an implementation in order to measure it, and we are about 2 years away from that.

so please: *do not* make performance claims, do not make proposals containing performance claims, do not make presentations showing performance claims, do not publicly state anywhere in any discussions that performance is claimed to be greater.
you will be required to provide proof, and get us into trouble for being unable to back up the claim, or, worse, commit us to needing to provide proof of that claim, when we have zero funding to be able to do so.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> r0 += r1; \
> r2 ^= r0; \
> r2 = ROL32(r2, r3);

these are (using reg nums to indicate which REMAP source is needed):

   add   r0, r0, r1          # RT, RA, RB
   xor   r2, r2, r0          # RT, RA, RB
   rlwnm r2, r2, r3, 0, 31   # rlwnm RA,RS,RB,MB,ME (Rc=0)

however they are all overlapping regs with the exception of RS, meaning a sequence of svremaps is needed before each operation. slightly annoying but doable. example: RT and RA are REMAP0 in the add, but RT must be REMAP2 in the xor. by turning things around they can line up, reducing the number of svremaps:

   svremap RT=0,RA=1,RB=0
   add   r0, r1, r0          # RT, RA, RB
   svremap RT=2,RA=2,RB=0    # RB stays = 0
   xor   r2, r2, r0          # RT, RA, RB
   svremap RS=2,RA=2,RB=3    # RA stays = 2
   rlwnm r2, r2, r3, 0, 31   # rlwnm RA,RS,RB,MB,ME (Rc=0)

conclusion: hmmm, likely just going to have to go with one svremap in front of every instruction, oh well.
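to sanity-check the index tables before writing the svremap sequences, the whole double round can be modelled in plain Python as exactly three "instructions" driven by index lists (a sketch of the idea only, not real SVP64 semantics; all names here are invented):

```python
# Plain-Python model of the add/xor/rlwnm triplet driven by index tables
# (a sketch of the Indexed-REMAP idea, not actual SVP64 behaviour).

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def indexed_rounds(w, qrounds):
    # expand each quarter-round (a, b, c, d) into four (r0, r1, r2, sh)
    # triplets, all matching: r0 += r1; r2 ^= r0; r2 = ROL32(r2, sh)
    idx = []
    for a, b, c, d in qrounds:
        idx += [(a, b, d, 16), (c, d, b, 12), (a, b, d, 8), (c, d, b, 7)]
    for t0, t1, t2, sh in idx:
        w[t0] = (w[t0] + w[t1]) & 0xffffffff  # the sv.add, via indices
        w[t2] = rol32(w[t2] ^ w[t0], sh)      # the sv.xor + sv.rlwnm

def double_round(w):
    indexed_rounds(w, [(0, 4, 8, 12), (1, 5, 9, 13),    # column rounds
                       (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12),  # diagonal rounds
                       (2, 7, 8, 13), (3, 4, 9, 14)])
```

the three statements in the inner loop correspond to the one sv.add, one sv.xor and one sv.rlwnm, with the idx list playing the role of the REMAP Index registers.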
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;h=f4d15884e57d22c0c585f0fcbbb018b90baa2d9c;hb=4f7218a8e3579cbb96fe254495033765f4d772ab#l536 principle for chacha20 round works great. 2nd loop using CTR should work perfectly.
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;hb=HEAD 2nd CTR-based loop works perfectly.