a vector-efficient implementation of chacha20 is needed:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;hb=HEAD
getting very interesting. svindex is successful in doing the inner rounds, with svstep for the round inner loop and CTR mode for the outer.

there is however an opportunity to reorder the access to elements so that the parallelism originally intended for chacha20 by Bernstein becomes possible, and it involves 3D REMAP. an attempt to deploy only 2D REMAP was not successful, because the Indices are set up to cover both round-groups (the straight group of 16 followed by the rotated group):

    CHACHA_QUARTER_ROUND(w[0], w[4], w[8], w[12]);
    CHACHA_QUARTER_ROUND(w[1], w[5], w[9], w[13]);
    CHACHA_QUARTER_ROUND(w[2], w[6], w[10], w[14]);
    CHACHA_QUARTER_ROUND(w[3], w[7], w[11], w[15]);
    CHACHA_QUARTER_ROUND(w[0], w[5], w[10], w[15]);
    CHACHA_QUARTER_ROUND(w[1], w[6], w[11], w[12]);
    CHACHA_QUARTER_ROUND(w[2], w[7], w[8], w[13]);
    CHACHA_QUARTER_ROUND(w[3], w[4], w[9], w[14]);

the ordering needed was initially believed to be 2D: cycling through by row (y) before moving to the next column (x). unfortunately it is necessary to stop half-way down the rows (y=0-3) before moving on to the next column (x+=1), then, after all columns are done, repeat the process with the 2nd group (y=4-7).

this is perfectly possible but requires a 3D version of svindex and svshape2, which is too much.
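
to make the difference between the two orderings concrete, here is a minimal python sketch. it is only a model of the traversal order, not the REMAP hardware schedule or the code in the linked test. it assumes the eight quarter-rounds form the rows (y) and the four operand positions a, b, c, d form the columns (x). the names QR_INDICES, order_2d, order_3d, state_indices and groups_are_parallel are hypothetical, chosen for illustration.

```python
# the eight quarter-rounds of one double round, as (a, b, c, d) state indices.
# assumption: rows are y=0-7, columns are the operand positions x=0-3.
QR_INDICES = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),  # straight, y=0-3
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),  # rotated, y=4-7
]


def order_2d(rows=8, cols=4):
    """plain 2D walk: every row (y) of a column, then the next column (x)."""
    return [(y, x) for x in range(cols) for y in range(rows)]


def order_3d(groups=2, cols=4, rows_per_group=4):
    """3D walk: half the rows, then the next column, then the 2nd group."""
    return [(g * rows_per_group + y, x)
            for g in range(groups)
            for x in range(cols)
            for y in range(rows_per_group)]


def state_indices(order):
    """map (y, x) table positions to the chacha20 state word they select."""
    return [QR_INDICES[y][x] for (y, x) in order]


def groups_are_parallel():
    """each group of four quarter-rounds touches all 16 words exactly once."""
    for g in range(2):
        used = [i for qr in QR_INDICES[4 * g:4 * g + 4] for i in qr]
        if sorted(used) != list(range(16)):
            return False
    return True


if __name__ == "__main__":
    print("2D:", state_indices(order_2d())[:8])  # mixes both groups
    print("3D:", state_indices(order_3d())[:8])  # stays inside the first group
    print("groups independent:", groups_are_parallel())
```

in this model the 2D walk starts with 0, 1, 2, 3, 0, 1, 2, 3, so the rotated group is reached before the straight group has finished. the 3D walk starts with 0, 1, 2, 3, 4, 5, 6, 7 and stays inside the straight group for its first 16 elements. since the four quarter-rounds of a group touch disjoint state words, that group can be processed as one vector.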