todo list:

* DONE: resolve difficulty-of-use issues with dsld/dsrd (need splat to RT
  vector or copy of input, as well as a separate scalar shift for the MSB/LSB
  word, for bigint shift left/right/arithmetic-right) -- something more like
  divmod2du/maddedu with an explicit carrying input/output will likely work
  much better, though it requires 4-arg instructions again...
* TODO: figure out how we want to quickly shift by a dynamic number of whole
  elements; svindex is too resource-hungry.
  https://bugs.libre-soc.org/show_bug.cgi?id=937#c23

untested code for 3-in 1-out unsigned-shift-by-signed instructions (dshlsd
and dshrsd in the code; we'll also want a 3-in 1-out signed shift for
handling sign extension at the MSB end of some bigint formats, and for
signed 128-bit integers)

def u64(v):
    """convert to u64"""
    return int(v) % 2 ** 64

def i64(v):
    """convert to i64"""
    v = u64(v)
    if v >> 63:
        v -= 2 ** 64
    return v

def dshlsd(RA, RB, RC):
    """double-width unsigned-shift-left signed-shift dword"""
    RA = u64(RA)
    RB = u64(RB)
    RC = i64(RC)
    v = (RA << 64) | RB
    if RC <= -64 or RC >= 128:
        return 0
    elif RC < 0:
        v >>= -RC
    else:
        v <<= RC
    return u64(v >> 64)  # return high 64-bits

def dshrsd(RA, RB, RC):
    """double-width unsigned-shift-right signed-shift dword"""
    RA = u64(RA)
    RB = u64(RB)
    RC = i64(RC)
    v = (RA << 64) | RB
    if RC <= -64 or RC >= 128:
        return 0
    elif RC < 0:
        v <<= -RC
    else:
        v >>= RC
    return u64(v)  # return low 64-bits

def bigint_shl(inp, out, shift):
    # note: out-of-range inp indexes are assumed to read as zero;
    # Python's negative indexing would silently wrap around instead
    if shift < 0:
        return bigint_shr(inp, out, -shift)
    offset = shift // 64
    shift %= 64
    for i in range(len(out)):
        out[i] = dshlsd(inp[i - offset], inp[i - offset - 1], shift)

def bigint_shr(inp, out, shift):
    # note: out-of-range inp indexes are assumed to read as zero
    if shift < 0:
        return bigint_shl(inp, out, -shift)
    offset = shift // 64
    shift %= 64
    for i in range(len(out)):
        out[i] = dshrsd(inp[i + offset + 1], inp[i + offset], shift)

def pcenc(symbols, start, symbol_table):
    """ prefix-codes encode
    symbols: list[int]
        symbol indexes
    start: int
        LSB0 bit-index to start encoding at, must be < 64
    symbol_table: list[tuple[int, int]]
        pairs of bits and bit lengths for symbols
    Returns: tuple[list[int], int]
        the output 64-bit words, and the LSB0 bit-index in the last
        word where trailing zero-padding starts.
    """
    out = [0]
    while len(symbols):
        for i in range(len(symbols)):
            # gather-load and shifts and ands to extract fields:
            sym_bits, sym_len = symbol_table[symbols[i]]
            # prefix-sum:
            cur_start = start
            start += sym_len
            # shift then or-reduce:
            out[-1] |= dshlsd(sym_bits, 0, cur_start)
            # data-dependent fail-first:
            if start >= 64:
                # part of outer loop
                carry = dshlsd(0, sym_bits, cur_start)
                start -= 64
                out.append(carry)
                symbols = symbols[i + 1:]
                break
        else:
            # no word boundary crossed: all remaining symbols consumed
            break
    return out, start
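since the code above is explicitly marked untested, here is a self-contained
sketch of the double-width shift helpers and the bigint shifts, checked
against plain Python big-integer arithmetic. two things are my own
assumptions: only the unsigned 0..63 shift case (the one bigint_shl/bigint_shr
actually use) is modelled, and an explicit `word()` accessor makes
out-of-range word indexes read as zero rather than wrapping like Python list
indexing does:

```python
def u64(v):
    """truncate to an unsigned 64-bit value"""
    return int(v) % 2 ** 64

def dshlsd(RA, RB, RC):
    """high 64 bits of the 128-bit value (RA:RB) shifted left by RC (0..63)"""
    return u64(((u64(RA) << 64) | u64(RB)) << RC >> 64)

def dshrsd(RA, RB, RC):
    """low 64 bits of the 128-bit value (RA:RB) shifted right by RC (0..63)"""
    return u64(((u64(RA) << 64) | u64(RB)) >> RC)

def word(inp, i):
    """read word i of a little-endian bigint; out-of-range reads as zero"""
    return inp[i] if 0 <= i < len(inp) else 0

def bigint_shl(inp, out, shift):
    offset, shift = shift // 64, shift % 64
    for i in range(len(out)):
        out[i] = dshlsd(word(inp, i - offset), word(inp, i - offset - 1), shift)

def bigint_shr(inp, out, shift):
    offset, shift = shift // 64, shift % 64
    for i in range(len(out)):
        out[i] = dshrsd(word(inp, i + offset + 1), word(inp, i + offset), shift)

# check against Python's arbitrary-precision shifts, crossing a word boundary
inp = [0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x0F1E2D3C4B5A6978]
value = sum(w << (64 * i) for i, w in enumerate(inp))
out_shl = [0] * 3
bigint_shl(inp, out_shl, 70)
out_shr = [0] * 3
bigint_shr(inp, out_shr, 70)
assert out_shl == [(value << 70) >> (64 * i) & (2 ** 64 - 1) for i in range(3)]
assert out_shr == [(value >> 70) >> (64 * i) & (2 ** 64 - 1) for i in range(3)]
```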

with the shift-up placing zeros into the LSBs if RC>0,
can you explain clearly why the following alteration
is not correct for pcdec?

    elif RC < 0:
        v <<= -RC
        return u64(v >> 64)  # return hi 64-bits
    else:
        v >>= RC
        return u64(v)  # return low 64-bits

this will be fun:

    for i in range(len(out)):
        out[i] = dshrsd(inp[i + offset + 1], inp[i + offset], shift)

it will need to be an overwrite (RT-as-source) in order to get it into
RM-1P-2S1D as "dshrsd RT,RA,RB" so that EXTRA3 may be used. as 3-in 1-out
RT,RA,RB,RC there is no room for EXTRA3: it would be EXTRA2, which cannot
handle odd vector regnums except through svoffset.

the only thing is, reverse-gear is needed on dshlsd to avoid overwrite,
which is fine. RT,RA,RB would be X-Form, which would reduce opcode pressure
greatly.

prefix-sum btw is doable with mapreduce using the right offsets of RT, RA
and RB (1 different), probably RT=RA+1.

(In reply to Luke Kenneth Casson Leighton from comment #1)
> with the shift-up placing zeros into the LSBs if RC>0
> can you explain clearly why the following alteration
> is not correct for pcdec?

assuming you mean pcenc

> elif RC < 0:
>     v <<= -RC
>     return u64(v>>64) # return hi 64-bits
> else:
>     v >>= RC
>     return u64(v) # return low 64-bits

it turns out that signed shift amounts aren't actually necessary for the
pcenc algorithm as written above. i thought they were necessary because I
was thinking of an alternative algorithm that i never got fully working.

> it will need to be an overwrite (RT-as-source) in order to get it into
> RM-1P-2S1D as dshrsd RT,RA,RB so that EXTRA3 may be used. as 3-in 1-out
> RT,RA,RB,RC there is no room for EXTRA3 it would be EXTRA2 which cannot
> handle odd vector regnums except through svoffset

as mentioned on irc:
https://libre-soc.org/irclog/%23libre-soc.2022-09-25.log.html#t2022-09-25T23:24:46
imho we should have a few variants if we need overwrite, since that makes it
more flexible:

> yes, it can be an overwrite variant, imho if we do that we should provide
> several variants for each input we overwrite: e.g. RT = op(RT, RA, RB),
> RT = op(RA, RT, RB), RT = op(RA, RB, RT), RT = op(0, RA, RB), RT = op(RA, 0, RB)

also on irc:

> for pcenc it has to reduce into several dynamically-determined outputs, so
> just a traditional mapreduce won't work

After writing that, I thought of an algorithm where we could use 3
prefix-sums for the whole function, rather than a separate reduction for
every output dword: (again, code not tested)

def pcenc(symbols, symbol_table):
    """ prefix-codes encode
    symbols: list[int]
        symbol indexes
    symbol_table: list[tuple[int, int]]
        pairs of bits and bit lengths for symbols
    Returns: tuple[list[int], int]
        the output 64-bit words, and the total number of output bits.
    """
    # gather-load and shifts and ands to extract fields:
    sym_bits = [0] * len(symbols)
    sym_len = [0] * len(symbols)
    for i in range(len(symbols)):
        sym_bits[i], sym_len[i] = symbol_table[symbols[i]]
    # prefix-sum:
    start = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        start[i + 1] = start[i] + sym_len[i]
    # shift:
    shifted0 = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        shifted0[i + 1] = u64(sym_bits[i] << (start[i] % 64))
    shifted1 = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        shifted1[i + 1] = (sym_bits[i] << (start[i] % 64)) >> 64
    # for pedagogical purposes, not needed in final algorithm:
    # orig_shifted0 = shifted0.copy()
    # orig_shifted1 = shifted1.copy()

    # xor prefix-sum (can't use bitwise-or because it's not invertible):
    for i in range(len(symbols)):
        shifted0[i + 1] ^= shifted0[i]
    for i in range(len(symbols)):
        shifted1[i + 1] ^= shifted1[i]
    # scatter or twin-pred:
    out_to_in_map = [0] * (start[-1] // 64 + 1)
    for i in range(len(start)):
        out_to_in_map[start[i] // 64] = i
    # gather and xor:
    out = [0] * len(out_to_in_map)
    for i in range(len(out_to_in_map) - 1):
        # extract a subsequence of orig_shifted0 by xor-ing the
        # start/end of the subsequence from the xor prefix-summed sequence
        out[i] = shifted0[out_to_in_map[i]]
        out[i] ^= shifted0[out_to_in_map[i + 1]]
    for i in range(1, len(out_to_in_map)):
        # extract a subsequence of orig_shifted1 by xor-ing the
        # start/end of the subsequence from the xor prefix-summed sequence
        out[i] ^= shifted1[out_to_in_map[i - 1]]
        out[i] ^= shifted1[out_to_in_map[i]]
    return out, start[-1]
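since the code above is explicitly untested, here is a self-contained sketch
of the same 3-prefix-sum idea that does pass a check against a naive
bit-concatenation encoder. two details are my own adjustments and may differ
from the intent above: the scatter keeps the *first* symbol index landing in
each output word (done here by running the loop in reverse), and a sentinel
entry of `len(symbols)` closes the shifted0 range of the final word. this
assumes every symbol length is <= 64, so every output word contains at least
one symbol start:

```python
def u64(v):
    return int(v) % 2 ** 64

def pcenc_prefix_sum(symbols, symbol_table):
    """prefix-code encode via 3 prefix-sums; symbol lengths must be <= 64"""
    n = len(symbols)
    sym_bits = [symbol_table[s][0] for s in symbols]
    sym_len = [symbol_table[s][1] for s in symbols]
    # prefix-sum of lengths gives each symbol's start bit
    start = [0] * (n + 1)
    for i in range(n):
        start[i + 1] = start[i] + sym_len[i]
    # shift each symbol into position: low word plus carry word
    shifted0 = [0] * (n + 1)
    shifted1 = [0] * (n + 1)
    for i in range(n):
        shifted0[i + 1] = u64(sym_bits[i] << (start[i] % 64))
        shifted1[i + 1] = (sym_bits[i] << (start[i] % 64)) >> 64
    # xor prefix-sums (xor, not or, because xor is invertible)
    for i in range(n):
        shifted0[i + 1] ^= shifted0[i]
        shifted1[i + 1] ^= shifted1[i]
    nwords = start[-1] // 64 + 1
    # first symbol index whose start falls in each output word; the
    # reversed loop makes the scatter first-wins, and index nwords
    # holds the sentinel n that closes the last word's range
    out_to_in_map = [n] * (nwords + 1)
    for i in reversed(range(n + 1)):
        out_to_in_map[start[i] // 64] = i
    out = [0] * nwords
    for w in range(nwords):
        # xor of shifted0 for all symbols starting in word w
        out[w] = shifted0[out_to_in_map[w]] ^ shifted0[out_to_in_map[w + 1]]
    for w in range(1, nwords):
        # xor of carries from all symbols starting in word w - 1
        out[w] ^= shifted1[out_to_in_map[w - 1]] ^ shifted1[out_to_in_map[w]]
    return out, start[-1]

def pcenc_ref(symbols, symbol_table):
    """naive LSB-first bit-concatenation reference encoder"""
    val, total = 0, 0
    for s in symbols:
        bits, length = symbol_table[s]
        val |= bits << total
        total += length
    return [u64(val >> (64 * i)) for i in range(total // 64 + 1)], total

# symbols crossing two word boundaries (130 bits total)
table = [(0b1, 1), (0x7FFFFFFFFFFFFFFF, 63), (0b101, 3)]
assert pcenc_prefix_sum([0, 1, 2, 1], table) == pcenc_ref([0, 1, 2, 1], table)
assert pcenc_prefix_sum([2, 2, 2], table) == pcenc_ref([2, 2, 2], table)
```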

(In reply to Jacob Lifshay from comment #2)
> > it will need to be an overwrite (RT-as-source) in order to get it into
> > RM-1P-2S1D as dshrsd RT,RA,RB so that EXTRA3 may be used. as 3-in 1-out
> > RT,RA,RB,RC there is no room for EXTRA3 it would be EXTRA2 which cannot
> > handle odd vector regnums except through svoffset
>
> as mentioned on irc:
> https://libre-soc.org/irclog/%23libre-soc.2022-09-25.log.html#t2022-09-25T23:
> 24:46
> imho we should have a few variants if we need overwrite, since that makes it
> more flexible:
> > yes, it can be an overwrite variant, imho if we do that we should provide
> > several variants for each input we overwrite: e.g. RT = op(RT, RA, RB),
> > RT= op(RA, RT, RB), RT = op(RA, RB, RT), RT=op(0, RA, RB), RT=op(RA, 0, RB)

If we change the shifts to shift mod XLEN (specifically XLEN, not XLEN * 2)
instead of being signed shifts, then we only need 8 ops total. This is like
all other PowerISA shifts. shifting by >= XLEN is unnecessary for the pcenc
algorithm in comment #2 and for bigint shift:

def shl_op(a, b, c):
    # just like x86 shld for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v <<= c % 64
    return u64(v >> 64)

def shr_op(a, b, c):
    # just like x86 shrd for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v >>= c % 64
    return u64(v)

* RT = shl_op(RT, RA, RB)
* RT = shl_op(RA, RT, RB)
* RT = shl_op(RA, RB, RT)
* RT = shl_op(0, RA, RB)
* RT = shr_op(RT, RA, RB)
* RT = shr_op(RA, RT, RB)
* RT = shr_op(RA, RB, RT)
* RT = shr_op(RA, 0, RB)

Unnecessary ops:
* RT = shl_op(RA, 0, RB) is just plain shl
* RT = shr_op(0, RA, RB) is just plain shr
* because the shift amount is never >= 64, sign/zero extension doesn't
  matter since none of the output bits would differ, hence no
  signed-right-shift is needed.
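a quick sanity check (my own, not from the thread) that these mod-64
definitions really do behave like x86 shld/shrd, comparing against the
directly-composed 64-bit halves for a range of shift amounts including ones
that wrap at 64:

```python
def u64(v):
    return int(v) % 2 ** 64

def shl_op(a, b, c):
    # just like x86 shld for 64-bit values: high half of (a:b) << (c % 64)
    v = (u64(a) << 64) | u64(b)
    v <<= c % 64
    return u64(v >> 64)

def shr_op(a, b, c):
    # just like x86 shrd for 64-bit values: low half of (a:b) >> (c % 64)
    v = (u64(a) << 64) | u64(b)
    v >>= c % 64
    return u64(v)

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
for c in (0, 1, 33, 63, 64, 127):
    s = c % 64
    if s == 0:
        hi, lo = a, b  # shift amount wraps to zero, like other PowerISA shifts
    else:
        hi = u64(a << s) | (b >> (64 - s))
        lo = (b >> s) | u64(a << (64 - s))
    assert shl_op(a, b, c) == hi
    assert shr_op(a, b, c) == lo
```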

(In reply to Jacob Lifshay from comment #2)
> # shift:
> shifted0 = [0] * (len(symbols) + 1)
> for i in range(len(symbols)):
>     shifted0[i + 1] = u64(sym_bits[i] << (start[i] % 64))
> shifted1 = [0] * (len(symbols) + 1)
> for i in range(len(symbols)):
>     shifted1[i + 1] = (sym_bits[i] << (start[i] % 64)) >> 64

doubleshifts.

> # for pedagogical purposes, not needed in final algorithm:
> # orig_shifted0 = shifted0.copy()
> # orig_shifted1 = shifted1.copy()
>
> # xor prefix-sum (can't use bitwise-or because it's not invertible):
> for i in range(len(symbols)):
>     shifted0[i + 1] ^= shifted0[i]
> for i in range(len(symbols)):
>     shifted1[i + 1] ^= shifted1[i]

interesting. sorta making sense.

> # scatter or twin-pred:
> out_to_in_map = [0] * (start[-1] // 64 + 1)
> for i in range(len(start)):
>     out_to_in_map[start[i] // 64] = i

sv.svstep gets indices: grep the examples for "iota". binary incrementing
numbers, turning into a unary bitmask? interesting. oh - easy: shift then
or-reduction (again). failfirst-cmpi truncates VL to ensure start[i] is no
greater than the vector length.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=10ddc342b1fc075712668bb72463ec811d121949

for review (feel free to edit/fix): you can see what i attempted to do, use
a 7-bit shift and, if greater than 64, start zeroing out. this may not be
sophisticated enough and/or may conflict with the modulo-64 ideas. i leave
it with you to best resolve.

(In reply to Luke Kenneth Casson Leighton from comment #5)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=10ddc342b1fc075712668bb72463ec811d121949
>
> for review (feel free to edit/fix), you can see what i attempted
> to do, use a 7-bit shift and if greater than 64 start zeroing out.
> this may not be sophisticated enough and/or conflict with the
> modulo 64 ideas, i leave it with you to best resolve?

it does conflict with the modulo-64 idea.

one benefit of the modulo-64 idea is we don't need a 128-bit rotator: 64-bit
will do afaict -- testing needed.

for the RT <- ((RA || RB) << ((RT) % 64)) >> 64 variant:

    n <- (RT)[58:63]
    mask[0:63] <- MASK(0, 63 - n)
    # mux
    v[0:63] <- ((RA) & mask) | ((RB) & ~mask)
    RT <- ROTL64(v, n)

(In reply to Jacob Lifshay from comment #6)
> one benefit of the modulo-64 idea is we don't need a 128-bit rotator, 64-bit
> will do afaict -- testing needed:
>
> for the RT <- ((RA || RB) << ((RT) % 64)) >> 64 variant:

tested python version:

def rotl64(a, b):
    a %= 2 ** 64
    return ((a << b % 64) | (a >> -b % 64)) % 2 ** 64

def dshl(a, b, n):
    mask = (1 << (64 - n % 64)) - 1
    v = (a & mask) | (b & ~mask)
    return rotl64(v, n)
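to confirm the mask-then-rotate trick really matches the double-width shift
it replaces, a brute-force comparison (my own check, assuming a and b are
already 64-bit values) against ((a || b) << n) >> 64 for every shift amount:

```python
def rotl64(a, b):
    a %= 2 ** 64
    return ((a << b % 64) | (a >> -b % 64)) % 2 ** 64

def dshl(a, b, n):
    # mask keeps the low 64-n bits of a and the high n bits of b;
    # a single 64-bit rotate then puts both parts in their final place
    mask = (1 << (64 - n % 64)) - 1
    v = (a & mask) | (b & ~mask)
    return rotl64(v, n)

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
for n in range(64):
    expected = ((((a << 64) | b) << n) >> 64) % 2 ** 64
    assert dshl(a, b, n) == expected
```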

(In reply to Jacob Lifshay from comment #7)
> def dshl(a, b, n):
>     mask = (1 << (64 - n % 64)) - 1
>     v = (a & mask) | (b & ~mask)
>     return rotl64(v, n)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0a43abe68960962f419b68d0357ad4b4274a9b74

muxes for selecting sources is next

I fleshed out dsld/dsrd:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a8332a548969f36d6e747c1b75735bc13194f8a9

and added unit tests for all bigint operations:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=63530e061b28e868f038c3a5515db50c3fb2b9c8

I added dsld/dsrd to minor_31.csv, but ran into issues when I attempted to
convert it to pattern mode -- apparently power_decoder.py suffix handling is
broken for patterns:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=864d526726aa2ed5497ca6ebe4445a021e16ef9a

(edit: fix typo -- RT input was given twice for ds[lr]ds)

(In reply to Luke Kenneth Casson Leighton from comment #8)
> muxes for selecting sources is next

I know I put the input muxing in the pseudocode for dsld/dsrd since it
seemed like that's what you were planning, but imho it would be better to
have 4 different instruction mnemonics rather than an additional argument.

suggested renames: add h/l/s/z to the mnemonic instead of a 0/1/2/3 argument:

dsld RT, RA, RB, 0 -> dsldh RT, RA, RB -- h for RT being hi input
    RT = dsld(RT, RA, RB)
dsld RT, RA, RB, 1 -> dsldl RT, RA, RB -- l for RT being lo input
    RT = dsld(RA, RT, RB)
dsld RT, RA, RB, 2 -> dslds RT, RA, RB -- s for RT being shift input
    RT = dsld(RA, RB, RT)
dsld RT, RA, RB, 3 -> dsldz RT, RA, RB -- z for zero input
    RT = dsld(0, RA, RB)
dsrd RT, RA, RB, 0 -> dsrdh RT, RA, RB -- h for RT being hi input
    RT = dsrd(RT, RA, RB)
dsrd RT, RA, RB, 1 -> dsrdl RT, RA, RB -- l for RT being lo input
    RT = dsrd(RA, RT, RB)
dsrd RT, RA, RB, 2 -> dsrds RT, RA, RB -- s for RT being shift input
    RT = dsrd(RA, RB, RT)
dsrd RT, RA, RB, 3 -> dsrdz RT, RA, RB -- z for zero input
    RT = dsrd(RA, 0, RB)

(In reply to Jacob Lifshay from comment #9)
> I fleshed out dsld/dsrd:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=a8332a548969f36d6e747c1b75735bc13194f8a9
>
> and added unit tests for all bigint operations:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=63530e061b28e868f038c3a5515db50c3fb2b9c8

fantastic!

> I added dsld/dsrd to minor_31.csv, but ran into issues when I attempted to
> convert it to pattern mode

yeah, don't do that for now.

(In reply to Jacob Lifshay from comment #10)
> I know I put the input muxing in the pseudocode for dsld/dsrd since it
> seemed like that's what you were planning, but imho it would be better to
> have 4 different instruction mnemonics rather than an additional argument:

that means proposing 8 instructions, not 2. too many. fpgpr needs to be
similarly drastically reduced. a list of advised pseudoaliases instead.

suggest pseudoalias u for upper: h is confused with halfword.

(In reply to Jacob Lifshay from comment #0)
> todo list:
> * TODO(programmerjake): resolve difficulty of use issues with dsld/dsrd
> (need splat to RT vector or copy of input as well as a separate scalar shift
> for MSB/LSB word for bigint shift left/right/arithmetic-right) -- something
> more like divmod2du/maddedu with an explicit carrying input/output will
> likely work much better, though requires 4-arg instructions again...

rright. okay. table 13 p1358, Power ISA v3.1: i didn't realise there are
another 3 columns available (111000, 111001, 111010), as well as four
columns back on p1353 (010000-010011).

therefore yes, it's likely ok to use the 111000-111010 group for shift,
making them 4-arg.

(In reply to Luke Kenneth Casson Leighton from comment #12)
> rright. okay. table 13 p1358, Power ISA v3.1, i didn't realise there are
> another 3 columns available (111000, 111001, 111010) as well as four columns
> back on p1353 (010000-010011)
>
> therefore yes it's likely ok to use the 111000-111010 group for shift, making
> them 4-arg.

well, that went badly. Vector sv.dsld is required to have the HI-LO source
vector offset by one, pointing to the same vector (+1). EXTRA2 is incapable
of that.

jacob, i have no idea what you need different behaviour in dsld/dsrd for, so
i can't begin to go through some options. what's needed, here?

> # shift then or-reduce:
> out[-1] |= dshlsd(sym_bits[i], 0, cur_start)

if the intention here is to perform a type of vector-bitmask-merge-to-scalar
(or the inverse, extract-from-scalar-to-vector), i don't have a problem with
creating a special instruction for that, although i feel it is important to
try *really hard* to find an existing instruction which might do the job.

the general idea being to have a vector of (offset, length)s to either
extract (or insert) bits (fields). unfortunately i have a sneaking suspicion
that VSX has something like this.

nope.

if i understand what you want to do, it is a *4-in 1-out* operation, which
is far too much. a shift-mask-vector followed by OR-reduction would do the
trick, although the shift-mask-vector ideally needs to be a 3-in 1-out
non-immediate version of rldcl:

VA2-Form

* rldcl2 RA,RS,RB,RC (Rc=0)
* rldcl2. RA,RS,RB,RC (Rc=1)

Pseudo-code:

    n <- (RB)[XLEN-6:XLEN-1]
    r <- ROTL64((RS), n)
    b <- (RC)[XLEN-6:XLEN-1]
    m <- MASK(b, (XLEN-1))
    RA <- r & m

or probably better:

    m <- MASK(0, b)
    RA <- r & m

this does interestingly fall into the bit-extract category, which i added to
bitmanip about... 5 months ago?
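a hedged Python model of the rldcl2 pseudocode above (my own translation,
assuming XLEN=64 and PowerISA MSB0 bit numbering for MASK, with the
`MASK(b, XLEN-1)` variant), plus the bit-field-extract use it targets --
`rotate left by 64-offset, then mask to the field width`:

```python
def rotl64(v, n):
    v %= 2 ** 64
    return ((v << n % 64) | (v >> -n % 64)) % 2 ** 64

def mask_msb0(x, y):
    """PowerISA MASK(x, y) for XLEN=64: ones from MSB0 bit x to MSB0 bit y"""
    # MSB0 bit i corresponds to LSB0 bit 63 - i
    return ((1 << (y - x + 1)) - 1) << (63 - y)

def rldcl2(RS, RB, RC):
    n = RB % 64                # rotate amount from (RB)[XLEN-6:XLEN-1]
    r = rotl64(RS, n)
    b = RC % 64                # mask begin from (RC)[XLEN-6:XLEN-1]
    m = mask_msb0(b, 63)       # keeps the low 64 - b bits, LSB0 terms
    return r & m

# extract a 12-bit field at LSB0 offset 8: rotate the field down to bit 0,
# then keep 12 bits (b = 64 - 12); values here are purely illustrative
RS = 0xFEDCBA9876543210
field = rldcl2(RS, (64 - 8) % 64, 64 - 12)
assert field == (RS >> 8) & 0xFFF
```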

http://lists.libre-soc.org/pipermail/libre-soc-dev/2022-October/005411.html

the carry-chaining version could be (there are a few alternative
definitions):

dsld RT, RA, RB, RC

    v = RA
    sh = RB % 64
    v <<= sh
    mask = (1 << sh) - 1
    v |= RC & mask
    RT = v & (2 ** 64 - 1)
    RS = v >> 64

dsrd RT, RA, RB, RC

    v = RA << 64
    sh = RB % 64
    v >>= sh
    RS = v & (2 ** 64 - 1)
    mask = ~((2 ** 64 - 1) >> sh)
    v >>= 64
    v |= RC & mask
    RT = v

dsld

    v <- [0]*XLEN || (RA)
    sh <- (RB)[XLEN-6:XLEN-1]
    v <- ROTL128(v, sh)
    mask <- (1 << sh) - 1
    v <- v | (RC & mask)
    RT <- v[XLEN:XLEN*2-1]
    RS <- v[0:XLEN-1]
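a self-contained sketch chaining these carrying definitions across a 2-word
bigint, checked against Python big-integer shifts (the chaining test is my
own; the dsld/dsrd bodies follow the Python definitions above, with RS
returned as a second result):

```python
def dsld(RA, RB, RC):
    """carrying double shift left; returns (RT, RS) = (result, carry-out)"""
    sh = RB % 64
    v = (RA << sh) | (RC & ((1 << sh) - 1))
    return v % 2 ** 64, v >> 64

def dsrd(RA, RB, RC):
    """carrying double shift right; returns (RT, RS) = (result, carry-out)"""
    sh = RB % 64
    v = (RA << 64) >> sh
    RS = v % 2 ** 64                       # bits shifted out, at the lsb end
    RT = (v >> 64) | (RC & ~((2 ** 64 - 1) >> sh))
    return RT, RS

# shift a 2-word bigint left by 5 bits, chaining the carry LSW -> MSW
w0, w1, sh = 0xFEDCBA9876543210, 0x0123456789ABCDEF, 5
l0, c = dsld(w0, sh, 0)
l1, c = dsld(w1, sh, c)
value = (w1 << 64) | w0
assert (l1 << 64) | l0 == (value << sh) % 2 ** 128

# and right by 5 bits, chaining the carry MSW -> LSW
q1, c = dsrd(w1, sh, 0)
q0, c = dsrd(w0, sh, c)
assert (q1 << 64) | q0 == value >> sh
```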

got them working with a 128-bit ROTL. there's a way i know to do it without,
but i can't quite work it out. at least it's functional.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=eea94034582ed7686a6150062e321b46b87c3e1b
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f21545d3f4bbb6fd27a42790afaa490540e8ed9a

(In reply to Luke Kenneth Casson Leighton from comment #17)
> got them working with a 128-bit ROTL, there's a way i know
> to do it without but i can't quite work it out. at least
> it's functional.

imho it's not quite right yet. since we have the carrying version, the
bigint tests need to be:

    # don't use r1, it's the stack pointer
    sv.dsld *16, *16, 3, 4      # just like sv.maddedu, but shifting
    sv.dsrd/mrr *16, *16, 3, 4  # just like sv.divmod2du, but shifting

also try signed right shift, where r4 is initialized to repl(msb, 64):

    sradi 4, 18, 63
    sv.dsrd/mrr *16, *16, 3, 4

afaict RS needs to be the lsb bits and RT the msb bits for dsrd, because if
you think about it, RS is the carry -- the bits shifted out of the result;
those bits are shifted right, so they end up at the lsb end.
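the sradi sign-splat trick above can be modelled in Python (a sketch of mine:
the dsrd body restates the carrying definition from comment #16, and the
point being demonstrated is that seeding the first carry with repl(msb, 64)
turns the chained unsigned shift into an arithmetic one):

```python
def dsrd(RA, RB, RC):
    """carrying double shift right; returns (RT, RS) = (result, carry-out)"""
    sh = RB % 64
    v = (RA << 64) >> sh
    return (v >> 64) | (RC & ~((2 ** 64 - 1) >> sh)), v % 2 ** 64

# arithmetic right shift of a negative 128-bit value by 7, seeding the
# first carry with repl(msb, 64), as sradi of the top word would produce
w0, w1, sh = 0x0123456789ABCDEF, 0xFEDCBA9876543210, 7
sign = (2 ** 64 - 1) if (w1 >> 63) else 0     # repl(msb, 64)
r1, c = dsrd(w1, sh, sign)
r0, c = dsrd(w0, sh, c)

# reference: Python's >> on a negative int is already arithmetic
signed_value = ((w1 << 64) | w0) - 2 ** 128   # w1's msb is set -> negative
assert (r1 << 64) | r0 == (signed_value >> sh) % 2 ** 128
```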

(In reply to Jacob Lifshay from comment #18)
> imho it's not quite right yet, since we have the carrying version, the
> bigint tests need to be:
> sv.dsld *16, *16, 3, 4 # just like sv.maddedu, but shifting
> sv.dsrd/mrr *16, *16, 3, 4 # just like sv.divmod2du, but shifting

with even-numbered RB, yes

> also try signed right shift, where r4 is initialized to repl(msb, 64):
> sradi 4, 18, 63
> sv.dsrd/mrr *16, *16, 3, 4
>
> afaict RS needs to be the lsb bits and RT the msb bits for dsrd, because if
> you think about it, RS is the carry -- the bits shifted out of the result,
> those bits are shifted right so are at the lsb end.

yep. i'd cut/pasted 64-n instead of 128-n, which had the effect of
unintentionally swapping RS and RT as the carry-part and result-part...
sort-of.

--- a/openpower/isa/svfixedarith.mdwn
+++ b/openpower/isa/svfixedarith.mdwn
@@ -75,7 +75,7 @@ VA2-Form
 Pseudo-code:

     n <- (RB)[58:63]
-    v <- ROTL128((RA) || [0]*64, 64-n)
+    v <- ROTL128((RA) || [0]*64, 128-n)
     mask <- ¬MASK(n, 63)
     RT <- v[0:63] | ((RC) & mask)
     RS <- v[64:127]

> afaict RS needs to be the lsb bits and RT the msb bits for dsrd,

likely fixed by above, can you confirm?

(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Jacob Lifshay from comment #18)
> > imho it's not quite right yet, since we have the carrying version, the
> > bigint tests need to be:
> > sv.dsld *16, *16, 3, 4 # just like sv.maddedu, but shifting
> > sv.dsrd/mrr *16, *16, 3, 4 # just like sv.divmod2du, but shifting
>
> with even-numbered RB, yes

no, RB and RC are scalar, they can be odd-numbered.

> > afaict RS needs to be the lsb bits and RT the msb bits for dsrd,
>
> likely fixed by above, can you confirm?

i meant that needed to be changed in the unit tests. I fixed the unit tests,
and replaced scalar RB, RC regs with r3, r5:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=7d58514beb36c313ccf13a0f14686bd68738f40d

(sorry, i just realized i forgot to split out the formatting code as a
separate commit)

it turns out that carrying-4-arg dsld/dsrd don't require extra2-3-2-2
format. i think the experiment is successful, so imho it should be merged
into master. this will resolve my concern in the top comment about dsld/dsrd
being difficult to use.

lkcl, what do you think?

(In reply to Jacob Lifshay from comment #20)
> no, RB and RC are scalar, they can be odd numbered.

doh, forgot.

> > > afaict RS needs to be the lsb bits and RT the msb bits for dsrd,
> >
> > likely fixed by above, can you confirm?
>
> i meant that needed to be changed in the unit tests. I fixed the unit
> tests, and replaced scalar RB, RC regs with r3, r5:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=7d58514beb36c313ccf13a0f14686bd68738f40d

nice. yep, makes it clear scalar can be r0-r63.

> (sorry, i just realized i forgot to split out formatting code as a separate
> commit)

doh :)

(In reply to Jacob Lifshay from comment #21)
> it turns out that carrying-4-arg dsld/dsrd don't require extra2-3-2-2
> format.

hoo-rah.

> i think the experiment is successful, so imho should be merged into
> master. this will resolve my concern in the top comment about dsld/dsrd
> being difficult to use.

excellent, which means they can be added to ls004.

> lkcl, what do you think?

yep, unit tests pass, and i've been rebasing regularly.

for bigint shift (as well as other useful stuff that needs dynamic
shifting), we will want the equivalent of RVV's vslideup/vslidedown -- they
shift a vector by a variable number of whole elements (dsld/dsrd handles
shifts within elements). Technically, they can be done by using svindex, but
that wastes a bunch of extra registers, takes extra time to set up the
indices' values, and is likely to be quite slow.

Additional instructions for shifts by constant amounts aren't needed, since
we can use svstate.offset combined with adjusting register numbers in the
instruction.

(In reply to Jacob Lifshay from comment #23)
> for bigint shift, (as well as other useful stuff that needs dynamic
> shifting), we will want the equivalent of RVV's vslideup/vslidedown -- they
> shift a vector by a variable number of whole elements (dsld/dsrd handles
> shifts within elements).

svoffset. leave it in-place. vslide* may be synthesised by applying svoffset
to a mv instruction.

alternatively, using twin-predication, the front (or back) of a predicate
can be set to zeros.

(In reply to Luke Kenneth Casson Leighton from comment #24)
> (In reply to Jacob Lifshay from comment #23)
> > for bigint shift, (as well as other useful stuff that needs dynamic
> > shifting), we will want the equivalent of RVV's vslideup/vslidedown --
> > they shift a vector by a variable number of whole elements (dsld/dsrd
> > handles shifts within elements).
>
> svoffset. leave it in-place. vslide* may be synthesised by applying
> svoffset to a mv instruction.

iirc svoffset isn't big enough. also, iirc there is no easy way to set
svoffset dynamically: you have to construct the SPR state manually.

> alternatively using twin-predication the front (or back) of a predicate
> can be set to zeros.

that will work, but needs several additional instructions... also, we'd
probably have to have the hardware special-case detect shifting masks, since
fully general twin-predication is likely to be nearly as slow as svindex.

removed from the CryptoRouter milestone, since there are no funds left in
the top-level subtask or the parent task of this task. This could go under
bug #1011 if that gets more funds.