Bug 937 - instructions for bigint shift and prefix-code encode
Summary: instructions for bigint shift and prefix-code encode
Status: IN_PROGRESS
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on: 1237
Blocks: 933 817
Reported: 2022-09-25 21:32 BST by Jacob Lifshay
Modified: 2023-12-19 08:16 GMT (History)
2 users

See Also:
NLnet milestone: Future
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2022-09-25 21:32:02 BST
todo list:
* DONE: resolve difficulty-of-use issues with dsld/dsrd (need splat to RT vector or copy of input as well as a separate scalar shift for MSB/LSB word for bigint shift left/right/arithmetic-right) -- something more like divmod2du/maddedu with an explicit carrying input/output will likely work much better, though it requires 4-arg instructions again...

* TODO: figure out how we want to quickly shift by a dynamic number of whole elements; svindex is too resource-hungry. https://bugs.libre-soc.org/show_bug.cgi?id=937#c23


untested code for 3-in 1-out unsigned-shift-by-signed-amount instructions (dshlsd and dshrsd below; we'll also want a 3-in 1-out signed shift to handle sign extension at the MSB end of some bigint formats, and for signed 128-bit integers)

def u64(v):
    """convert to u64"""
    return int(v) % 2 ** 64

def i64(v):
    """convert to i64"""
    v = u64(v)
    if v >> 63:
        v -= 2 ** 64
    return v

def dshlsd(RA, RB, RC):
    """double-width unsigned-shift-left signed-shift dword"""
    RA = u64(RA)
    RB = u64(RB)
    RC = i64(RC)
    v = (RA << 64) | RB
    if RC <= -64 or RC >= 128:
        return 0
    elif RC < 0:
        v >>= -RC
    else:
        v <<= RC
    return u64(v >> 64)  # return high 64-bits

def dshrsd(RA, RB, RC):
    """double-width unsigned-shift-right signed-shift dword"""
    RA = u64(RA)
    RB = u64(RB)
    RC = i64(RC)
    v = (RA << 64) | RB
    if RC <= -64 or RC >= 128:
        return 0
    elif RC < 0:
        v <<= -RC
    else:
        v >>= RC
    return u64(v)  # return low 64-bits
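A quick sanity-check of the signed-shift convention (a self-contained sketch restating the helpers above; positive RC shifts left, negative RC shifts right):

```python
def u64(v):
    return int(v) % 2 ** 64

def i64(v):
    v = u64(v)
    return v - 2 ** 64 if v >> 63 else v

def dshlsd(RA, RB, RC):
    RA, RB, RC = u64(RA), u64(RB), i64(RC)
    v = (RA << 64) | RB
    if RC <= -64 or RC >= 128:
        return 0
    v = v >> -RC if RC < 0 else v << RC
    return u64(v >> 64)  # high 64 bits

# positive RC: left shift, RB's top bit spills into the result
assert dshlsd(1, 1 << 63, 1) == 3
# negative RC: right shift instead
assert dshlsd(1 << 63, 0, -63) == 1
# a u64 holding -64 is re-interpreted as signed: out of range, so 0
assert dshlsd(1, 0, 2 ** 64 - 64) == 0
```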

def bigint_shl(inp, out, shift):
    if shift < 0:
        return bigint_shr(inp, out, -shift)
    offset = shift // 64
    shift %= 64
    for i in range(len(out)):
        # limbs outside inp are zero: guard explicitly, since Python's
        # negative indexing would otherwise wrap around
        hi = inp[i - offset] if 0 <= i - offset < len(inp) else 0
        lo = inp[i - offset - 1] if 0 <= i - offset - 1 < len(inp) else 0
        out[i] = dshlsd(hi, lo, shift)

def bigint_shr(inp, out, shift):
    if shift < 0:
        return bigint_shl(inp, out, -shift)
    offset = shift // 64
    shift %= 64
    for i in range(len(out)):
        # limbs beyond the top of inp are zero
        hi = inp[i + offset + 1] if i + offset + 1 < len(inp) else 0
        lo = inp[i + offset] if i + offset < len(inp) else 0
        out[i] = dshrsd(hi, lo, shift)
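A runnable sketch checking the bigint shifts against Python's native big integers; dshlsd/dshrsd are restated simplified to the non-negative shift amounts (0..63) these loops actually pass in, and out-of-range limb indexes are treated as zero (Python's negative indexing would otherwise wrap):

```python
def u64(v):
    return int(v) % 2 ** 64

def dshlsd(RA, RB, RC):
    # simplified: RC assumed in 0..63
    return u64((((u64(RA) << 64) | u64(RB)) << RC) >> 64)

def dshrsd(RA, RB, RC):
    # simplified: RC assumed in 0..63
    return u64(((u64(RA) << 64) | u64(RB)) >> RC)

def limb(inp, i):
    # limbs outside the array are zero
    return inp[i] if 0 <= i < len(inp) else 0

def bigint_shl(inp, out, shift):
    offset, shift = shift // 64, shift % 64
    for i in range(len(out)):
        out[i] = dshlsd(limb(inp, i - offset), limb(inp, i - offset - 1), shift)

def bigint_shr(inp, out, shift):
    offset, shift = shift // 64, shift % 64
    for i in range(len(out)):
        out[i] = dshrsd(limb(inp, i + offset + 1), limb(inp, i + offset), shift)

def to_int(limbs):  # little-endian 64-bit limbs
    return sum(v << (64 * i) for i, v in enumerate(limbs))

inp = [0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x1122334455667788]
out = [0] * 4
bigint_shl(inp, out, 100)
assert to_int(out) == (to_int(inp) << 100) % 2 ** 256
bigint_shr(inp, out, 100)
assert to_int(out) == to_int(inp) >> 100
```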

def pcenc(symbols, start, symbol_table):
    """ prefix-codes encode
    symbols: list[int]
        symbol indexes
    start: int
        LSB0 bit-index to start encoding at, must be < 64
    symbol_table: list[tuple[int, int]]
        pairs of bits and bit lengths for symbols
    Returns: tuple[list[int], int]
        the output 64-bit words, and the LSB0 bit-index in the last word
        where trailing zero-padding starts.
    """
    out = [0]
    while len(symbols):
        for i in range(len(symbols)):
            # gather-load and shifts and ands to extract fields:
            sym_bits, sym_len = symbol_table[symbols[i]]
            # prefix-sum:
            cur_start = start
            start += sym_len
            # shift then or-reduce:
            out[-1] |= dshlsd(sym_bits, 0, cur_start)
            # data-dependent fail-first:
            if start >= 64:
                # part of outer loop
                carry = dshlsd(0, sym_bits, cur_start)
                start -= 64
                out.append(carry)
                symbols = symbols[i + 1:]
                break
        else:
            # the for-loop finished without a break: every remaining
            # symbol fit in the current word, so encoding is done
            break
    return out, start
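A runnable rendition of pcenc for checking (a sketch: sym_bits is used as a plain int, the while-loop exits once the remaining symbols all fit in the current word, dshlsd is simplified to the non-negative shifts pcenc uses, and the prefix-code table is a made-up toy example):

```python
def u64(v):
    return int(v) % 2 ** 64

def dshlsd(RA, RB, RC):
    # simplified: RC assumed in 0..63
    return u64((((u64(RA) << 64) | u64(RB)) << RC) >> 64)

def pcenc(symbols, start, symbol_table):
    out = [0]
    while len(symbols):
        for i in range(len(symbols)):
            sym_bits, sym_len = symbol_table[symbols[i]]
            cur_start = start
            start += sym_len
            out[-1] |= dshlsd(sym_bits, 0, cur_start)
            if start >= 64:
                carry = dshlsd(0, sym_bits, cur_start)
                start -= 64
                out.append(carry)
                symbols = symbols[i + 1:]
                break
        else:
            break  # every remaining symbol fit in the current word
    return out, start

# hypothetical toy prefix code (LSB0 bit order): 0 -> '0', 1 -> '10', 2 -> '11'
table = [(0b0, 1), (0b01, 2), (0b11, 2)]
out, end = pcenc([2, 1, 0] * 13, 0, table)  # 13 * 5 = 65 bits total
assert out == [sum(7 << (5 * g) for g in range(13)), 0]
assert end == 1
```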
Comment 1 Luke Kenneth Casson Leighton 2022-09-25 22:14:17 BST
with the shift-up placing zeros into the LSBs if RC>0
can you explain clearly why the following alteration
is not correct for pcdec?

    elif RC < 0:
        v <<= -RC
        return u64(v>>64)  # return hi 64-bits
    else:
        v >>= RC
        return u64(v)  # return low 64-bits

this will be fun

    for i in range(len(out)):
        out[i] = dshrsd(inp[i + offset + 1], inp[i + offset], shift)

it will need to be an overwrite (RT-as-source) in order to get it into
RM-1P-2S1D as dshrsd RT,RA,RB so that EXTRA3 may be used.  as 3-in 1-out
RT,RA,RB,RC there is no room for EXTRA3 it would be EXTRA2 which cannot
handle odd vector regnums except through svoffset

the only thing is, reverse-gear needed on dshlsd to avoid overwrite.
which is fine.

RT,RA,RB would be X-Form which would reduce opcode pressure greatly.

prefix-sum btw is doable with mapreduce using the right offsets of
RT RA and RB (1 different) probably RT=RA+1
Comment 2 Jacob Lifshay 2022-09-26 00:53:06 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> with the shift-up placing zeros into the LSBs if RC>0
> can you explain clearly why the following alteration
> is not correct for pcdec?

assuming you mean pcenc

> 
>     elif RC < 0:
>         v <<= -RC
>         return u64(v>>64)  # return hi 64-bits
>     else:
>         v >>= RC
>         return u64(v)  # return low 64-bits

it turns out that signed shift amounts aren't actually necessary for the pcenc algorithm as written above, i thought they were necessary because I was thinking of an alternative algorithm that i never got fully working.

> it will need to be an overwrite (RT-as-source) in order to get it into
> RM-1P-2S1D as dshrshd RT,RA,RB so that EXTRA3 may be used.  as 3-in 1-out
> RT,RA,RB,RC there is no room for EXTRA3 it would be EXTRA2 which cannot
> handle odd vector regnums except through svoffset

as mentioned on irc:
https://libre-soc.org/irclog/%23libre-soc.2022-09-25.log.html#t2022-09-25T23:24:46
imho we should have a few variants if we need overwrite, since that makes it more flexible:
> yes, it can be an overwrite variant, imho if we do that we should provide
> several variants for each input we overwrite: e.g. RT = op(RT, RA, RB),
> RT= op(RA, RT, RB), RT = op(RA, RB, RT), RT=op(0, RA, RB), RT=op(RA, 0, RB)

also on irc:
> for pcenc it has to reduce into several dynamically-determined outputs, so just a traditional mapreduce won't work

After writing that, I thought of an algorithm where we could use 3 prefix-sums
for the whole function, rather than a separate reduction for every output dword:
(again, code not tested)

def pcenc(symbols, symbol_table):
    """ prefix-codes encode
    symbols: list[int]
        symbol indexes
    symbol_table: list[tuple[int, int]]
        pairs of bits and bit lengths for symbols
    Returns: tuple[list[int], int]
        the output 64-bit words, and the total number of output bits.
    """
    # gather-load and shifts and ands to extract fields:
    sym_bits = [0] * len(symbols)
    sym_len = [0] * len(symbols)
    for i in range(len(symbols)):
        sym_bits[i], sym_len[i] = symbol_table[symbols[i]]

    # prefix-sum:
    start = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        start[i + 1] = start[i] + sym_len[i]

    # shift:
    shifted0 = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        shifted0[i + 1] = u64(sym_bits[i] << (start[i] % 64))
    shifted1 = [0] * (len(symbols) + 1)
    for i in range(len(symbols)):
        shifted1[i + 1] = (sym_bits[i] << (start[i] % 64)) >> 64

    # for pedagogical purposes, not needed in final algorithm:
    # orig_shifted0 = shifted0.copy()
    # orig_shifted1 = shifted1.copy()

    # xor prefix-sum (can't use bitwise-or because it's not invertible):
    for i in range(len(symbols)):
        shifted0[i + 1] ^= shifted0[i]
    for i in range(len(symbols)):
        shifted1[i + 1] ^= shifted1[i]

    # scatter or twin-pred:
    out_to_in_map = [0] * (start[-1] // 64 + 1)
    for i in reversed(range(len(start))):
        # reversed so the *first* symbol-index starting in each
        # output word wins
        out_to_in_map[start[i] // 64] = i

    # gather and xor:
    out = [0] * len(out_to_in_map)
    for i in range(len(out_to_in_map)):
        # extract a subsequence of orig_shifted0 by xor-ing the
        # start/end of the subsequence from the xor prefix-summed sequence
        lo = out_to_in_map[i]
        hi = out_to_in_map[i + 1] if i + 1 < len(out_to_in_map) else len(symbols)
        out[i] = shifted0[lo] ^ shifted0[hi]
    for i in range(1, len(out_to_in_map)):
        # extract a subsequence of orig_shifted1 by xor-ing the
        # start/end of the subsequence from the xor prefix-summed sequence
        out[i] ^= shifted1[out_to_in_map[i - 1]]
        out[i] ^= shifted1[out_to_in_map[i]]
    return out, start[-1]
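A condensed, runnable rendition of the above for checking (a sketch assuming codes of at most 64 bits, so every output word contains at least one symbol start; the scatter keeps the first symbol-index per output word, and the gather uses an end sentinel for the last word):

```python
def u64(v):
    return int(v) % 2 ** 64

def pcenc(symbols, symbol_table):
    sym_bits = [symbol_table[s][0] for s in symbols]
    sym_len = [symbol_table[s][1] for s in symbols]
    # prefix-sum of starting bit positions
    start = [0]
    for n in sym_len:
        start.append(start[-1] + n)
    # per-symbol contributions to the current and the next 64-bit word
    shifted0 = [0] + [u64(b << (s % 64)) for b, s in zip(sym_bits, start)]
    shifted1 = [0] + [(b << (s % 64)) >> 64 for b, s in zip(sym_bits, start)]
    # xor prefix-sums (xor, not or, because xor is invertible)
    for i in range(len(symbols)):
        shifted0[i + 1] ^= shifted0[i]
        shifted1[i + 1] ^= shifted1[i]
    # first symbol-index starting in each word (reversed so first wins)
    out_to_in_map = [0] * (start[-1] // 64 + 1)
    for i in reversed(range(len(start))):
        out_to_in_map[start[i] // 64] = i
    # gather and xor: a range of the original sequence is recovered by
    # xor-ing two entries of its xor prefix-sum
    out = [0] * len(out_to_in_map)
    for i in range(len(out)):
        lo = out_to_in_map[i]
        hi = out_to_in_map[i + 1] if i + 1 < len(out) else len(symbols)
        out[i] = shifted0[lo] ^ shifted0[hi]
    for i in range(1, len(out)):
        out[i] ^= shifted1[out_to_in_map[i - 1]] ^ shifted1[out_to_in_map[i]]
    return out, start[-1]

B = 0x123456789ABCDEF  # hypothetical 57-bit code, packed at 60-bit spacing
out, total = pcenc([0, 0, 0], [(B, 60)])
assert total == 180
assert sum(w << (64 * i) for i, w in enumerate(out)) == B | (B << 60) | (B << 120)
```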
Comment 3 Jacob Lifshay 2022-09-26 01:10:34 BST
(In reply to Jacob Lifshay from comment #2)
> > it will need to be an overwrite (RT-as-source) in order to get it into
> > RM-1P-2S1D as dshrsd RT,RA,RB so that EXTRA3 may be used.  as 3-in 1-out
> > RT,RA,RB,RC there is no room for EXTRA3 it would be EXTRA2 which cannot
> > handle odd vector regnums except through svoffset
> 
> as mentioned on irc:
> https://libre-soc.org/irclog/%23libre-soc.2022-09-25.log.html#t2022-09-25T23:
> 24:46
> imho we should have a few variants if we need overwrite, since that makes it
> more flexible:
> > yes, it can be an overwrite variant, imho if we do that we should provide
> > several variants for each input we overwrite: e.g. RT = op(RT, RA, RB),
> > RT= op(RA, RT, RB), RT = op(RA, RB, RT), RT=op(0, RA, RB), RT=op(RA, 0, RB)

If we change the shifts to shift mod XLEN (specifically XLEN, not XLEN * 2) instead of signed shifts, then we only need 8 ops total. This is like all other PowerISA shifts. Shifting by >= XLEN is unnecessary for the pcenc algorithm in comment #2 and for bigint shift:
def shl_op(a, b, c):
    # just like x86 shld for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v <<= c % 64
    return u64(v >> 64)

def shr_op(a, b, c):
    # just like x86 shrd for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v >>= c % 64
    return u64(v)

* RT = shl_op(RT, RA, RB)
* RT = shl_op(RA, RT, RB)
* RT = shl_op(RA, RB, RT)
* RT = shl_op(0, RA, RB)
* RT = shr_op(RT, RA, RB)
* RT = shr_op(RA, RT, RB)
* RT = shr_op(RA, RB, RT)
* RT = shr_op(RA, 0, RB)

Unnecessary ops:
* RT = shl_op(RA, 0, RB) is just plain shl
* RT = shr_op(0, RA, RB) is just plain shr
* because the shift amount is never >= 64, sign/zero extension doesn't matter since none of the output bits would differ, hence no signed-right-shift is needed.
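A sketch checking shl_op/shr_op against the usual x86 shld/shrd formulas (the operand values are arbitrary):

```python
def u64(v):
    return int(v) % 2 ** 64

def shl_op(a, b, c):
    # just like x86 shld for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v <<= c % 64
    return u64(v >> 64)

def shr_op(a, b, c):
    # just like x86 shrd for 64-bit values
    v = (u64(a) << 64) | u64(b)
    v >>= c % 64
    return u64(v)

a, b = 0x8000000000000001, 0xF000000000000000
# shift by 0 passes the high (resp. low) word through unchanged
assert shl_op(a, b, 0) == a and shr_op(a, b, 0) == b
for c in range(1, 64):
    assert shl_op(a, b, c) == u64((a << c) | (b >> (64 - c)))
    assert shr_op(a, b, c) == u64((b >> c) | (a << (64 - c)))
```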
Comment 4 Luke Kenneth Casson Leighton 2022-09-26 02:48:20 BST
(In reply to Jacob Lifshay from comment #2)
 
>     # shift:
>     shifted0 = [0] * (len(symbols) + 1)
>     for i in range(len(symbols)):
>         shifted0[i + 1] = u64(sym_bits[i] << (start[i] % 64))
>     shifted1 = [0] * (len(symbols) + 1)
>     for i in range(len(symbols)):
>         shifted1[i + 1] = (sym_bits[i] << (start[i] % 64)) >> 64

doubleshifts.

>     # for pedagogical purposes, not needed in final algorithm:
>     # orig_shifted0 = shifted0.copy()
>     # orig_shifted1 = shifted1.copy()
> 
>     # xor prefix-sum (can't use bitwise-or because it's not invertible):
>     for i in range(len(symbols)):
>         shifted0[i + 1] ^= shifted0[i]
>     for i in range(len(symbols)):
>         shifted1[i + 1] ^= shifted1[i]

interesting. sorta making sense
 
>     # scatter or twin-pred:
>     out_to_in_map = [0] * (start[-1] // 64 + 1)
>     for i in range(len(start)):
>         out_to_in_map[start[i] // 64] = i

sv.svstep gets indices, grep examples for "iota".
binary incrementing numbers, turning into unary
bitmask? interesting. oh - easy. shift then or-reduction
(again).  failfirst-cmpi truncates VL to ensure
start[i] no greater than vector len.
Comment 5 Luke Kenneth Casson Leighton 2022-09-28 19:37:26 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=10ddc342b1fc075712668bb72463ec811d121949

for review (feel free to edit/fix), you can see what i attempted
to do, use a 7-bit shift and if greater than 64 start zeroing out.
this may not be sophisticated enough and/or conflict with the
modulo 64 ideas, i leave it with you to best resolve?
Comment 6 Jacob Lifshay 2022-09-28 20:11:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=10ddc342b1fc075712668bb72463ec811d121949
> 
> for review (feel free to edit/fix), you can see what i attempted
> to do, use a 7-bit shift and if greater than 64 start zeroing out.
> this may not be sophisticated enough and/or conflict with the
> modulo 64 ideas, i leave it with you to best resolve?

it does conflict with the modulo-64 idea.

one benefit of the modulo-64 idea is we don't need a 128-bit rotator, 64-bit will do afaict -- testing needed:

for the RT <- ((RA || RB) << ((RT) % 64)) >> 64 variant:

n <- (RT)[58:63]
mask[0:63] <- MASK(0, 63 - n)
# mux
v[0:63] <- ((RA) & mask) | ((RB) & ~mask)
RT <- ROTL64(v, n)
Comment 7 Jacob Lifshay 2022-09-28 20:51:42 BST
(In reply to Jacob Lifshay from comment #6)
> one benefit of the modulo-64 idea is we don't need a 128-bit rotator, 64-bit
> will do afaict -- testing needed:
> 
> for the RT <- ((RA || RB) << ((RT) % 64)) >> 64 variant:

tested python version:
def rotl64(a, b):
    a %= 2 ** 64
    return ((a << b % 64) | (a >> -b % 64)) % 2 ** 64

def dshl(a, b, n):
    mask = (1 << (64 - n % 64)) - 1
    v = (a & mask) | (b & ~mask)
    return rotl64(v, n)
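The claim that a 64-bit rotator suffices can be cross-checked against the 128-bit definition RT <- ((RA || RB) << (n % 64)) >> 64 (a sketch repeating the functions above):

```python
def rotl64(a, b):
    a %= 2 ** 64
    return ((a << b % 64) | (a >> -b % 64)) % 2 ** 64

def dshl(a, b, n):
    # mask-mux the two sources, then a single 64-bit rotate
    mask = (1 << (64 - n % 64)) - 1
    v = (a & mask) | (b & ~mask)
    return rotl64(v, n)

for n in range(64):
    for a, b in [(0x0123456789ABCDEF, 0xFEDCBA9876543210), (2 ** 64 - 1, 1)]:
        # direct 128-bit formulation for comparison
        expect = ((((a << 64) | b) << n) >> 64) % 2 ** 64
        assert dshl(a, b, n) == expect
```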
Comment 8 Luke Kenneth Casson Leighton 2022-09-28 21:02:20 BST
(In reply to Jacob Lifshay from comment #7)

> def dshl(a, b, n):
>     mask = (1 << (64 - n % 64)) - 1
>     v = (a & mask) | (b & ~mask)
>     return rotl64(v, n)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0a43abe68960962f419b68d0357ad4b4274a9b74

muxes for selecting sources is next
Comment 9 Jacob Lifshay 2022-09-29 04:21:43 BST
I fleshed out dsld/dsrd:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a8332a548969f36d6e747c1b75735bc13194f8a9

and added unit tests for all bigint operations:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=63530e061b28e868f038c3a5515db50c3fb2b9c8

I added dsld/dsrd to minor_31.csv, but ran into issues when I attempted to convert it to pattern mode -- apparently power_decoder.py suffix handling is broken for patterns:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=864d526726aa2ed5497ca6ebe4445a021e16ef9a
Comment 10 Jacob Lifshay 2022-09-29 04:37:36 BST
(edit: fix typo -- RT input was given twice for ds[lr]ds)

(In reply to Luke Kenneth Casson Leighton from comment #8)
> muxes for selecting sources is next

I know I put the input muxing in the pseudocode for dsld/dsrd since it seemed like that's what you were planning, but imho it would be better to have 4 different instruction mnemonics rather than an additional argument:
suggested renames:
add hlsz to mnemonic instead of a 0123 argument:

dsld RT, RA, RB, 0 -> dsldh RT, RA, RB -- h for RT being hi input
RT = dsld(RT, RA, RB)

dsld RT, RA, RB, 1 -> dsldl RT, RA, RB -- l for RT being lo input
RT = dsld(RA, RT, RB)

dsld RT, RA, RB, 2 -> dslds RT, RA, RB -- s for RT being shift input
RT = dsld(RA, RB, RT)

dsld RT, RA, RB, 3 -> dsldz RT, RA, RB -- z for zero input
RT = dsld(0, RA, RB)


dsrd RT, RA, RB, 0 -> dsrdh RT, RA, RB -- h for RT being hi input
RT = dsrd(RT, RA, RB)

dsrd RT, RA, RB, 1 -> dsrdl RT, RA, RB -- l for RT being lo input
RT = dsrd(RA, RT, RB)

dsrd RT, RA, RB, 2 -> dsrds RT, RA, RB -- s for RT being shift input
RT = dsrd(RA, RB, RT)

dsrd RT, RA, RB, 3 -> dsrdz RT, RA, RB -- z for zero input
RT = dsrd(RA, 0, RB)
Comment 11 Luke Kenneth Casson Leighton 2022-09-29 09:15:59 BST
(In reply to Jacob Lifshay from comment #9)
> I fleshed out dsld/dsrd:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=a8332a548969f36d6e747c1b75735bc13194f8a9
> 
> and added unit tests for all bigint operations:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=63530e061b28e868f038c3a5515db50c3fb2b9c8

fantastic!

> I added dsld/dsrd to minor_31.csv, but ran into issues when I attempted to
> convert it to pattern mode 

yeah don't do that for now.

(In reply to Jacob Lifshay from comment #10)

> I know I put the input muxing in the pseudocode for dsld/dsrd since it
> seemed like that's what you were planning, but imho it would be better to
> have 4 different instruction mnemonics rather than an additional argument:

that means proposing 8 instructions not 2. too many.
fpgpr needs to be similarly drastically reduced.
list of advised pseudoaliases instead.

suggest pseudoalias u for upper. h confused with halfword.
Comment 12 Luke Kenneth Casson Leighton 2022-10-22 13:37:52 BST
(In reply to Jacob Lifshay from comment #0)
> todo list:
> * TODO(programmerjake): resolve difficulty of use issues with dsld/dsrd
> (need splat to RT vector or copy of input as well as a separate scalar shift
> for MSB/LSB word for bigint shift left/right/arithmetic-right) -- something
> more like divmod2du/maddedu with an explicit carrying input/output will
> likely work much better, though requires 4-arg instructions again...

rright.  okay. table 13 p1358, Power ISA v3.1, i didn't realise there are
another 3 columns available (111000, 111001, 111010) as well as four columns
back on p1353 (010000- 010011)

therefore yes it's likely ok to use the 111000-111010 group for shift, making
them 4-arg.
Comment 13 Luke Kenneth Casson Leighton 2022-10-22 18:52:40 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> rright.  okay. table 13 p1358, Power ISA v3.1, i didn't realise there are
> another 3 columns available (111000, 111001, 111010) as well as four columns
> back on p1353 (010000- 010011)
> 
> therefore yes it's likely ok to use the 111000-111010 group for shift, making
> them 4-arg.

well that went badly.  Vector sv.dsld is required to have the HI-LO source
vector offset by one, pointing to the same vector(+1).  EXTRA2 is incapable
of that.

jacob i have no idea what you need different behaviour in dsld/dsrd for,
so can't begin to go through some options.  what's needed, here?
Comment 14 Luke Kenneth Casson Leighton 2022-10-22 19:49:05 BST
>            # shift then or-reduce:
>            out[-1] |= dshlsd(sym_bits[i], 0, cur_start)

if the intention here is to perform a type of vector-bitmask-merge-to-scalar
(or the inverse, extract-from-scalar-to-vector) i don't have a problem
with creating a special instruction for that although i feel it is important
to try *really hard* to find an existing instruction which might do the job.

the general idea being to have a vector of (offset,length)s to either
extract (or insert) bits (fields)

unfortunately i have a sneaking suspicion that VSX has something like this.
Comment 15 Luke Kenneth Casson Leighton 2022-10-22 20:19:51 BST
nope.  if i understand what you want to do it is a *4-in 1-out* operation
which is far too much.  a shift-mask-vector followed by OR-reduction would
do the trick, although the shift-mask-vector ideally needs to be a
3-in 1-out non-immediate version of rldcl:

VA2-Form

rldcl2 RA,RS,RB,RC (Rc=0)
rldcl2. RA,RS,RB,RC (Rc=1)

Pseudo-code:

n <- (RB)[XLEN-5:XLEN-1]
r <- ROTL64((RS), n)
b <- (RC)[XLEN-5:XLEN-1]
m <- MASK(b, (XLEN-1))
RA <- r & m

or probably better

m <- MASK(0, b)
RA <- r & m

this does interestingly fall into the bit-extract category, which i added
in bitmanip about... 5 months ago?
Comment 16 Luke Kenneth Casson Leighton 2022-10-25 16:26:47 BST
http://lists.libre-soc.org/pipermail/libre-soc-dev/2022-October/005411.html

the carry chaining version could be (there are a few alternative
definitions):
dsld RT, RA, RB, RC
v = RA
sh = RB % 64
v <<= sh
mask = (1 << sh) - 1
v |= RC & mask
RT = v & (2 ** 64 - 1)
RS = v >> 64

dsrd RT, RA, RB, RC
v = RA << 64
sh = RB % 64
v >>= sh
RS = v & (2 ** 64 - 1)
mask = ~((2 ** 64 - 1) >> sh)
v >>= 64
v |= RC & mask
RT = v

dsld
v <- [0]*XLEN || (RA)
sh <- (RB)[XLEN-6:XLEN-1]
v <- ROTL128(v, sh)
mask <- (1 << sh) - 1
v <- v | ((RC) & mask)
RT <- v[XLEN:XLEN*2-1]
RS <- v[0:XLEN-1]
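The carry chaining can be modelled in Python (a sketch; vec_shl is a hypothetical helper standing in for an sv.dsld loop, with RS feeding the next element's RC):

```python
def dsld(RA, RB, RC):
    sh = RB % 64
    v = (RA % 2 ** 64) << sh
    v |= RC & ((1 << sh) - 1)   # carry-in fills the vacated low bits
    return v % 2 ** 64, v >> 64  # (RT, RS): result and carry-out

def vec_shl(limbs, sh):
    out, carry = [], 0
    for limb in limbs:  # little-endian limb order
        rt, carry = dsld(limb, sh, carry)
        out.append(rt)
    return out, carry

limbs = [0xFEDCBA9876543210, 0x0123456789ABCDEF]
out, carry = vec_shl(limbs, 8)
x = limbs[0] | (limbs[1] << 64)
assert out[0] | (out[1] << 64) | (carry << 128) == x << 8
```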
Comment 17 Luke Kenneth Casson Leighton 2022-10-27 15:49:05 BST
got them working with a 128-bit ROTL, there's a way i know
to do it without but i can't quite work it out.  at least
it's functional.
 
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=eea94034582ed7686a6150062e321b46b87c3e1b

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f21545d3f4bbb6fd27a42790afaa490540e8ed9a
Comment 18 Jacob Lifshay 2022-10-27 17:17:06 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> got them working with a 128-bit ROTL, there's a way i know
> to do it without but i can't quite work it out.  at least
> it's functional.

imho it's not quite right yet, since we have the carrying version, the bigint tests need to be:
# don't use r1, it's the stack pointer
sv.dsld *16, *16, 3, 4  # just like sv.maddedu, but shifting
sv.dsrd/mrr *16, *16, 3, 4  # just like sv.divmod2du, but shifting

also try signed right shift, where r4 is initialized to repl(msb, 64):
sradi 4, 18, 63
sv.dsrd/mrr *16, *16, 3, 4

afaict RS needs to be the lsb bits and RT the msb bits for dsrd, because if you think about it, RS is the carry -- the bits shifted out of the result, those bits are shifted right so are at the lsb end.
Comment 19 Luke Kenneth Casson Leighton 2022-10-27 23:41:15 BST
(In reply to Jacob Lifshay from comment #18)

> imho it's not quite right yet, since we have the carrying version, the
> bigint tests need to be:
> sv.dsld *16, *16, 3, 4  # just like sv.maddedu, but shifting
> sv.dsrd/mrr *16, *16, 3, 4  # just like sv.divmod2du, but shifting

with even-numbered RB, yes

> also try signed right shift, where r4 is initialized to repl(msb, 64):
> sradi 4, 18, 63
> sv.dsrd/mrr *16, *16, 3, 4
> 
> afaict RS needs to be the lsb bits and RT the msb bits for dsrd, because if
> you think about it, RS is the carry -- the bits shifted out of the result,
> those bits are shifted right so are at the lsb end.

yep. i'd cut/paste 64-n instead of 128-n which had the effect of
unintentionally swapping RS and RT as the carry-part and result-part...
sort-of.

--- a/openpower/isa/svfixedarith.mdwn
+++ b/openpower/isa/svfixedarith.mdwn
@@ -75,7 +75,7 @@ VA2-Form
 Pseudo-code:
 
     n <- (RB)[58:63]
-    v <- ROTL128((RA) || [0]*64, 64-n)
+    v <- ROTL128((RA) || [0]*64, 128-n)
     mask <- ¬MASK(n, 63)
     RT <- v[0:63] | ((RC) & mask)
     RS <- v[64:127]

> afaict RS needs to be the lsb bits and RT the msb bits for dsrd,

likely fixed by above, can you confirm?
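The corrected pseudocode can be modelled in Python to confirm the RS/RT roles (a sketch; the MSB0 bit-fields are translated to plain ints, and vec_shr is a hypothetical stand-in for the sv.dsrd/mrr loop, running from the most-significant limb downward):

```python
def dsrd(RA, RB, RC):
    n = RB % 64                         # (RB)[58:63]
    v = ((RA % 2 ** 64) << 64) >> n     # ROTL128((RA) || [0]*64, 128-n)
    mask = ((1 << n) - 1) << (64 - n)   # ¬MASK(n, 63): top n bits
    RT = (v >> 64) | ((RC % 2 ** 64) & mask)  # v[0:63] | carry-in
    RS = v % 2 ** 64                    # v[64:127]: carry-out at MSB end
    return RT, RS

def vec_shr(limbs, sh):
    out, carry = [0] * len(limbs), 0
    for i in reversed(range(len(limbs))):  # MSB limb first (mrr)
        out[i], carry = dsrd(limbs[i], sh, carry)
    return out, carry

limbs = [0x0123456789ABCDEF, 0xFEDCBA9876543210]
out, carry = vec_shr(limbs, 8)
x = limbs[0] | (limbs[1] << 64)
assert out[0] | (out[1] << 64) == x >> 8
assert carry == (limbs[0] & 0xFF) << 56  # shifted-out bits, at the lsb end's top
```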
Comment 20 Jacob Lifshay 2022-10-28 01:11:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Jacob Lifshay from comment #18)
> 
> > imho it's not quite right yet, since we have the carrying version, the
> > bigint tests need to be:
> > sv.dsld *16, *16, 3, 4  # just like sv.maddedu, but shifting
> > sv.dsrd/mrr *16, *16, 3, 4  # just like sv.divmod2du, but shifting
> 
> with even-numbered RB, yes

no, RB and RC are scalar, they can be odd numbered.

> > afaict RS needs to be the lsb bits and RT the msb bits for dsrd,
> 
> likely fixed by above, can you confirm?

i meant that needed to be changed in the unit tests. I fixed the unit tests, and replaced scalar RB, RC regs with r3, r5:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=7d58514beb36c313ccf13a0f14686bd68738f40d
(sorry, i just realized i forgot to split out formatting code as a separate commit)
Comment 21 Jacob Lifshay 2022-10-28 01:57:40 BST
it turns out that carrying-4-arg dsld/dsrd don't require extra2-3-2-2 format. i think the experiment is successful, so imho should be merged into master. this will resolve my concern in the top comment about dsld/dsrd being difficult to use.

lkcl, what do you think?
Comment 22 Luke Kenneth Casson Leighton 2022-10-28 08:32:49 BST
(In reply to Jacob Lifshay from comment #20)

> no, RB and RC are scalar, they can be odd numbered.

doh, forgot.

> > > afaict RS needs to be the lsb bits and RT the msb bits for dsrd,
> > 
> > likely fixed by above, can you confirm?
> 
> i meant that needed to be changed in the unit tests. I fixed the unit tests,
> and replaced scalar RB, RC regs with r3, r5:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=7d58514beb36c313ccf13a0f14686bd68738f40d

nice. yep makes it clear scalar can be r0-r63.

> (sorry, i just realized i forgot to split out formatting code as a separate
> commit)

doh :)

(In reply to Jacob Lifshay from comment #21)
> it turns out that carrying-4-arg dsld/dsrd don't require extra2-3-2-2
> format.

hoo-rah.

> i think the experiment is successful, so imho should be merged into
> master. this will resolve my concern in the top comment about dsld/dsrd
> being difficult to use.

excellent, which means they can be added to ls004
 
> lkcl, what do you think?

yep, unit tests pass, and i've been rebasing regularly.
Comment 23 Jacob Lifshay 2022-10-28 08:47:07 BST
for bigint shift, (as well as other useful stuff that needs dynamic shifting), we will want the equivalent of RVV's vslideup/vslidedown -- they shift a vector by a variable number of whole elements (dsld/dsrd handles shifts within elements).

Technically, they can be done by using svindex, but that wastes a bunch of extra registers, takes extra time to set up the indices' values and is likely to be quite slow.

Additional instructions for shifts by constant amounts aren't needed since we can use svstate.offset combined with adjusting register numbers in the instruction.
Comment 24 Luke Kenneth Casson Leighton 2022-10-28 12:01:58 BST
(In reply to Jacob Lifshay from comment #23)
> for bigint shift, (as well as other useful stuff that needs dynamic
> shifting), we will want the equivalent of RVV's vslideup/vslidedown -- they
> shift a vector by a variable number of whole elements (dsld/dsrd handles
> shifts within elements).

svoffset. leave it in-place.  vslide* may be synthesised by applying
svoffset to a mv instruction.

alternatively using twin-predication the front (or back) of a predicate
can be set to zeros.
Comment 25 Jacob Lifshay 2022-10-28 19:08:06 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> (In reply to Jacob Lifshay from comment #23)
> > for bigint shift, (as well as other useful stuff that needs dynamic
> > shifting), we will want the equivalent of RVV's vslideup/vslidedown -- they
> > shift a vector by a variable number of whole elements (dsld/dsrd handles
> > shifts within elements).
> 
> svoffset. leave it in-place.  vslide* may be synthesised by applying
> svoffset to a mv instruction.

iirc svoffset isn't big enough. also iirc there is no easy way to set svoffset dynamically, you have to construct the spr state manually.
> 
> alternatively using twin-predication the front (or back) of a predicate
> can be set to zeros.

that will work, but needs several additional instructions...also we'd probably have to have the hardware special-case detect shifting masks, since fully general twin-predication is likely to be nearly as slow as svindex.
Comment 26 Jacob Lifshay 2023-09-08 02:16:20 BST
removed from CryptoRouter milestone, since there are no funds left in the top-level subtask or the parent task of this task. This could go under bug #1011 if that gets more funds.