Bug 973 - an easy way to shift registers up and down is needed
Summary: an easy way to shift registers up and down is needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 952
Reported: 2022-10-28 23:43 BST by Jacob Lifshay
Modified: 2023-10-09 09:41 BST (History)
CC: 3 users

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2022-10-28 23:43:38 BST
option 1:
expand svoffset and maybe add instruction for dynamically loading from reg:

instruction for dynamically loading offset should have an option for setting offset = RA >> 6 or -(RA >> 6) (negate after shift), needed for bigint shifts. there should also be a plain offset = small_imm +- (RA) option.

proposed pseudo-code:
v = (RA)
if sh6:
    v >>= 6  # RA counts bits to shift, we need 64-bit words
if negate:
    v = -v  # must happen after shift
offset = v + small_imm  # small_imm could be like 5-bit signed int
if shrink_vl:
    VL -= abs(v)  # needed to avoid reading/writing outside vector
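
the pseudo-code above can be modelled directly in Python; this is a sketch of the proposed semantics, not the spec, with the field names (sh6, negate, shrink_vl, small_imm) taken from the proposal:

```python
def compute_offset(ra, small_imm, sh6=False, negate=False,
                   shrink_vl=False, vl=0):
    """Return (offset, new_vl) per option 1's proposed semantics."""
    v = ra
    if sh6:
        v >>= 6          # RA counts bits; convert to 64-bit words
    if negate:
        v = -v           # negate after the shift, as needed for
                         # bigint right shifts
    offset = v + small_imm   # small_imm: e.g. a 5-bit signed immediate
    if shrink_vl:
        vl -= abs(v)     # avoid reading/writing outside the vector
    return offset, vl
```

for example, a 130-bit shift amount gives a 2-word offset: compute_offset(130, 0, sh6=True) returns (2, vl).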


option 2:
use twin-predication with special masks -- likely slow or needs extra hardware to recognize those masks.
https://bugs.libre-soc.org/show_bug.cgi?id=973#c2
Comment 1 Luke Kenneth Casson Leighton 2022-10-29 00:19:19 BST
bugreports like this need to be in the form of software engineering
style "requirements", rather than going straight to "this is how it's
going to be done".

"a problem is identified" rather than deciding - without a discussion -
what the solution is.  particularly when it's part of the spec, we need
to *have* the discussion and evaluate all options before making a
decision, in case there are questions during review.

bigint shifting is the canonical (obvious) case where moving registers
up and down dynamically is needed.  the easiest way that's achieved
right now is with twin-predication, setting only source or dest mask.
source mask with zeros at the front would result in "registers shifted
right" (down).  destination mask with zeros at the front is "registers
shifted left" (up).  an instruction which makes it easy to create
masks from integers is therefore a strong candidate (one that was
already under consideration in bitmanip, and this is a good use-case
for it)
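
the src/dest mask behaviour described above can be sketched with a toy model of twin-predication (an assumption: simplified semantics, destination pre-zeroed, no element-width games):

```python
def twin_pred_copy(src, srcmask, dstmask):
    """Copy each element of src whose source-mask bit is set to the
    next destination element whose dest-mask bit is set.  Destination
    starts zeroed (simplification)."""
    dst = [0] * len(src)
    si = di = 0
    while si < len(src) and di < len(dst):
        if not (srcmask >> si) & 1:
            si += 1          # skip masked-out source element
            continue
        if not (dstmask >> di) & 1:
            di += 1          # skip masked-out destination element
            continue
        dst[di] = src[si]
        si += 1
        di += 1
    return dst

vec = [10, 20, 30, 40]
# source mask with two zeros at the front: registers shifted down by 2
shifted_down = twin_pred_copy(vec, 0b1100, 0b1111)   # [30, 40, 0, 0]
# dest mask with two zeros at the front: registers shifted up by 2
shifted_up = twin_pred_copy(vec, 0b1111, 0b1100)     # [0, 0, 10, 20]
```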
Comment 2 Jacob Lifshay 2022-10-29 00:51:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
>  the easiest way that's achieved
> right now is with twin-predication, setting only source or dest mask.

The problem with using twin-predication is that we would need special hardware to recognize those special masks and optimize them to use slideup/down hardware (where elements can be grouped into large chunks), rather than general permutation hardware (where each element is processed independently -- likely just as slow as svindex).

> source mask with zeros at the front would result in "registers shifted
> right" (down).  destination mask with zeros at the front is "registers
> shifted left" (up).  an instruction which makes it easy to create
> masks from integers is therefore a strong candidate (one that was
> already under consideration in bitmanip, and this is a good use-case
> for it)

well, we already have instructions that would work relatively well for setting masks:
# r5 is shift amount in bits, r32..32+VL is little-endian bigint to shift
srdi r4, r5, 6  # convert from bits to 64-bit words
# bitmanip op would only replace li/sld pair, rest of insns are still needed
li r3, -1
sld r3, r3, r4
# initialize carry in to sign bit, use li r6, 0 for unsigned shift
sv.sradi r6, r32, 63
sv.dsrd/mrr *r16, *r32, r5, r6
# copy back to r32..32+VL shifting right by r4 words
sv.addi/srcmask=r3 *r32, *r16, 0
# final result is in r32..32+VL-r4

for shift left, just set destmask=r3 instead of srcmask=r3. the copy would need to be preceded by clearing r32..
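as a reference model for what the sequence above computes overall -- an arithmetic right shift of a bigint held as a little-endian array of 64-bit words (an assumption: the word-move is done by the masked copy, the sub-word shift by sv.dsrd) -- a Python check:

```python
WORD = 64

def bigint_sra(words, sh):
    """Arithmetic right shift of a little-endian 64-bit word array."""
    n = len(words)
    # reassemble the words as one signed Python integer
    val = sum(w << (WORD * i) for i, w in enumerate(words))
    if words[-1] >> (WORD - 1):          # top word's sign bit set
        val -= 1 << (WORD * n)
    val >>= sh                           # Python >> is arithmetic
    return [(val >> (WORD * i)) & ((1 << WORD) - 1) for i in range(n)]
```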
Comment 3 Luke Kenneth Casson Leighton 2022-10-29 01:42:10 BST
(In reply to Jacob Lifshay from comment #2)

> The problem with using twin-predication, is that we would need special
> hardware to recognize those special masks and to optimize it to use
> slideup/down hardware (where elements can be grouped into large chunks),

... and?

that is neither a problem nor is it "our" problem.
it is exactly the kind of thing that goes into
"Architectural Notes", advising implementors to do
precisely and exactly that, should they want high
performance.

failure to provide optimal hardware is not a good
reason to either limit or add further complexity to
an ISA.

SIMD is at the total opposite spectrum of Scalable
Vector ISAs: the entire premise of Scalable Vectors
is precisely that hardware is more intelligent,
where SIMD imposes internal microarchitecture on
the programmer.

but we cannot keep adding more complexity to the REMAP
side of the ISA, it is already causing me some concern.

> rather than general permutation hardware (where each element is processed
> independently -- likely just as slow as svindex).

cyclic buffer.  discussed and described many times at least
18 months ago. introduces latency but is far lower cost.
svindex is not "automatically and inherently slow",
i have no idea why you preserve that false notion in
your mind.  it requires extra reads (cached) on the GPR file,
that does not *automatically* make it "slow" under all
possible circumstances and all possible implementations.

> li r3, -1
> sld r3, r3, r4

hmm hmm i thought it would be more complex than that.
the plan/thought was to have general-purpose (RA, RB) direct
access to MASK(), able to specify the start as well
as end.  above is equivalent (i think?) to MASK(0, 63-r4)
is that right?
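a quick check of the equivalence asked about above (modelling MASK() with the Power ISA convention that bit 0 is the most significant bit) suggests the answer is yes:

```python
M64 = (1 << 64) - 1

def mask_be(mb, me):
    """Ones from big-endian bit mb through me inclusive (mb <= me)."""
    width = me - mb + 1
    return (((1 << width) - 1) << (63 - me)) & M64

def li_sld(r4):
    """Model of: li r3, -1 ; sld r3, r3, r4 (shift amounts 0..63)."""
    return (M64 << r4) & M64

# li/sld produces ones in LE bits 63..r4, i.e. BE bits 0..63-r4
assert all(li_sld(r4) == mask_be(0, 63 - r4) for r4 in range(64))
```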
Comment 4 Jacob Lifshay 2023-10-09 06:26:18 BST
turns out bigint divmod using Knuth's Algorithm D most likely needs dynamic svoffset too, since the outer loop changes the offset at which the inner loop reads and writes. see bug #1044 comment #35

simplified loop is like:

for j in ...:
    qhat = divrem(un[j], vn[n - 1])
    qhat = fixup(qhat, un[j - 1], vn[n - 2])
    for i in ...:
        un[i + j] -= vn[i] * qhat  # plus some carry stuff
    if borrowed:
        qhat -= 1
        for i in ...:
            un[i + j] += vn[i]  # plus some carry stuff
    q[j] = qhat
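
the inner multiply-subtract step ("un[i + j] -= vn[i] * qhat plus some carry stuff") can be sketched as follows; mul_sub_at is a hypothetical helper name, operating on little-endian arrays of 64-bit words at the dynamic offset j:

```python
WORD_MASK = (1 << 64) - 1

def mul_sub_at(un, vn, qhat, j):
    """un[j : j+len(vn)+1] -= qhat * vn, in-place; return final borrow."""
    borrow = 0
    carry = 0
    for i, v in enumerate(vn):
        prod = qhat * v + carry          # 64x64 -> 128-bit product
        carry = prod >> 64               # high word carries forward
        diff = un[i + j] - (prod & WORD_MASK) - borrow
        borrow = 1 if diff < 0 else 0
        un[i + j] = diff & WORD_MASK
    diff = un[len(vn) + j] - carry - borrow
    un[len(vn) + j] = diff & WORD_MASK
    return 1 if diff < 0 else 0          # borrowed: qhat was too big
```

note that the un accesses are at i + j while the vn accesses are at i, which is exactly the mixed-offset pattern that motivates dynamic svoffset.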
Comment 5 Luke Kenneth Casson Leighton 2023-10-09 07:57:35 BST
(In reply to Jacob Lifshay from comment #4)
> turns out bigint divmod using knuth's algorithm d most likely needs dynamic
> svoffset too, 

use 1<<r3 predicate mask with a vector reg
or in Vertical-First use a scalar reg sv.xxxx.

those are the two main ways to do dynamic single-reg
targetting.

l.
Comment 6 Jacob Lifshay 2023-10-09 09:41:36 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> > turns out bigint divmod using knuth's algorithm d most likely needs dynamic
> > svoffset too, 
> 
> use 1<<r3 predicate mask with a vector reg
> or in Vertical-First use a scalar reg sv.xxxx.
> 
> those are the two main ways to do dynamic single-reg
> targetting.

that's fine for scalar, but bigint divmod most likely needs something like the following:

for j in ...:
    # more stuff in j loop
    # j is dynamic and would be nice to use svoffset
    for i in range(VL):
        prod[i] = maddedu(vn[i], qhat, ...)
    for i in range(VL):
        # different inputs use/don't-use j, so even
        # twin-predication can't do this:
        un[i + j] = subfe(prod[i], un[i + j])
    # more stuff in j loop