several different types of Reduce Modes need to be defined and written up:

* Scalar
* Tree-reduction
* Sub-Vector Horizontal
see https://bugs.libre-soc.org/show_bug.cgi?id=213#c111
depending on how it goes, we may want to reverse the order of the tree-reduction to match what rust might define as the canonical tree-reduction order -- it seems to work better on arm. https://github.com/rust-lang/portable-simd/issues/235#issuecomment-1027331064
there is already a "reverse gear" bit in SVP64
(In reply to Luke Kenneth Casson Leighton from comment #3)
> there is already a "reverse gear" bit in SVP64

i don't think that does what we might need here... reversed means it
reduces like so (@ are adds; this one is more efficient on arm):

    0    1    2    3    4    5    6    7
    |    |    |    |    |    |    |    |
    @----|----|----|----/    |    |    |
    |    @----|----|---------/    |    |   distance=4
    |    |    @----|--------------/    |
    |    |    |    @-------------------/
    |    |    |    |
    @----|----/    |                       distance=2
    |    @---------/
    |    |
    @----/                                 distance=1
    |

the original (current SimpleV) reduce goes distance=1,2,4...

    0    1    2    3    4    5    6    7
    |    |    |    |    |    |    |    |
    @----/    @----/    @----/    @----/   distance=1
    |    .    |    .    |    .    |
    @---------/    .    @---------/        distance=2
    |    .    .    .    |
    @-------------------/                  distance=4
    |
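[editor's note: a small Python sketch, illustrative only, of the two orderings
in the diagrams above: both combine the same elements but group them
differently, which is exactly what matters for floating-point accuracy and
reproducibility.]

    def reduce_simplev(vec):
        # original SimpleV order: distance = 1, 2, 4, ...
        v = list(vec)
        d = 1
        while d < len(v):
            for i in range(0, len(v), 2 * d):
                if i + d < len(v):
                    v[i] = v[i] + v[i + d]   # the '@' in the diagrams
            d *= 2
        return v[0]

    def reduce_halving(vec):
        # arm-style order: distance = vl/2, vl/4, ..., 1
        v = list(vec)
        d = len(v) // 2
        while d >= 1:
            for i in range(d):
                v[i] = v[i] + v[i + d]
            d //= 2
        return v[0]

    vec = [1, 2, 3, 4, 5, 6, 7, 8]
    print(reduce_simplev(vec))   # 36, grouped ((1+2)+(3+4))+((5+6)+(7+8))
    print(reduce_halving(vec))   # 36, grouped ((1+5)+(3+7))+((2+6)+(4+8))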
additional conversation here: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice/near/270315967
public archive: https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#270315967
me on github:

> turns out Arm's faddv instruction which does reduction uses a pattern
> like the current SimpleV and what I originally proposed in this thread
> (distance=1,2,4 from diagram) -- except that it only matches when the
> vector length is a power of 2.
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > there is already a "reverse gear" bit in SVP64
>
> i don't think that does what we might need here...

the bit is available so can be made to mean anything that is necessary
or needed. https://libre-soc.org/openpower/sv/normal/

    0-1  2  3   4   description
    00   0  dz  sz  normal mode
    00   1  0   RG  scalar reduce mode (mapreduce), SUBVL=1
    00   1  1   /   parallel reduce mode (mapreduce), SUBVL=1

so there's one bit spare which can still be made available for a
parallel-reduce reverse-gear mode.

> reversed means it reduces like so (@ are adds;
> this one is more efficient on arm):

the scalar-reduce is a misnomer: it basically - note the careful/clumsy
negation - "prevents the vector-loop from stopping just because the
result is a scalar".

normally when a result is a scalar, the vector-loop terminates at the
very first element (because it's a scalar, duh). however in
scalar-reduce mode the looping does *not* stop, and so you can do this:

    ADD r3.s, r3.s, r10.v

and that will do:

    for i in range(VL):
        iregs[3+i] = iregs[3+i] + iregs[10+i]

the scalar-reduce bit tells the hardware-loop *NOT* to stop just because
there has been one result dropped into r3 already. something like this
would cover both non-reduce and reduce:

    for i in range(VL):
        iregs[3+i] = iregs[3+i] + iregs[10+i]
        if not scalar_reduce_mode:
            if RT is scalar:
                break

the other trick is partial sum: by specifying the same register offset
by one in src and dest you get the *effect* of a reduction:

    ADD r4.v, r3.v, r10.v

    for i in range(VL):
        iregs[4+i] = iregs[3+i] + iregs[10+i]

thus you end up with *partial results* (partial sum) and, oh look, the
last element happens to be the full reduced sum.

reverse-gear is *needed* here because you might want the result to be at
the start:

    for i in reversed(range(VL)):
        iregs[4+i] = iregs[3+i] + iregs[10+i]

also you might want the reverse-gear even when the result is a scalar
because, by starting at the opposite end of the results, FMACs (or
FADDs) may end up producing different results depending on the accuracy
*and* order of the numbers being operated on.

note that the pseudocode above says how things must *look* as far as
*programmers* are concerned. it does *NOT* dictate how the hardware is
actually implemented, just that it must have the same net effect *as*
the pseudocode. if hardware implementors discover that the above
pseudo-code algorithms can be implemented as parallel tree-reduce
*great* [but what they can't do is implement a hardware algorithm that,
under certain circumstances, produces a different result from that which
would be produced by the pseudo-code]

so that's scalar-reduce. moving on to parallel-reduce... here's the
algorithm (bottom of the page):
https://libre-soc.org/openpower/sv/svp64/appendix/

with that bit spare (bit 4) it can be allocated for
anything-that-is-needed: reverse-gear, reverse-mapping, or
banana-fishing. something useful preferably.

my concern with that algorithm is that it critically relies on creating
state (the vi nonpredicated array) where there's actually no space to
put that state if there is an interrupt to be serviced in the middle of
the reduction operation (all SVP64 operations are currently designed and
required to be re-entrant). with vi being indices, that's a hell of a
lot of state.

HOWEVER i *believe* it may be possible to shoe-horn this algorithm into
the SVREMAP system.
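[editor's note: a minimal executable sketch of the two tricks described above,
modelling the register file as a plain Python list. VL and the register
numbers are illustrative only; the source vector is placed at r20 purely to
avoid overlapping the partial-sum destination.]

    VL = 8
    iregs = [0] * 32
    iregs[20:20 + VL] = [1, 2, 3, 4, 5, 6, 7, 8]    # r20.v example source

    # trick 1: scalar accumulator - sv.add/mr r3.s, r3.s, r20.v
    # r3 is scalar so it does not step; /mr keeps the loop going anyway
    iregs[3] = 0
    for i in range(VL):
        iregs[3] = iregs[3] + iregs[20 + i]
    print(iregs[3])            # 36, the full reduction

    # trick 2: partial sums - sv.add r4.v, r3.v, r20.v
    # dest is offset by one from src1, so each element picks up the
    # previous partial result; the last element is the full sum
    iregs[3] = 0
    for i in range(VL):
        iregs[4 + i] = iregs[3 + i] + iregs[20 + i]
    print(iregs[4:4 + VL])     # [1, 3, 6, 10, 15, 21, 28, 36]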
jacob could you please try morphing the algorithm into something that
looks pretty much exactly like this:

    def reduction_yielder_for_RA():
        for something in something:
            yield offset_idx_for_RA

    def reduction_yielder_for_RB():
        for something in something:
            yield offset_idx_for_RB

    def reduction_yielder_for_RT():
        for something in something:
            yield offset_idx_for_RT

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()
        RB_offs = yield reduction_yielder_for_RB()
        RT_offs = yield reduction_yielder_for_RT()
        regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]

note that that is quite literally the standard SVP64 vector loop, but oh
look! there happen to be some offsets added, how did those get there? :)

the separation of the offsets from the computation itself will fit
directly into the SVREMAP system, and allow the front-end multi-issue
engine to run (independently) creating multiple sequences of offsets in
one clock cycle, to hit the back-end parallel SIMD ALUs with as many
operations as they can handle.

SVREMAP *can* store state (and is saved on an interrupt context-switch)
so you *can* do analysis of the predicate bits and
do-something-else-because-of-them().

but please please note jacob that what we *cannot* have is *two*
different operations (unless one of them can be "faked-up" by one of the
operands being zero or one). what we *cannot* have is:

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()
        RB_offs = yield reduction_yielder_for_RB()
        RT_offs = yield reduction_yielder_for_RT()
        if (some_condition)
            regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]
        else
            regs[RT+RT_offs] = regs[RA+RA_offs]

the *only* way that could conceivably be done - and even this is a bit
iffy:

    if (some_condition)
        src2 = regs[RB+RB_offs]
    else
        src2 = 0
    regs[RT+RT_offs] = regs[RA+RA_offs] + src2

i.e. use zeroing (which would have to be assumed/implied, because
there's just no room in the SV RM Mode table for another bit). it would
still be iffy for multiply-reduce because the definition of zeroing is,
well, to put a zero in, not a one.
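[editor's note: a runnable sketch of the "offsets separated from the
computation" pattern. it collapses the three per-operand yielders into one
generator returning (RT, RA, RB) element offsets, and uses the simple
distance=1,2,4 tree schedule as a stand-in for whatever REMAP would actually
generate; the function names are hypothetical.]

    def tree_schedule(vl):
        """yield (rt_offs, ra_offs, rb_offs) for a power-of-two vl,
        using the simple distance=1,2,4 schedule."""
        step = 1
        while step < vl:
            for i in range(0, vl, 2 * step):
                if i + step < vl:
                    yield (i, i, i + step)
            step *= 2

    def sv_add_with_remap(regs, RT, RA, RB, vl):
        # the standard element loop: only the offsets differ per step
        for rt_offs, ra_offs, rb_offs in tree_schedule(vl):
            regs[RT + rt_offs] = regs[RA + ra_offs] + regs[RB + rb_offs]

    regs = [0] * 32
    regs[8:16] = [1, 2, 3, 4, 5, 6, 7, 8]
    sv_add_with_remap(regs, RT=8, RA=8, RB=8, vl=8)
    print(regs[8])    # 36: the reduction result lands in element 0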
(In reply to Jacob Lifshay from comment #5)
> additional conversation here:
> https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice/near/270315967

please do post relevant parts of that into here, or come up with a way
for the discussion there to be automatically archived under Libre-SOC
resources, in order to satisfy the Transparency obligations we're under.
likewise anything on "github" or other external resource not under our
control [or responsibility]
    def reduce(vl, vec, pred):
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    vec[i] += vec[other]
                pred[i] |= other_pred
            step *= 2

would turn into something like this:

    def i_yielder(....)
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    yield i
                pred[i] |= other_pred
            step *= 2

    def other_yielder(....)
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    yield other
                pred[i] |= other_pred
            step *= 2

and now that fits the SVREMAP pattern:

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()    # other
        RB_offs = yield reduction_yielder_for_RB()    # i
        RT_offs = yield reduction_yielder_for_RT()    # also i
        regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]

except for the bit about pred[i] |= other_pred which i will leave you to
work out.

bear in mind that it's fine to be [much] slower restarting in the middle
of a context-switch. so if the pred[i] has to be stored in throw-away
state which is re-computed, that's perfectly fine. [where overloading
some arbitrary register or adding yet another SPR to store it is not
fine]

(matrix-multiply REMAP has to take a long time to work out where it was
interrupted, for non-power-two, because you can't exactly do a cascading
set of MOD/DIV operations in one clock cycle)

also bear in mind that in Rc=1 mode if the predicate being relied on by
the algorithm is overwritten by Rc=1 then you may end up with corrupted
results (even without an interrupt in the middle)
the algorithm we'd want to use is based on the algorithm without
indirection through the vi table:

    def reduce(vl, vec, pred):
        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    vec[i] += vec[other];
                else if other_pred
                    vec[i] = vec[other];
                pred[i] |= other_pred;
        # scalar result is now in vec[0]

idk why the version with vi was added to the wiki, but it destroys most
of the benefits of the above version. the above version specifically
reduces in a tree pattern where changes in vl/pred don't affect where in
the tree add/copy are performed, allowing a cpu to efficiently implement
tree reduction without needing a full crossbar on the alu inputs or
outputs. when some part of pred is 0, it changes the adds to copies, but
doesn't change the position or order of any other operations.
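[editor's note: a direct Python translation of the rust-flavoured pseudocode
above, just to show what it computes for a masked vector; illustrative only,
not a normative SVP64 definition.]

    def reduce(vl, vec, pred):
        step = 1
        while step < vl:
            step *= 2
            for i in range(0, vl, step):
                other = i + step // 2
                other_pred = other < vl and pred[other]
                if pred[i] and other_pred:
                    vec[i] += vec[other]     # both live: add
                elif other_pred:
                    vec[i] = vec[other]      # only 'other' live: copy
                pred[i] = pred[i] or other_pred
        # scalar result is now in vec[0]

    vec = [1, 2, 3, 4, 5, 6, 7, 8]
    pred = [0, 1, 0, 1, 1, 0, 1, 1]
    reduce(8, vec, pred)
    print(vec[0])    # 26 == 2+4+5+7+8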
(In reply to Jacob Lifshay from comment #11)
> the algorithm we'd want to use is based on the algorithm without
> indirection through the vi table:
>
> def reduce(vl, vec, pred):
>     step = 1;
>     while step < vl
>         step *= 2;
>         for i in (0..vl).step_by(step)
>             other = i + step / 2;
>             other_pred = other < vl && pred[other];
>             if pred[i] && other_pred
>                 vec[i] += vec[other];
>             else if other_pred
>                 vec[i] = vec[other];
>             pred[i] |= other_pred;
>     # scalar result is now in vec[0]
>
> idk why the version with vi was added to the wiki,

because it is not possible to have two completely separate types of
operations (one a MV, the other arithmetic), and because internal state
is not permitted (or practical) except that which will fit into SVREMAP
SPRs.

if it is wrong, and you want this in SVP64, it is your responsibility to
work through the full implementation details.

> but it destroys most of
> the benefits of the above version. the above version specifically reduces in
> a tree pattern where changes in vl/pred don't affect where in the tree
> add/copy are performed, allowing a cpu to efficiently implement tree
> reduction without needing a full crossbar on the alu inputs or outputs. when
> some part of pred is 0, it changes the adds to copies, but doesn't change
> the position or order of any other operations.

"changing adds to MVs" is not permitted. changing the input parameter on
an add so that it is *equivalent* to a MV is fine [edit: actually it
isn't]. however, imposing on the decoder to drop 1 (or 1.0) into the
appropriate operation, to make any given operation a MV, is pretty
dicey, especially when it comes to reduce on CR operations. therefore,
on balance, it's not happening (too complex).

an algorithm which tracks where the operand is and therefore *does not
need the MV at all* would on the other hand be perfectly fine, as long
as, again, it fits into SVREMAP. (yes, really, the DCT REMAP does some
incredibly strange-looking non-linear index remapping, don't ask me to
explain it here, it took me 6+ weeks to work out)

so, please can you rewrite the algorithm in a form that uses yield
iterators and does not require a MV. yes this means having a special
index array that allows de-referencing of non-masked elements, and i
suspect that will be perfectly fine.

will post separately with the basic principle
    def reduce(vl, vec, pred):
        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    vec[i] += vec[other];
                else if other_pred
                    vec[i] = vec[other];
                pred[i] |= other_pred;

transform ==>

    def reduce(vl, vec, pred):
        step = 1;
        indices = []
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    indices[something] = other
                else if other_pred
                    indices[something] = other
                pred[i] |= other_pred;

        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                vec[i] += vec[indices[somethingother]];

and that to then be further morphed to a REMAP yield form
followup: now i think about it, index-redirection will be needed on
*both* i *and* other, such that

    reg[redirect1[i]] += reg[redirect2[other]]

and then, obviously, the yield on RT to return redirect1[i] NOT i, and
the yield on RA to return redirect2[other] NOT other.

the first stage of the remaps will therefore compute the tables
redirect1 and redirect2 such that the MV operation is entirely
redundant.

one critically important reason for this is Vertical-First Mode [yes,
parallel reduce will work in Vertical-First Mode!!!]. if there is a MV
operation substitution then any given ADD (or whatever) will SILENTLY be
replaced with a MV and the programmer will have absolutely no idea why.

yes, really, parallel-reduce in VF mode. i can already see some useful
things there, involving multiple operations tracking the exact same
reduction schedule: doing a divide on the src operand or computing the
inputs to the schedule on-the-fly, or even (this would be horrific)
exiting the reduction loop early depending on partial results, or
"fixing" the values if they happen to overflow. lots of possibilities,
and they are all actively interfered with if a MV is involved
    def reduce_yield(vl, vec, pred):
        step = 1;
        redirects = list(range(vl))
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                ri = redirects[i]
                ro = redirects[other]
                other_pred = other < vl && pred[ro];
                if pred[ri] && other_pred
                    yield (ri, ro)
                else if other_pred
                    redirects[ri] = redirects[ro]
                pred[ri] |= other_pred;

i think you'll find that the above eliminates the need for the MV.

it works by having an array of indices which initially start off 0..VL,
and whenever you would like to *have had* a mv, instead of performing an
actual mv, the data is left *in place* but the "redirect" array is
adjusted dynamically such that a subsequent ADD will take the data from
the in-place location.

there are probably some issues to work through (what if no ADD
operations actually take place, what circumstances cause that, and is it
really a problem)

required state for reentrancy: the predicate and the redirects array.
the predicate can be re-read, no problem, as long as Rc=1 hasn't damaged
it. the redirects array can be either cached or it can be restored
(slowly) by re-running up to SVSTATE.srcstep-1 iterations WITHOUT doing
any ADDs, at which point redirects should be back to where it should be.
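[editor's note: a runnable sketch of the redirect idea. the per-slot liveness
array and the choice to redirect slot i (rather than element ri) are
adjustments made here to get it executable, so treat it as one possible
reading rather than the definitive algorithm. note how, with element 0 masked
out, the result lands in element 1, illustrating the result-location caveat
discussed below.]

    def reduce_yield(vl, pred):
        # yield (dst, src) element pairs for an in-place tree-reduction
        # with no MV: masked-out slots are handled by redirecting
        # indices instead of copying data
        redirects = list(range(vl))   # which element each slot points at
        live = list(pred)             # does slot i currently hold a value?
        step = 1
        while step < vl:
            step *= 2
            for i in range(0, vl, step):
                other = i + step // 2
                if other >= vl or not live[other]:
                    continue                      # nothing to fold in
                if live[i]:
                    yield (redirects[i], redirects[other])   # real add
                else:
                    redirects[i] = redirects[other]          # redirect
                    live[i] = True

    vec  = [1, 2, 3, 4, 5, 6, 7, 8]
    pred = [0, 1, 0, 1, 1, 0, 1, 1]
    for dst, src in reduce_yield(8, pred):
        vec[dst] += vec[src]
    print(vec[1])    # 26 == 2+4+5+7+8: with element 0 masked out the
                     # result lands in element 1, not element 0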
(for people from the portable-simd link, firstly hello! secondly, here is a link explaining Vertical-First Mode https://m.youtube.com/watch?v=fn2KJvWyBKg)
(In reply to Luke Kenneth Casson Leighton from comment #15)
> it works by having an array of indices which initially start off 0..VL
> and whenever you would like to *have had* a mv, instead of performing
> an actual mv, the data is left *in place* but the "redirect" array
> adjusted dynamically such that a subsequent ADD will take the data from
> the in-place location

this btw was exactly how i worked out the in-place DCT algorithm (which
is a world-first achievement, to have in-place DCT, i.e. without having
to use double the number of vector registers).

in that case it turned out that the algorithm used for the index
remapping was incredibly simple and elegant (idx XOR (idx>>1)), but to
get to that stage took a lot of incremental steps (6 weeks worth), and
one of those steps was to rewrite the algorithm as a "yield"er.

i have a sneaking suspicion that even in the case where there is only
one ADD needed, because all but 2 elements are predicated out, the
remapping will still work, because the remapping will identify
(eventually) the rightmost element.

the only case i think not covered would be when *all* elements except
*one* are predicated out. this would be expected to perform a straight
MV, copying the element into element 0, and that will not happen. it can
be "fixed" by adding one extra element (a zero).

one unit test that will definitely be needed and quite interesting is
when the predicated elements are all in the top half (covered by
"other").
(In reply to Jacob Lifshay from comment #4)
> 0    1    2    3    4    5    6    7
> |    |    |    |    |    |    |    |
> @----|----|----|----/    |    |    |
> |    @----|----|---------/    |    |   distance=4
> |    |    @----|--------------/    |
> |    |    |    @-------------------/
> |    |    |    |
> @----|----/    |                       distance=2
> |    @---------/
> |    |
> @----/                                 distance=1
> |
>
> the original (current SimpleV) reduce goes distance=1,2,4...
>
> 0    1    2    3    4    5    6    7
> |    |    |    |    |    |    |    |
> @----/    @----/    @----/    @----/   distance=1
> |    .    |    .    |    .    |
> @---------/    .    @---------/        distance=2
> |    .    .    .    |
> @-------------------/                  distance=4
> |

once converted to REMAP Form there are plenty of bits available that
will not only allow the step to go 1 2 4 8 instead of 8 4 2 1 but also
to invert the order of the inner loop as well, such that the reduction
result ends up in the *last* element not the first. (again this is
already done in DCT/FFT REMAP, for iDCT and iFFT)

the current SVP64 RM Mode fields can then become a shortcut for setting
up the most popular reduction REMAPs
I think what happened is the discussion on Zulip mostly stalled without us deciding on anything other than something like: fast reproducible reductions would be nice, serial reductions are terrible
the crucial thing that's needed, to make this algorithm work, is that it
mustn't include or depend on a "MV" instruction (even as a micro-coded
op).

the trick that i did in FFT and DCT to avoid the need for MVs was to
have an array-of-indices where the *indices* were altered (swapped
around) if you needed to perform a MV. the actual "MV" did *not*
actually occur but instead, because the index had been altered, the next
ADD/MUL/WHATEVER would use the [index-redirected] element as one of its
sources.

it will be... hair-raising but not insurmountable.

the only "problem" being that if there is only one element (somewhere
half-way down the array), then no ADD/MUL/WHATEVER gets applied, and
consequently the result is effectively invalid because no data actually
got "reduced". if a MV was used then obviously that one element would be
MVed down into the lowest-element spot in the result.
imho moving is absolutely necessary if you want to have tree-reductions
run quickly without needing a full-crossbar somewhere on the ALU output
-> other ALU input path.

here, i define running quickly to mean not needing to delay extra clock
cycles because you're running register values through a separate
slow-but-general lane-crossing mechanism.

basically, a tree-reduction only needs to move data in a few set
inter-lane paths, but if you skip moving, then you need to be able to
read from any high-index element when combining with lower index
elements. e.g. if you have registers organized so every 8th element is
matched with each ALU, then tree-reduction with moving allows getting
away with only having the fast inter-lane paths for a few paths:

needed paths with moving (assuming reducing into r0):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--|--|--|--|--|--|--|
    r0.el1/r4.el1/r8.el1/...--X--X--|--|--|--|--|--|
    r1.el0/r5.el0/r9.el0/...--X--|--X--|--|--|--|--|
    r1.el1/r5.el1/r9.el1/...--|--|--X--X--|--|--|--|
    r2.el0/r6.el0/r10.el0/...-X--|--|--|--X--|--|--|
    r2.el1/r6.el1/r10.el1/...-|--|--|--|--X--X--|--|
    r3.el0/r7.el0/r11.el0/...-|--|--|--|--X--|--X--|
    r3.el1/r7.el1/r11.el1/...-|--|--|--|--|--|--X--X

needed paths with moving (reducing into any reg):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--|--|--|--X--|--X--|
    r0.el1/r4.el1/r8.el1/...--X--X--|--|--|--|--|--|
    r1.el0/r5.el0/r9.el0/...--X--|--X--|--|--|--X--|
    r1.el1/r5.el1/r9.el1/...--|--|--X--X--|--|--|--|
    r2.el0/r6.el0/r10.el0/...-X--|--X--|--X--|--|--|
    r2.el1/r6.el1/r10.el1/...-|--|--|--|--X--X--|--|
    r3.el0/r7.el0/r11.el0/...-|--|--X--|--X--|--X--|
    r3.el1/r7.el1/r11.el1/...-|--|--|--|--|--|--X--X

needed paths without moving and remapping instead (assuming reducing
into r0):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--X--X--X--X--X--X--X
    r0.el1/r4.el1/r8.el1/...--X--X--X--X--X--X--X--X
    r1.el0/r5.el0/r9.el0/...--X--X--X--X--X--X--X--X
    r1.el1/r5.el1/r9.el1/...--X--X--X--X--X--X--X--X
    r2.el0/r6.el0/r10.el0/...-X--X--X--X--X--X--X--X
    r2.el1/r6.el1/r10.el1/...-X--X--X--X--X--X--X--X
    r3.el0/r7.el0/r11.el0/...-X--X--X--X--X--X--X--X
    r3.el1/r7.el1/r11.el1/...-X--X--X--X--X--X--X--X

the top right triangle of the remapping crossbar is needed for
reductions with more than 8 elements, since they start needing to read
from registers r4..r7 and r8..r11 and ...
(In reply to Jacob Lifshay from comment #21)
> imho moving is absolutely necessary if you want to have tree-reductions
> run quickly without needing a full-crossbar somewhere on the ALU output
> -> other ALU input path.

not the ISA-design-level's problem and therefore invalid.

> here, i define running quickly to mean not needing to delay extra clock
> cycles because you're running register values through a separate
> slow-but-general lane-crossing mechanism.

this is an architectural designer's decision that should in absolutely
no way cause us, with "ISA Design Hats" on, to base our decisions on it.

> basically, a tree-reduction only needs to move data in a few set
> inter-lane paths, but if you skip moving, then you need to be able to
> read from any high-index element when combining with lower index
> elements.

you are conflating architectural internals with the responsibility of
ISA designers.

the architectural designer may choose to spot the patterns of ADDs (or
whatever) and insert the prerequisite MVs as micro-ops *at their
discretion*.

a FSM architecture (TestIssuer) has no problems of any kind.

another architectural designer may have 12R8W register files and
therefore have no lanes of any kind.

another designer may have cyclic shift buffers which substitute for full
crossbars.

etc etc etc etc.

none of these architectural decisions have anything *at all* to do with
ISA-level design except inasmuch that they all have to be considered
(and to some extent "implementation advice hints" given).

second, bear in mind that the schedule of ops is entirely deterministic
based on having read the predicate mask. ops may be issued (scheduled)
and analysed long before actual execution takes place.

no MVs.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #21)
> > imho moving is absolutely necessary if you want to have tree-reductions run
> > quickly without needing a full-crossbar somewhere on the ALU output -> other
> > ALU input path.
>
> not the ISA-design-level's problem and therefore invalid.
>
> > here, i define running quickly to mean not needing to delay extra clock
> > cycles because you're running register values through a separate
> > slow-but-general lane-crossing mechanism.
>
> this is an architectural designers decision that should in absolutely
> no way cause us with "ISA Design Hats" on to base our decisions on.
>
> > basically, a tree-reduction only needs to move data in a few set inter-lane
> > paths, but if you skip moving, then you need to be able to read from any
> > high-index element when combining with lower index elements.
>
> you are conflating architectural internals with the responsibility of ISA
> designers.
>
> the architectural designer may choose to spot the patterns of ADDs (or
> whatever) and insert the prerequisite MVs as micro-ops *at their discretion*
>
> a FSM architecture (TestIssuer) has no problems of any kind
>
> another architectural designer may have 12R8W register files
> and therefore have no lanes of any kind.
>
> another designer may have cyclic shift buffers which substitute for
> full crossbars.
>
> etc etc etc etc.
>
> none of these architectural decisions have anything *at all* to do with
> ISA-level design except inasmuch that they all have to be considered
> (and to some extent "implementation advice hints" given)

well, your argument for *not* having moves is based off of mostly the
same things iirc: having alus also be able to do moves is *also* only a
microarchitectural decision.

> second, bear in mind, that the schedule of ops is entirely deterministic
> based on having read the predicate mask. ops may be issued (scheduled)
> and analysed long before actual execution takes place.

that's true, it applies to the case with moves too, so isn't a good
argument to choose either moves or no-moves.

> no MVs.

having moves allows a much simpler algorithm imho, as well as having a
consistent place where the reduction result goes (element 0, rather than
whatever random spot the remap algorithm happens to leave it).

I can work out the HDL details if you like. Would creating a fsm that
creates the tree-reduction ops be sufficient for now?
(In reply to Jacob Lifshay from comment #23)
> well, your argument for *not* having moves is based off of mostly the same
> things iirc:
> having alus also be able to do moves is *also* only a microarchitectural
> decision.

ALUs never do MVs. ok, very basic ones would. usually a pseudo-op:
add rt, ra, 0

> > second, bear in mind, that the schedule of ops is entirely deterministic
> > based on having read the predicate mask. ops may be issued (scheduled)
> > and analysed long before actual execution takes place.
>
> that's true, it applies to the case with moves too, so isn't a good argument
> to choose either moves or no-moves.

when combined with Vertical First mode it will melt people's brains, and
also complicate ExtraV coherence, to have the decision in the separated
logic to suddenly do a 2-op MV rather than a 3-op or 4-op
ADD/MUL/whatever. too much.

> > no MVs.
>
> having moves allows a much simpler algorithm imho,

probably, but where's the fun in that? :) also, retrospectively, i found
that DCT REMAP turned out to use Gray Coding. that was... unexpected and
beautiful, and i would never have noticed the simplicity of it if i had
thought "let's take the simple route"

> as well as having a
> consistent place where the reduction result goes (element 0, rather than
> whatever random spot the remap algorithm happens to leave it).

i know. i worked through the caveats: only a single element (all others
predicated out) would be the one that remained invalid. even 2 elements
should / would target the correct result-indexed destination element
regardless of the positions of 2 active bits of predicate.

> I can work out the HDL details if you like. Would creating a fsm that
> creates the tree-reduction ops be sufficient for now?

given that this will end up being a type of REMAP can i recommend
following the path i did there with MATRIX and FFT DCT, which was:

* a braindead obvious python demo algorithm
* conversion to a yield generator which purely returns indices that
* get blatted on top of a scalar op then
* integrate the yield generator into ISACaller
* and then implement the HDL FSM

yes it is a lot of work, it is why MM FFT DCT took me about 8 weeks or
more. ISACaller integration will be essential so might as well do it
incrementally.

the first priority would therefore be to do the braindead demo. i think
the first version i did for MATRIX REMAP didn't even do the mmult
itself, just printed out the indices.

also i would like to see what the algorithm generates (the indices) to
see if it is in fact workable. DCT/FFT REMAP has caveats: power-of-two.
no point spending time doing the HDL FSM if the algorithm turns out to
be borked. need to find that out early.

will find links later
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD

the function there is the earliest REMAP algorithm first implemented,
done as a generator. it is used in ISACaller directly, which is a bundle
of fun because ISACaller itself uses yield. there is also an actual
matrix mult demo/test somewhere.
added python generator for tree-reduce with remap (passed in as a dict or something else indexable) https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89
(In reply to Jacob Lifshay from comment #26)
> added python generator for tree-reduce with remap (passed in as a dict or
> something else indexable)
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89

I got thoroughly distracted trying to make pretty ascii-art tree graphs
for use by the above generator, WIP code here:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f684acfa32ba1ef8c52abc5876da2b73696862dd

(lkcl, please don't remove all the type annotations or refactor all the
code in text_tree_graph.py, it makes it waay easier for me to figure
out, i'll remove them when I'm done... unless you want to finish getting
it working? we should probably put this off for several months at
least...)
(In reply to Jacob Lifshay from comment #26)
> added python generator for tree-reduce with remap (passed in as a dict or
> something else indexable)
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89

ok good. next step is to remove the MV, this is a hard requirement not
up for debate. see multiple answers and comments above already given,
and see comment #15 for the basic direction of the required final
algorithm.

i thought of a potential solution for the (single, only) case where
there is only one predicate bit in the source, and that is to output a
co-result (new predicate, likely have to be CRs, or r30) containing an
indicator of which elements are valid. in the case where there was more
than 1 src bit the co-result would be 000000001, but in the 1-src-bit
case the co-result would be an *exact* copy of that very same mask.

this solution would only work if the src vector was also the dest
vector, which is not unreasonable, and the final algorithm should take
that into account. i have a suspicion it will be a hard requirement
anyway: there is no way that an algorithm which requires a vector of
temporary registers will be acceptable or viable. all REMAP algorithms
MUST be re-entrant without requiring storage of inflight intermediary
computations; only the existing regfile can (theoretically) be used for
that.

(In reply to Jacob Lifshay from comment #27)
> I got thoroughly distracted trying to make pretty ascii-art tree graphs for
> use by the above generator, WIP code here:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f684acfa32ba1ef8c52abc5876da2b73696862dd

nice

> (lkcl, please don't remove all the type annotations or refactor all the code
> in text_tree_graph.py, it makes it waay easier for me to figure out, i'll
> remove them when I'm done...unless you want to finish getting it working? we
> should probably put this off for several months at least...)

no problem with that
drat. when looking at viota i realised that for scalar reduction, when
predicate masks are involved, everything goes to hell.

the current definition of scalar reduction is one that disables
"stopping" of SV loops if the destination is a scalar, because the
destination can be used as an accumulator.

    sv.add/mr r0, r3.v, r0

this will use r0 as a scalar accumulator.

    sv.madd/mr r0, r8.v, r16.v, r0

this is basically dot-product. a prefix-sum (fibonacci) can be done as:

    sv.addi/mr r1.v, r1.v, r2.v, 1

but - and this is the problem - if predication is enabled on that
sv.addi it all goes to hell.

first question: can a useful schedule be created which skips over
masked-out elements

second question: can the registers be easily determined (which one is
the accumulator)

third question: are there enough bits in MODE to include this.
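[editor's note: a small sketch of why predication breaks the chained
(prefix-sum style) form: each element depends on the previous one having been
written, so naively skipping a masked-out element leaves a hole in the chain.
register numbers, VL and values are illustrative only.]

    VL = 6
    regs = [0] * 32
    regs[8:8 + VL] = [1, 2, 3, 4, 5, 6]   # r8.v source vector
    pred = [1, 1, 0, 1, 1, 1]             # element 2 masked out

    # unpredicated prefix-sum style: dest r2.v overlaps src1 r1.v
    # offset by one; r1 (the chain seed) starts at 0
    for i in range(VL):
        regs[2 + i] = regs[1 + i] + regs[8 + i]
    print(regs[2 + VL - 1])               # 21: full sum in the last element

    # same thing, naively skipping masked-out elements: the chain breaks,
    # because element 3 reads a partial sum that element 2 never wrote
    regs[1:2 + VL] = [0] * (VL + 1)       # clear r1..r7 again
    for i in range(VL):
        if pred[i]:
            regs[2 + i] = regs[1 + i] + regs[8 + i]
    print(regs[2 + VL - 1])               # 15, but the unmasked elements
                                          # sum to 18: result is corrupted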
something along these lines:

* RB-RA (the register *numbers*, not GPR(RB)-GPR(RA)) is computed and
  stored, call it offs
* RT-RA likewise, call it roffs
* a sequence of offsets is computed from the predicate mask. 0b1001101
  would create indices [0, 2, 3, 6], call it idxs

the for-loop therefore goes:

    for i in range(len(idxs)-MAX(roffs, offs)):
        ra = RA + idxs[i]
        rb = RA + idxs[i+offs]
        result = OPERATION(GPR(ra), GPR(rb))
        rt = RA + idxs[i+roffs]
        GPR(rt) = result

taking twin-predication into account would involve using the dest
predicate to create a separate list of indices.

if RT != RA then this could be used to e.g. create a vector sequence of
differences but e.g. skip certain elements.

if RT == RA then as usual for mr and mr/RG (reverse gear) mode this
becomes a cumulative sum.

if RT < RA then the first few starting elements are skipped to always
make sure idxs[] referencing is within range.

the bit that's conceptually new is that the actual register numbers are
computed from the difference to RA as a baseline register.
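[editor's note: a runnable sketch of the scalar-relative idea above. the
reverse-gear option and the example register numbers are additions for
illustration (matching the mr/RG remark); everything else follows the
pseudocode.]

    def pred_to_idxs(mask, vl):
        # 0b1001101 -> [0, 2, 3, 6]
        return [i for i in range(vl) if (mask >> i) & 1]

    def scalar_relative_reduce(GPR, RT, RA, RB, mask, vl, reverse=False):
        offs = RB - RA               # register-number difference, not values
        roffs = RT - RA
        idxs = pred_to_idxs(mask, vl)
        steps = range(len(idxs) - max(roffs, offs))
        for i in (reversed(steps) if reverse else steps):
            ra = RA + idxs[i]
            rb = RA + idxs[i + offs]
            result = GPR[ra] + GPR[rb]
            rt = RA + idxs[i + roffs]
            GPR[rt] = result

    GPR = [0] * 32
    GPR[8:8 + 7] = [1, 2, 3, 4, 5, 6, 7]     # r8.v, VL=7
    # RT == RA == r8, RB == r9 (roffs=0, offs=1), mask 0b1001101 selects
    # elements 0,2,3,6.  with reverse-gear the accumulation chains, so
    # the first unmasked element ends up holding 1+3+4+7 = 15.
    scalar_relative_reduce(GPR, RT=8, RA=8, RB=9, mask=0b1001101, vl=7,
                           reverse=True)
    print(GPR[8])    # 15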
    00  0  dz   sz  normal mode
    00  1  0    RG  scalar reduce mode (mapreduce), SUBVL=1
    00  1  1    /   parallel reduce mode (mapreduce), SUBVL=1
    00  1  SVM  RG  subvector reduce mode, SUBVL>1

add:

    00  1  1    0   parallel reduce mode (mapreduce), SUBVL=1
    00  1  1    1   scalar relative reduce mode