Parallel prefix sum (also known as `scan`) is a commonly useful operation. Since there's space in the REMAP modes (mode 0b11) now that some were removed, I think we should add it. Some of the applications of parallel prefix sum: https://en.wikipedia.org/wiki/Prefix_sum#Applications

* Counting sort/Radix sort (the most common sorting algorithm used on GPUs)
* Decoding Gray code
* Parallel polynomial interpolation
* Computing factorials (prefix-product of 1, 2, 3, 4, ...)

Task list:

* WIP: jacob: add SVSHAPE.skip-driven update of remap_preduce_yield.py. done, except as a new function iterate_indices2; the only thing left is to merge it into the original function after deciding whether we want to delete the `steps` reversing or not. See comment #8
* DONE: jacob: add a *really* short demo2() bare-minimum test
* DONE: jacob: add unit tests verifying that remap_preduce_yield.py's prefix-sum matches the known-working code in nmutil.prefix_sum
* DONE: jacob: integrate into the svshape pseudocode
* DONE: luke: update the wiki
* DONE: lkcl: add into ISACaller svshape.py
* DONE: jacob: add some unit tests into test_caller_svp64_preduce.py; these should be "showcaseable", i.e. if presented at a conference they should be easy to understand
* TODO: jacob: put the pseudocode into the spec/RFC https://libre-soc.org/openpower/sv/remap/appendix/
we can derive the algorithm from the already-known-working nmutil prefix_sum.py, which implements both the work-efficient and the low-depth algorithms:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35
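As a rough sketch of what the work-efficient (two-phase) algorithm does, here is a minimal self-contained Python version. The function name and loop structure are illustrative, not a copy of nmutil's actual code:

```python
def prefix_sum_work_efficient(items):
    """Inclusive prefix sum via the work-efficient two-phase algorithm
    (illustrative sketch, not the nmutil implementation).
    Phase 1 (up-sweep) is the same tree as parallel reduction;
    phase 2 (down-sweep) fills in the remaining partial sums."""
    items = list(items)
    n = len(items)
    # up-sweep: combine pairs at doubling strides; after this,
    # power-of-two-boundary slots hold partial sums
    dist = 1
    while dist < n:
        for i in range(dist * 2 - 1, n, dist * 2):
            items[i] += items[i - dist]
        dist *= 2
    # down-sweep: propagate partial sums to the slots the
    # up-sweep skipped, at halving strides
    dist //= 2
    while dist >= 1:
        for i in range(dist * 3 - 1, n, dist * 2):
            items[i] += items[i - dist]
        dist //= 2
    return items
```

For example, `prefix_sum_work_efficient([1, 1, 1, 1, 1, 1, 1, 1])` yields `[1, 2, 3, 4, 5, 6, 7, 8]`.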
how is it different from parallel prefix reduction bug #864? https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;hb=HEAD
(In reply to Luke Kenneth Casson Leighton from comment #2)
> how is it different from parallel prefix reduction bug #864?
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;hb=HEAD

parallel reduction is a reduction, i.e. it has only one output; the rest of the output vector is intermediate results that aren't a prefix-sum. parallel reduction is basically half of a prefix-sum operation.
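To make the "half a prefix-sum" point concrete, here is an illustrative sketch (names are made up): running only the up-sweep half of the tree gives the full sum in the last slot, but the other slots hold intermediates, not the prefix-sum:

```python
def tree_reduce_in_place(items):
    """Up-sweep half of the scan tree only, i.e. parallel reduction
    (illustrative sketch). After this, items[-1] holds the total;
    the other slots are intermediates, NOT a prefix-sum."""
    items = list(items)
    dist = 1
    while dist < len(items):
        for i in range(dist * 2 - 1, len(items), dist * 2):
            items[i] += items[i - dist]
        dist *= 2
    return items
```

For example, `tree_reduce_in_place([1, 2, 3, 4])` gives `[1, 3, 3, 10]`: the total (10) is correct, but index 2 holds 3 rather than the prefix-sum value 6, which is what the missing down-sweep half would fix.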
https://libre-soc.org/openpower/sv/remap/

    3.1 Parallel Reduction Mode

    Creates the Schedules for Parallel Tree Reduction.

    submode=0b00 selects the left operand index
    submode=0b01 selects the right operand index

it's *just* about possible to squeeze in alternative modes without going overboard: svshape's 3rd dimension (zdim) could in theory be used to indicate an alternative reduction algorithm.
basically, the parallel reduction algorithm we use is similar to only the first loop in the work-efficient prefix-sum; the second loop is missing:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71
(In reply to Jacob Lifshay from comment #5)
> basically the parallel reduction algorithm we use is similar to only the
> first loop in work-efficient prefix-sum, the second loop is missing
> https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l71

for a visual diagram, see:
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l164
(In reply to Jacob Lifshay from comment #3)
> parallel reduction is a reduction aka has only one output, the rest of the
> output vector is intermediate results that aren't a prefix-sum. parallel
> reduction is basically 1/2 a prefix-sum operation.

ahh, that's the second half of the tree in the image on the wikipedia page. okaaay. right. yes, it's doable. it's pretty seriously late in the day but i like it. ok.

so if SVyd is free... which it is...
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=c350b3cf86fadcfdd22663c7119249e00328ff4a;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l281

then at lines 301 and 305, if SVyd == 1 then set submode=0b10/0b11 instead of submode=0b00/0b01 in SVSHAPE0 and SVSHAPE1 respectively.

then in svshape.py:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/svshape.py;h=8b4533755a214d243b509ca20994a7b3ff651cdf;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l166

line 170 needs sub-mode detection... or does it... no. iterate_preduce_indices() is where the sub-mode would be looked at and the second half done.

do you want to have a go at that? keep it *real* simple, one step at a time. can you add support for:

* SVSHAPE.skip == 0b10 -> submode 10 (LH argument) parallel-prefix
* SVSHAPE.skip == 0b11 -> submode 11 (RH argument) parallel-prefix

line 30
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=a8093803da46ef81a1254b2b2c5b00f153568578;hb=74a1b2ad43661599e3c3810f742149c17e83f542#l34

and you *should* just be able to add the 2nd half onto the end (line 43 onwards) and do

    if SVSHAPE.skip in [0b00, 0b01]: return

and that should basically be it. add a *real* simple 2nd "demo()" function, please *don't* go massively overboard, or make significant changes to this code, ok? one step at a time, remap_preduce_yield.py first.
as mentioned on irc, please add a todo list in the top comment.

(In reply to Luke Kenneth Casson Leighton from comment #7)
> iterate_preduce_indices() is where sub-mode would be
> looked at and the second half done.
>
> do you want to have a go at that? keep it *real* simple

yeah, i can do that, probably later today
(In reply to Luke Kenneth Casson Leighton from comment #7)
> and you *should* just be able to add the 2nd half onto the end (line 43
> onwards) and do
>
> "if SVSHAPE.skip in [0b00, 0b01]: return"

turns out that reduce and prefix-sum are different enough that merging their loops would be pretty confusing, so I just stuck all the prefix-sum code in a big if. If/when we really need to merge them, we can refactor it then. I think the pseudo-code will also benefit from having prefix-sum completely separate from reduction, since it makes the loops much easier to understand.

> and that should basically be it. add a *real* simple 2nd "demo()" function,
> please *don't* go massively overboard, or make significant changes to this
> code ok?

While I was reading the existing code, I noticed you have:

    if SVSHAPE.invxyz[1]:
        steps.reverse()

which makes the function much more complex and isn't actually useful (it flips the reduction tree vertically, exchanging the root end with the leaves end), so I think that should be removed and `steps` merged into the loop rather than being a separate loop generating a list.

Therefore, rather than modifying the existing function, I copied the function to iterate_indices2, removed the `steps` reversing, and added prefix-sum to the copy. Since prefix-sum is in an if all by itself, it can easily be copied into the existing code if we don't want to remove the `steps` reversing.

I added demo_prefix_sum as requested. I also added test_remap_reduce_yield.py to verify that the prefix-sum functionality matches nmutil.prefix_sum.
oh, I forgot to link to the commit:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0b6592c574f814d81cfede4c74c50b583590db13

commit 0b6592c574f814d81cfede4c74c50b583590db13
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Wed Apr 26 22:09:19 2023 -0700

    add scan/prefix-sum support to copy of parallel-reduce remap iterator
commit d40763cd6e186ad9b17ce6f974a38b4c4877965e (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Thu Apr 27 10:23:27 2023 +0100

    link in new parallel-prefix REMAP schedule

so that's now linked in to SVSHAPE
Another thing I noticed: we need an svshape instruction that can take a dynamic length and produce the prefix-sum or reduce schedules; currently svshape only supports constant lengths.
(In reply to Jacob Lifshay from comment #12)
> other thing I noticed, we need a svshape instruction that can take a dynamic
> length and produce the prefix-sum or reduce schedules, currently svshape
> only supports constant lengths.

they're all very deliberately immediate-based, because it otherwise means that the establishment of REMAP Schedules is dependent on GPR registers. the sole exception is Indexed REMAP, and even that is an "indicator to trigger the process of cacheing GPRs" rather than an "actual Hazard".

what's the story there? why is dynamic length needed?
(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #12)
> > other thing I noticed, we need a svshape instruction that can take a dynamic
> > length and produce the prefix-sum or reduce schedules, currently svshape
> > only supports constant lengths.
>
> they're all very deliberately immediate-based because it otherwise
> means that the establishment of REMAP Schedules is dependent on
> GPR registers. the sole exception is Indexed REMAP and even that
> is an "indicator to trigger the process of cacheing GPRs" rather than
> "actual Hazard"

I was expecting the input to be VL rather than a GPR...

> what's the story there? why is dynamic length needed?

because code that is trying to reduce or prefix-sum a large array needs to be able to have a dynamic length for the vector tail (or similar scenarios). not the most efficient algorithm, but this needs to be possible for non-associative ops:

    chunk_size = 64
    sum = 0
    for i in range(0, length, chunk_size):
        sum += reduce_chunk(VL=chunk_size)
    sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

or, for prefix-sum:

    chunk_size = 64
    sum = 0
    for i in range(0, len(array), chunk_size):
        vec = load(array[i:][:chunk_size], VL=chunk_size)
        vec = vec.prefix_sum(length=chunk_size)
        vec += splat(sum)
        sum = vec[-1]
        store(vec, array[i:][:chunk_size], VL=chunk_size)
    remainder = len(array) % chunk_size
    vec = load(array[i:][:remainder], VL=remainder)
    vec = vec.prefix_sum(length=remainder)  # dynamic VL here
    vec += splat(sum)
    store(vec, array[i:][:remainder], VL=remainder)
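As a runnable model of the chunked prefix-sum loop above (plain Python standing in for the vector loads, stores and splat; `chunk_size` and the sequential inner scan are illustrative, not how the hardware would do it):

```python
def chunked_prefix_sum(array, chunk_size=4):
    """Model of the chunked prefix-sum sketch: scan each fixed-size
    chunk, add the running total from previous chunks (the "splat"),
    and carry the last element forward. The final, shorter chunk is
    where a dynamic VL would be needed on real hardware."""
    out = []
    total = 0
    for i in range(0, len(array), chunk_size):
        chunk = array[i:i + chunk_size]  # tail chunk may be shorter
        # stand-in for the vectorized per-chunk prefix-sum
        acc = 0
        vec = []
        for x in chunk:
            acc += x
            vec.append(acc)
        # vec += splat(total), then remember the new running total
        out.extend(v + total for v in vec)
        total = out[-1]
    return out
```

For example, `chunked_prefix_sum([1, 1, 1, 1, 1, 1, 1], chunk_size=3)` gives `[1, 2, 3, 4, 5, 6, 7]`, with the last chunk of length 1 playing the role of the dynamic-VL tail.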
(In reply to Jacob Lifshay from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > what's the story there? why is dynamic length needed?
>
> because, for code that is trying to reduce or prefix-sum a large array, it
> needs to be able to have a dynamic length for the vector tail (or similar
> scenarios):
>
>     for i in range(0, length, chunk_size):
>         sum += reduce_chunk(VL=chunk_size)
>     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here

drat, you're right. of course. okaaay sigh. so this involves creating a new Form, with Parallel Reduction similar to svshape2 "carving out" its own niche... urrr...
(In reply to Luke Kenneth Casson Leighton from comment #15)
> drat, you're right.

:)

> of course. okaaay sigh so this involves creating a new Form, with Parallel
> Reduction similar to svshape2 "carving out" its own niche... urrr...

as mentioned on irc, I think we should use one more XO bit and support a GPR for X (not Y or Z) for all svshape modes.

SV Shape Dynamic:

    svshaped RA, SVyd, SVzd, SVrm, vf

    if RA = 0 then x_len <- VL
    else x_len <- (RA)
    VL <- remapped_len(x_len, ...)
    # since we don't want to set MAXVL from a GPR:
    MAXVL <- remapped_len(x_len=MAXVL, ...)
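A tiny Python model of the proposed semantics may help; `remapped_len` is a stand-in for the (as yet unspecified) schedule-length calculation, and registers are modeled as plain values. This is a sketch of the proposal, not existing simulator code:

```python
def svshaped_model(RA_value, VL, MAXVL, remapped_len):
    """Model of the proposed svshaped: the schedule length comes from
    VL (when RA=0, modeled here as RA_value=None) or a GPR, while
    MAXVL is recomputed from its own previous value, never from a GPR.
    remapped_len is a hypothetical stand-in for the schedule-length
    function."""
    x_len = VL if RA_value is None else RA_value
    new_VL = remapped_len(x_len)
    new_MAXVL = remapped_len(MAXVL)  # deliberately not GPR-derived
    return new_VL, new_MAXVL
```

For instance, with a toy `remapped_len` of `lambda n: 2 * n - 1`, `svshaped_model(None, 4, 8, ...)` returns `(7, 15)`: VL is remapped from VL itself, MAXVL from MAXVL.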
I finished adding prefix-sum to svshape, and added some prefix-sum tests which pass! (both for add and sub)

The tests I added are pretty short and to-the-point, but not quite what you were asking for, luke, so if you're happy with their showcase-ability we can mark that as DONE.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b6a45579add1491f23a06ca5437c5e5e5989e75e

commit b6a45579add1491f23a06ca5437c5e5e5989e75e
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:49:30 2023 -0700

    prefix-sum remap works!

commit 1f451e7846d4561a36a67d38814d814ea59f3802
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:48:03 2023 -0700

    change order to tuple in remap preduce tests/demos to match rest of simulator

commit 7facebc376274d533c49f34f0b296154ab30f044
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Apr 28 01:47:12 2023 -0700

    fix <u and >u with int arguments
(In reply to Jacob Lifshay from comment #17)
> I finished adding prefix-sum to svshape, and added some prefix-sum tests
> which pass! (both for add and sub)
>
> The tests I added are pretty short and to-the-point, but not quite what you
> were asking for, luke, so if you're happy with their showcase-ability we can
> mark that as DONE.

yes, looks great. can you cut/paste the pseudocode into the Appendix of REMAP?
https://libre-soc.org/openpower/sv/remap/appendix/

it goes into the RFC and constitutes "questions feedback", so i can justify putting budget your way for it.
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
>
>     for i in range(0, length, chunk_size):
>         sum += reduce_chunk(VL=chunk_size)
>     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
>
> drat, you're right.
>
> of course. okaaay sigh so this involves creating a new Form, with Parallel
> Reduction similar to svshape2 "carving out" its own niche... urrr...

ah. realised there's a complication. VL and MAXVL determine the number of operations carried out, not the width of the reduction.

for now, the tail will have to be done by recursive macros using immediates manually.
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #15)
> > (In reply to Jacob Lifshay from comment #14)
> >
> >     for i in range(0, length, chunk_size):
> >         sum += reduce_chunk(VL=chunk_size)
> >     sum += reduce_tail(VL=length % chunk_size)  # dynamic VL here
>
> ah. realised there's a complication. VL and MAXVL determines the number
> of operations carried out, not the width of the reduction.
>
> for now the tail will have to be done by recursive macros using
> immediates manually.

that's why I'm proposing an svshape with GPR/VL input, so it can calculate the proper VL (which is a little less than 2x the element count, but doesn't follow any simple formula I could find). MAXVL would be adjusted following the same algorithm, but using MAXVL as the input rather than VL or a GPR.
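The "a little less than 2x" observation can be checked by counting the add operations the standard two-phase work-efficient scan issues for a given element count. This counter is an illustrative model mirroring that algorithm's loop bounds, not the actual remap iterator:

```python
def prefix_sum_op_count(n):
    """Count the add operations a work-efficient two-phase inclusive
    scan issues for n elements -- i.e. the VL the schedule would need
    (illustrative model, not the actual remap code)."""
    count = 0
    # up-sweep passes
    dist = 1
    while dist < n:
        count += len(range(dist * 2 - 1, n, dist * 2))
        dist *= 2
    # down-sweep passes
    dist //= 2
    while dist >= 1:
        count += len(range(dist * 3 - 1, n, dist * 2))
        dist //= 2
    return count
```

For example, 8 elements need 11 operations: below 2x8 = 16, but not related to the element count by any obvious closed form.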
(In reply to Jacob Lifshay from comment #20)
> that's why I'm proposing a svshape with GPR/VL input, so it can calculate
> the proper VL (which is a little less than 2x bigger but doesn't follow any
> simple formula I could find).
>
> MAXVL would be adjusted following the same algorithm but using MAXVL as the
> input rather than VL or a GPR.

for details, see https://bugs.libre-soc.org/show_bug.cgi?id=1071#c16
(In reply to Jacob Lifshay from comment #16)
>     if RA = 0 then x_len <- VL
>     else x_len <- (RA)
>     VL <- remapped_len(x_len, ...)
>     # since we don't want to set MAXVL from a GPR:
>     MAXVL <- remapped_len(x_len=MAXVL, ...)

this will fail on the next iteration unless preceded by a setvl MAXVL=n instruction.

my feeling is that this should be done after the first round, investigating EXT1xx variants of SVP64 Management instructions.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #16)
>
> >     if RA = 0 then x_len <- VL
> >     else x_len <- (RA)
> >     VL <- remapped_len(x_len, ...)
> >     # since we don't want to set MAXVL from a GPR:
> >     MAXVL <- remapped_len(x_len=MAXVL, ...)
>
> this will fail on the next iteration unless preceded by a
> setvl MAXVL=n instruction.

yup, that's the idea -- run setvl first -- hence why I was initially thinking it would only read from VL and not a GPR. maybe just reserve one uncommon SVxd immediate (59?) to mean "read SVxd from VL instead"? that way you don't need 2x the encoding space.

though i also just now thought of a way around needing setvl -- have 2 MAXVL fields or SPRs, tentatively named OP_MAXVL (used as a limit on element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL, and for whenever doing register offsetting like RS=RT+MAXVL, since there we care about register counts, not how many operations our random svshape needs). both would be set by setvl when setting MAXVL, but svshape and friends would only set OP_MAXVL, computing it from REG_MAXVL or the immediates. it may also be useful to likewise split VL into two, but that's not as necessary.
(In reply to Jacob Lifshay from comment #23)
> yup, that's the idea -- run setvl first -- hence why I was initially
> thinking it would only read from VL and not a GPR. maybe just reserve one
> uncommon SVxd immediate (59?) to mean to instead read SVxd from VL?

maaybeee... although realistically this is an entirely new instruction, so in theory it could do anything.

> that way you don't need 2x the encoding space.

irony: 64-bit instruction encodings (for e.g. svshape) are just as long as 2 32-bit instructions. sigh.

> though i also just now thought of a way around needing setvl -- have 2 MAXVL
> fields or SPRs, tentatively named OP_MAXVL (used as a limit on
> element/subvector operations) and REG_MAXVL (used as a backup for OP_MAXVL
> and for whenever doing register offsetting like RS=RT+MAXVL,

DCT and FFT rely on RS=RT+MAXVL being separate.

> since there we
> care about register counts not how many operations our random svshape needs)
> both set by setvl when setting MAXVL, but svshape and friends only set
> OP_MAXVL, computing it from REG_MAXVL or the immediates. it may be also
> useful to likewise split VL into two, but not as necessary.

very reluctant to go to a 2nd SPR for SV state. especially right now.