864 – implement parallel prefix reduction in simulator

Bug 864 - implement parallel prefix reduction in simulator

Summary: implement parallel prefix reduction in simulator

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code (show other bugs)
Version:	unspecified
Hardware:	Other Linux

Importance:	High enhancement
Assignee:	Dmitry Selyutin

URL:	https://libre-soc.org/openpower/sv/sv...

Depends on:
Blocks:	233
	Show dependency tree / graph

Reported:	2022-06-21 10:36 BST by Luke Kenneth Casson Leighton
Modified:	2023-04-26 21:21 BST (History)
CC List:	3 users (show)

See Also:	911 1071
NLnet milestone:	NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks:	3000
budget (EUR) for this task, excluding subtasks' budget:	3000
parent task for budget allocation:	233
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	ghostmansd = { amount = 900, submitted = 2022-09-13, paid = 2022-09-15 } lkcl = { amount = 1800, submitted = 2022-09-12, paid = 2022-09-15 } [jacob] amount = 300 submitted = 2022-09-15 paid = 2022-09-19

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2022-06-21 10:36:07 BST

the index-redirection SVSTATE-context-only parallel reduction
algorithm needs implementing (single instruction only, strict
RISC paradigm) and if possible and practical including
predication.

Comment 1 Luke Kenneth Casson Leighton 2022-06-21 14:09:04 BST

first version including a "yield" version which does not require
a move operation.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/preduce.py;h=5fb4f26f6d349e5fdf0d15cb50f66a4da6fa40e4;hb=13c4a19e93caf4dc817c508f70f3af42749a7a87

Comment 2 Jacob Lifshay 2022-06-21 21:17:01 BST

(In reply to Luke Kenneth Casson Leighton from comment #1)
> first version including a "yield" version which does not require
> a move operation.
> 
> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/preduce.py;
> h=5fb4f26f6d349e5fdf0d15cb50f66a4da6fa40e4;
> hb=13c4a19e93caf4dc817c508f70f3af42749a7a87

try testing with a predicate of [0,0,0,0,1,1,1,1], you'll see that element 0 (the expected output by the compiler) never gets written to, hence why moves are necessary.

also, you can think of reduction as doing operations where the predicates are on each element operation's *inputs* rather than its outputs, so, obviously it needs to do something different when one input is predicated off and the other input is still on -- move (or something similar -- e.g. integer subtract could be defined to assume predicated off inputs are replaced with 0 so a - b with a predicated off would output -b).

Comment 3 Luke Kenneth Casson Leighton 2022-06-25 19:46:06 BST

(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > first version including a "yield" version which does not require
> > a move operation.
> > 
> > https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/preduce.py;
> > h=5fb4f26f6d349e5fdf0d15cb50f66a4da6fa40e4;
> > hb=13c4a19e93caf4dc817c508f70f3af42749a7a87
> 
> try testing with a predicate of [0,0,0,0,1,1,1,1], you'll see that element 0
> (the expected output by the compiler) never gets written to,

    vec = [1, 2, 3, 4, 9, 5, 6, 8]
    prd = [0, 0, 0, 0, 1, 1, 1, 1]

    output: [1, 2, 3, 4, 28, 5, 14, 8]

that's expected behaviour.  the first four are masked-out, and
do not get altered. the sum ends up in the first non-masked-out element 
(element 4) 9+5+6+8=28


what *actually* happened was:

round 1:

* 6+8 [14] got stored in element 6
* 9+5 [14] got stored in element 4

round 2:

* 14+14 [28] got stored in element 4

this is "expected behaviour" and i have a suspicion that it does not
require the ORing of the predicate mask.  which would be a huge
advantage because it drastically reduces complexity and reduces state
especially on context-switches.


> hence why moves are necessary.

i've said it about five times now, it is getting very irritating for me
that you are not listening.

moves are NOT happening: please adapt your thinking so that the addition
of moves is entirely and 100% removed from all advocated solutions because
i have made it repeatedly and abundantly clear that moves are NOT going
to be part of the specification.  doing so violates strict RISC principles
and will cause no end of problems.

please help to SOLVE the problem rather than continuing to repeat "moves
are necessary moves are necessary"

they are NOT happening - period.

> also, you can think of reduction as doing operations where the predicates
> are on each element operation's *inputs* rather than its outputs, so,

there's two (for Twin-Predication)

> obviously it needs to do something different when one input is predicated
> off and the other input is still on -- move (or something similar -- e.g.
> integer subtract could be defined to assume predicated off inputs are
> replaced with 0 so a - b with a predicated off would output -b).

unfortunately, changing the inputs to zeros would result in some instructions
(multiply-prefix) outputting zero.

then when considering "ah but you can just make those have a 1 instead"
this massively complicates the decode-and-issue phase, introducing an
inter-dependency between the "Abstracted Schedule" and the "Element Operation"
which violates the strict RISC design principle of SV.

there was something else.  it occurred to me last week.  i can't remember
it, now.  something related to the predicate masks. it was important, and
was a reason why the original algorithm (with modifying predicate masks)
could not be used.

Comment 4 Jacob Lifshay 2022-06-26 09:30:20 BST

(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > try testing with a predicate of [0,0,0,0,1,1,1,1], you'll see that element 0
> > (the expected output by the compiler) never gets written to,
> 
>     vec = [1, 2, 3, 4, 9, 5, 6, 8]
>     prd = [0, 0, 0, 0, 1, 1, 1, 1]
> 
>     output: [1, 2, 3, 4, 28, 5, 14, 8]
> 
> that's expected behaviour.  the first four are masked-out, and
> do not get altered. the sum ends up in the first non-masked-out element 
> (element 4) 9+5+6+8=28

the result not being in a statically predicable location renders predicated reduction *all but useless* from a compiler perspective -- the compiler needs to know which *fixed scalar* register to read the result from, the register can't jump around when the predicate changes.

(unpredicated reduction is unaffected)

since you obviously don't like my proposed solution (moving is imho the best fully correct solution -- not all operations have an identity, so substituting masked-off lanes with the identity element doesn't really work), please propose some yourself.

Comment 5 Jacob Lifshay 2022-06-26 09:36:28 BST

(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > obviously it needs to do something different when one input is predicated
> > off and the other input is still on -- move (or something similar -- e.g.
> > integer subtract could be defined to assume predicated off inputs are
> > replaced with 0 so a - b with a predicated off would output -b).
> 
> unfortunately, changing the inputs to zeros would result in some instructions
> (multiply-prefix) outputting zero.

I meant that the input is Absent when that predicate bit is zero, not that the input is replaced with zero. different instructions would handle absent inputs different ways .. e.g. bitwise-and would treat an absent input as -1, division could treat an absent input as 1.

imho this still follows the risc paradigm since the svp64 prefix could care less which binary op it's reducing with, all it knows is it didn't supply some of the inputs because those predicate bits were zeros.

Comment 6 Luke Kenneth Casson Leighton 2022-06-26 10:06:36 BST

(In reply to Jacob Lifshay from comment #4)

> the result not being in a statically predicable location renders predicated
> reduction *all but useless* from a compiler perspective

dynamically predictable.  result will always be in the position
of the very first predicate-bit.
sv.mv with source set to scalar and mask set to the same predicate
gets that, predictably.

please drop the need to insist that moves are necessary.  7th time of
repeating this.

Comment 7 Jacob Lifshay 2022-06-26 10:12:12 BST

(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #4)
> 
> > the result not being in a statically predicable location renders predicated
> > reduction *all but useless* from a compiler perspective
> 
> dynamically predictable.  result will always be in the position
> of the very first predicate-bit.
> sv.mv with source
i think you meant dest

> set to scalar and mask set to the same predicate
> gets that, predictably.

yes, but, because sv.mv rt.s, ra.v acts like a mv.x, it'll likely be very slow...imho it should be avoided.

Comment 8 Luke Kenneth Casson Leighton 2022-06-26 11:10:45 BST

(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > dynamically predictable.  result will always be in the position
> > of the very first predicate-bit.
> > sv.mv with source
> i think you meant dest

no, source.  the selection of the result(s) of the [parallel-prefixed]
output become a source to a second sv.mv, using the same predicate mask.
the destination is a scalar.  you have it below: "...sv.mv rt.s, ra.v acts..."

> > set to scalar and mask set to the same predicate
> > gets that, predictably.
> 
> yes, but, because sv.mv rt.s, ra.v acts like a mv.x, it'll likely be very
> slow...

that's down to implementors to spot the pattern and optimise it.

* "dest is scalar therefore only 1st bit of predicate matters"
* "only 1st bit matters therefore only one mv needs to be done"

basically: Officially Not Our Problem.  worth mentioning as an Implementor's
Note, but Not Our Problem.

other methods: don't damn well use predication in the first place.  use
a non-zero predicate mv to reduce them down to a contiguous set, first.

Comment 9 Luke Kenneth Casson Leighton 2022-08-28 19:19:20 BST

dmitry this needs to have binutils assembler-syntax for the
parallel-prefix (probably "/pp", go figure) added to binutils
and sv/trans/svp64.py. oh er and disasm.

Comment 10 Luke Kenneth Casson Leighton 2022-09-05 01:09:42 BST

added sv/trans/svp64.py support (non-subvector mode)

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=8ea6301cecb6f6a61b140cf1b10f23bdbb161961

added to SVP64RMModeDecode

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=806cd6e0f2239ee12aef3d2ee96cf2a14a54d13b

Comment 11 Luke Kenneth Casson Leighton 2022-09-06 15:28:41 BST

commit 80c5c65b2e24c0c813e18e118f79bc120ffbfb20 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Sep 6 15:28:05 2022 +0100

    REMAP parallel-reduce:
    https://bugs.libre-soc.org/show_bug.cgi?id=864
    * add 0b0111 csv entry for svshape
    * stop sv/trans/svp64.py raising exception for SVrm=0b0111
    * add beginnings of svshape SVrm=0b0111 to simplev.mdwn
    * add first unit test

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=80c5c65b2e24c0c813e18e118f79bc120ffbfb20

Comment 12 Luke Kenneth Casson Leighton 2022-09-06 15:51:05 BST

commit e88b945123f2c3f047748a44fdf446775517076a (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Sep 6 15:50:35 2022 +0100

    add first functional confirmed unit test for parallel reduce REMAP
    https://bugs.libre-soc.org/show_bug.cgi?id=864
    using remap_preduce_yield directly, confirmed operational through ISACaller
    added through decoder/isa/svshape.py which is responsible
    in SVSHAPE.getiterator() for returning the appropriate yield-iterator

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e88b945123f2c3f047748a44fdf446775517076a

Comment 13 Jacob Lifshay 2022-09-07 09:16:00 BST

one thing that needs to be tested is an associative but not commutative function, fcpsgn is a good candidate (there aren't many that are single instructions). this ensures the order is correct.

you can also test a neither associative nor commutative function to ensure the tree is evaluated in the proper order, such as sub or andc.

Comment 14 Luke Kenneth Casson Leighton 2022-09-07 18:39:17 BST

(In reply to Jacob Lifshay from comment #13)
> one thing that needs to be tested is an associative but not commutative
> function, fcpsgn is a good candidate (there aren't many that are single
> instructions). this ensures the order is correct.
> 
> you can also test a neither associative nor commutative function to ensure
> the tree is evaluated in the proper order, such as sub or andc.

hm good call on both counts.  manual inspection will
be needed because by using the same yield algorithm
it is risky.

Comment 15 Luke Kenneth Casson Leighton 2022-09-07 19:03:21 BST

after some thought, adding in predication is too complex in hardware
and the simulator, or, more to the point, the existing predication
system should not be altered.

REMAP predication requires *post* index remap lookup, and in the case
of Matrix that requires three separate predicates: one for A matrix
one for B matrix and another for the result.

DCT/FFT needs one predicate per element (post remap) and Indexed as well.

all of which is too much to tackle immediately

Comment 16 Luke Kenneth Casson Leighton 2022-09-07 20:31:09 BST

happy with these unit tests - predication will have to wait though.
ghostmansd, as this is a REMAP mode now, you don't need to add
anything to binutils, with the possible exception of checking
that SVRM=0b0111 is okay, in svshape. can you confirm?

Comment 17 Luke Kenneth Casson Leighton 2022-09-12 21:22:56 BST

answer is yes.
https://sourceware.org/pipermail/binutils/2022-September/122783.html

i saw in those patches there's exceptions for 0b0111 and 0b1000
(7 and 8).