Bug 1077 - evaluate removing /vec234 from instructions, move subvl to SVSTATE
Summary: evaluate removing /vec234 from instructions, move subvl to SVSTATE
Status: RESOLVED WONTFIX
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 952
Reported: 2023-05-03 03:55 BST by Luke Kenneth Casson Leighton
Modified: 2023-06-02 18:50 BST

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 1056
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2023-05-03 03:55:00 BST
there is an anomaly in SVP64 at the moment: subvl is not
part of SVSTATE, it is part of the prefix.
it needs to be evaluated if it is sensible to add subvl
to SVSTATE, freeing up 2 bits in RM Prefixes.
Comment 1 Luke Kenneth Casson Leighton 2023-05-03 04:04:51 BST
this is a BIG spec change but it is one that is important to get right.

the issue is that SVSTATE is supposed to define in advance the
number of registers needed, so that hardware gets a head-start
on reserving space and does not reserve too much.

MAXVL is supposed to say "ok we have MAXVL=5, that means each
instruction could use only up to 5 regs, no need to speculatively
reserve 6" and you know that holds true until MAXVL changes.

unfortunately, /vec2 doubles that.  /vec3 triples it. and unlike
setting MAXVL, the /vec are on a *per instruction* basis.  so
one minute you reserve 5 regs, the next you reserve 10.

yes elwidth halves, quarters etc. but that is fewer regs, not more.
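
The reservation arithmetic above can be sketched roughly as follows. This is a minimal Python model, not ISACaller code: the name `regs_needed`, the 64-bit register width, and the assumption that elements are packed contiguously are all illustrative.

```python
import math

REG_BITS = 64  # assumption: 64-bit registers, elements packed contiguously


def regs_needed(maxvl, subvl=1, elwidth=64):
    """Worst-case number of registers one operand of one instruction may span."""
    total_bits = maxvl * subvl * elwidth
    return math.ceil(total_bits / REG_BITS)


print(regs_needed(5))              # 5
print(regs_needed(5, subvl=2))     # 10
print(regs_needed(5, elwidth=32))  # 3
```

In this model MAXVL is fixed ahead of time, while `subvl` can differ on every instruction, which is where the 5-then-10 swing comes from.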

if you look closely at ISACaller subvl is anomalous, and in
Vertical-First Mode causes minor havoc.

by moving subvl to SVSTATE we also free up 2 bits in the prefix, where
there are currently no RESERVED bits at all.

but SVSTATE only has 6 bits free. that would go down to 4.

thoughts appreciated
Comment 2 Jacob Lifshay 2023-05-03 04:05:58 BST
imho it is not sensible to remove subvl from the prefix because GPU code often has subvl different in each instruction, so setvl or similar would need to be constantly rerun just to change subvl. This would easily mean many times more setvl ops, because otherwise setvl is usually run once per shader or once per function.

All of that is unless we do what some other GPU compilers do, which is to convert all code to scalar code and only then run the full-function-vectorization -- this essentially splits all subvl=2/3/4 ops into 2/3/4 copies of a subvl=1 op and never uses subvl=2/3/4, essentially removing subvl completely from generated code.
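
The scalarization step described above could look roughly like this. It is a toy Python sketch only: the tuple representation of an op and the `.x/.y/.z/.w` lane naming are hypothetical and are not taken from any real compiler.

```python
def scalarize(op, dest, srcs, subvl):
    """Split one subvl=N op into N subvl=1 ops, one per sub-element."""
    if not 1 <= subvl <= 4:
        raise ValueError("subvl must be 1..4")
    lanes = "xyzw"[:subvl]
    return [(op, f"{dest}.{lane}", [f"{s}.{lane}" for s in srcs])
            for lane in lanes]


# a vec3 add becomes three independent scalar adds
for scalar_op in scalarize("fadd", "a", ["b", "c"], 3):
    print(scalar_op)
```

After this transformation no op carries subvl=2/3/4, so full-function vectorization only ever sees scalar element types.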
Comment 3 Jacob Lifshay 2023-05-03 04:12:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> unfortunately, /vec2 doubles that.  /vec3 triples it. and unlike
> setting MAXVL, the  /vec are on a *per instruction* basis.  so
> one minute you reserve 5 regs, the next you reserve 10.

that applies equally to operations that write two registers such as dshl or all the FFT ops.
> 
> yes elwidth halves quarters etc. but that is less not more regs.
> 
> if you look closely at ISACaller subvl is anomalous, and in
> Vertical-First Mode causes minor havoc.

that's cuz subvl basically says that the basic element type is a mathematical 2/3/4-vector rather than a scalar type...or at least that's what I intended.
Comment 4 Jacob Lifshay 2023-05-03 04:28:01 BST
(In reply to Jacob Lifshay from comment #2)
> imho it is not sensible to remove subvl from the prefix because GPU code
> often has subvl different in each instruction, so setvl or similar would
> need to be constantly rerun just to change subvl, this would easily be many
> times more setvl ops because otherwise setvl is usually run once per shader
> or once per function.

as an example, take this not-unusual blur shader from supertuxkart:
https://github.com/supertuxkart/stk-code/blob/fd12829e5b0c2f91fe522df5d8c74e3bb905b0cc/data/shaders/gaussian6h.frag
it would need like 15 setvl ops, and that's without even unrolling that loop! something like >30 for the loop-unrolled version! If we instead keep the status-quo with support for subvl in the prefix, the entire function needs just one setvl op.
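
A rough way to see where such counts come from: if subvl lived in SVSTATE, a state-changing op would be needed every time consecutive instructions want different subvl values. The Python sketch below counts those changes. The instruction sequence is made up for illustration and is not derived from the shader linked above.

```python
def count_subvl_changes(subvls, initial=1):
    """Count state updates needed if subvl is global state, not per-instruction."""
    current = initial
    changes = 0
    for wanted in subvls:
        if wanted != current:
            changes += 1
            current = wanted
    return changes


# hypothetical per-instruction subvl values (vec2 coords, vec4 colours, scalars)
sequence = [2, 4, 4, 2, 4, 1, 4]
print(count_subvl_changes(sequence))  # 6
```

With subvl in the prefix, the same sequence needs no extra state updates at all, since each instruction carries its own value.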
Comment 5 Jacob Lifshay 2023-05-03 04:38:32 BST
(In reply to Jacob Lifshay from comment #2)
> imho it is not sensible to remove subvl from the prefix because GPU code
> often has subvl different in each instruction
> 
> All of that is unless we do what some other GPU compilers do,
> <snip>essentially
> removing subvl completely from generated code.

because of that, I think we have 3 valid choices:
1. leave subvl as-is

2. remove subvl completely since it's mostly only useful for GPU code and we can do what other GPU compilers do to not need it at all (see the second paragraph of comment #2)

3. move subvl from the prefix to SVSTATE -- imho this renders subvl largely useless because GPU compilers are likely to need to follow option #2 due to the large number of setvl ops that would be required. This is also imho the worst option from the HW perspective since it needs to propagate extra state in the decode pipeline from setvl to instructions afterward because subvl would be constantly changing.

I prefer either 1 or 2.
Comment 6 Jacob Lifshay 2023-05-03 04:40:41 BST
(In reply to Jacob Lifshay from comment #5)
> 3. move subvl from the prefix to SVSTATE -- imho this renders subvl largely
> useless because GPU compilers are likely to need to follow option #2 due to
> the large number of setvl ops that would be required. This is also imho the
> worst option from the HW perspective since it needs to propagate extra state
> in the decode pipeline from setvl to instructions afterward because subvl
> would be constantly changing.

also this makes reserving registers even worse because it's now dynamic state instead of easily predictable from the instruction encoding and doesn't even eliminate the large register allocations possibly needed for subvl=4
Comment 7 Dmitry Selyutin 2023-05-03 05:09:50 BST
So we free 2 bits in RM (thus for each instruction) and move these two into SVSTATE (thus global). Before doing this, a few questions:
1. Do we have a use for the two bits freed in RM? Perhaps some specifiers can be simplified.
2. Do we have a real need for vec2/vec3/vec4 at all? If it can be emulated and emulation is cheap and/or this is rarely used functionality, I'd simply drop it.

Dropping vec from binutils and insndb is a relatively straightforward task; I wouldn't say the same about adding anything there.
Comment 8 Jacob Lifshay 2023-05-03 05:20:39 BST
(In reply to Dmitry Selyutin from comment #7)
> So we free 2 bits in RM (thus for each instruction) and move these two into
> SVSTATE (thus global).

or just remove subvl completely. this also conveniently frees up extra bits in SVSTATE (no need for tracking subvector steps) so we can make VL larger for future SVP64 versions and/or allows us to split MAXVL into OP_MAXVL and REG_MAXVL as proposed in the prefix-sum bug's comments.

> Before doing this, few questions:
> 1. Do we have use for two bits freed in RM? Perhaps some specifiers can be
> simplified.

I think we should expand all register specifiers to 3 bits if possible, it'd be nice to never have to worry about which registers can be encoded or not, since having to worry about that makes register allocation that much more complex...
Comment 9 Dmitry Selyutin 2023-05-03 05:31:04 BST
Jacob, perfect case with registers! Yes, that was one of the places where our modus operandi is doomed.
Comment 10 Luke Kenneth Casson Leighton 2023-05-03 09:31:00 BST
(In reply to Jacob Lifshay from comment #8)

> or just remove subvl completely. 

no. that destroys PACK/UNPACK.  please examine the specification.
let me catch up on the other comments later.
Comment 11 Jacob Lifshay 2023-05-03 09:54:03 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #8)
> 
> > or just remove subvl completely. 
> 
> no. that destroys PACK/UNPACK.

PACK/UNPACK can be entirely replaced by shuffles using matrix remaps (for in-register or in-memory) or strided ld/st (for in-memory), so the functionality isn't gone, it's just less accessible.

This is exactly what the strided ld/st instructions in vector ISAs are designed for, PACK/UNPACK is just a less flexible version (only supports interleaved data of all the same type with no holes) that works on more than just ld/st.

example:
# loading an array of rgb data into red, green, and blue vectors
# r3 = array ptr
# r4 = array len
# r32.. = red vector
# r40.. = green vector
# r48.. = blue vector
setvl 0, r4, 64, 0, 1, 1
lbz/elwid=8/els *r32, 3(r3)  # load every 3rd byte to r32..
addi r3, r3, 1  # increment starting address
lbz/elwid=8/els *r40, 3(r3)
addi r3, r3, 1
lbz/elwid=8/els *r48, 3(r3)
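
For readers less familiar with element-strided loads, the effect of the three loads above can be modelled in Python as below. This only models the addressing (a stride of 3 bytes, with the base bumped by 1 between loads); the helper name `strided_load_bytes` and the sample data are hypothetical.

```python
def strided_load_bytes(mem, base, stride, vl):
    """Load vl bytes starting at base, stepping by stride bytes each time."""
    return [mem[base + i * stride] for i in range(vl)]


# interleaved R G B bytes for 4 pixels (illustrative values)
mem = bytes([10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33])
vl = 4

red = strided_load_bytes(mem, 0, 3, vl)    # every 3rd byte from offset 0
green = strided_load_bytes(mem, 1, 3, vl)  # base incremented by 1
blue = strided_load_bytes(mem, 2, 3, vl)   # base incremented again

print(red)    # [10, 11, 12, 13]
print(green)  # [20, 21, 22, 23]
print(blue)   # [30, 31, 32, 33]
```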
Comment 12 Luke Kenneth Casson Leighton 2023-05-03 10:01:11 BST
(In reply to Jacob Lifshay from comment #2)
> imho it is not sensible to remove subvl from the prefix because GPU code
> often has subvl different in each instruction, so setvl or similar would
> need to be constantly rerun just to change subvl, this would easily be many
> times more setvl ops because otherwise setvl is usually run once per shader
> or once per function.

on its own that's a good enough reason for me.

> All of that is unless we do what some other GPU compilers do, which is to
> convert all code to scalar code and only then run the
> full-function-vectorization -- this essentially splits all subvl=2/3/4 ops
> into 2/3/4 copies of a subvl=1 op and never uses subvl=2/3/4, essentially
> removing subvl completely from generated code.

i have a feeling that this was what ultimately must have been decided
in RVV: early drafts (0.6) had subvl but i think it got dropped.  given
the GPU long-term goals i'm inclined to say "keep".
Comment 13 Dmitry Selyutin 2023-05-03 10:25:37 BST
Guys, even if we close this task with EWONTFIX, I still feel obliged to ask: do we have other options to simplify register remapping process and make it more consistent? This is one of the most gory places in the whole asm/disasm process, so I need to raise this question. :-)
Comment 14 Luke Kenneth Casson Leighton 2023-05-03 12:03:23 BST
(In reply to Dmitry Selyutin from comment #13)
> Guys, even if we close this task with EWONTFIX, I still feel obliged to ask:
> do we have other options to simplify register remapping process and make it
> more consistent? This is one of the most gory places in the whole asm/disasm
> process, so I need to raise this question. :-)

yes there's something called SVP64Single which *after* things stabilise
with SVP64(Vector) we can start looking at it. this page is a stub so
as to not forget that it exists
https://libre-soc.org/openpower/sv/svp64-single/

basically there are no reg-numbering holes in SVP64Single but there
is no looping as a result.
Comment 15 Luke Kenneth Casson Leighton 2023-05-03 12:04:50 BST
(In reply to Jacob Lifshay from comment #11)

> PACK/UNPACK can be entirely replaced by shuffles
> using matrix remaps (for
> in-register or in-memory)

or, you can apply PACK/UNPACK as a 4th dimension on top of matrix.

PACK/UNPACK is also the interaction-point on mv.swizzle.

*please* read the spec.
Comment 16 Luke Kenneth Casson Leighton 2023-05-03 12:37:24 BST
also Matrix REMAP is very expensive to set up, large latency expected,
four SVSHAPE SPRs to be written to, where PACK/UNPACK is 2 bits in
SVSHAPE and consequently extremely quick to establish. this has *all*
been written in detail in the spec, *please* read it.
Comment 17 Luke Kenneth Casson Leighton 2023-05-03 13:47:12 BST
(In reply to Jacob Lifshay from comment #4)

> as an example, take this not-unusual blur shader from supertuxkart:
> https://github.com/supertuxkart/stk-code/blob/
> fd12829e5b0c2f91fe522df5d8c74e3bb905b0cc/data/shaders/gaussian6h.frag
> it would need like 15 setvl ops, and that's without even unrolling that
> loop! something like >30 for the loop-unrolled version! If we instead keep
> the status-quo with support for subvl in the prefix, the entire function
> needs just one setvl op.

yowser. that's bad.  ok so it does make sense.


(In reply to Jacob Lifshay from comment #3)

> that applies equally to operations that write two registers such as dshl or
> all the FFT ops.

yes very true - those with an implicit RS=RT+MAXVL (elwidth overrides
notwithstanding).

overall then i think we just have to suck it up and go with /vec2-4 on
a per-instruction basis.  SVSTATE.hphint helps reduce some of the
Hazards (by declaring how many elements have no hazards).
Comment 18 Luke Kenneth Casson Leighton 2023-05-03 13:53:55 BST
(In reply to Dmitry Selyutin from comment #7)
> So we free 2 bits in RM (thus for each instruction) and move these two into
> SVSTATE (thus global). Before doing this, few questions:
> 1. Do we have use for two bits freed in RM? Perhaps some specifiers can be
> simplified.

yes.  the 5-bit mode is terribly cramped. modes that should be possible
together (independently) are mutually-exclusive.  source/dest-zeroing
is mostly disabled (crippled). and there's no future expansion room.

> 2. Do we have a real need in vec2/vec3/vec4 at all? If it can be emulated
> and emulation is cheap and/or this is a rarely used functionality, I'd
> simply drop it.

in GPU code and in vector pack/unpack it is incredibly useful.  imagine
being able to "unzip" RGB data, process it, then "zip" it back up again.
like this:

   https://libre-soc.org/openpower/sv/load-store.svg

so although the data (from memory or registers doesn't matter which)
could be R G B R G B you actually want to process all Rs together,
process all Gs together, process all Bs together, so UNPACK with
a vec3 will do precisely that: grab all R R R R R in sequence
grab all G G G G G in sequence grab all B B B B B in sequence.
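
The "unzip" ordering just described can be modelled as a change in element traversal order. The Python sketch below treats it as swapping the VL and SUBVL loops; that framing, and the names `element_order` and `gather`, are assumptions made for illustration rather than something stated in this thread.

```python
def element_order(vl, subvl, swapped=False):
    """Return (element, sub-element) index pairs in traversal order."""
    if swapped:
        # sub-element loop outermost: all of lane 0, then lane 1, ...
        return [(i, j) for j in range(subvl) for i in range(vl)]
    # default: each element's sub-elements are visited together
    return [(i, j) for i in range(vl) for j in range(subvl)]


def gather(data, vl, subvl, swapped=False):
    """Read interleaved data in the chosen traversal order."""
    return [data[i * subvl + j] for i, j in element_order(vl, subvl, swapped)]


pixels = list("RGBRGBRGBRGBRGB")  # VL=5 elements, each a vec3
print("".join(gather(pixels, 5, 3)))                # RGBRGBRGBRGBRGB
print("".join(gather(pixels, 5, 3, swapped=True)))  # RRRRRGGGGGBBBBB
```

The same swap applied in the other direction would "zip" separated R, G and B runs back into interleaved form.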

VSX *only* has dual (twin) pack/unpack.  it does *not* have 3-wide
and it does not have 4-wide.  vpkss.  they're fixed size only.
basically back around 2003 when screens were 15-bit RGB, IBM added some
hard-coded instructions to cope with that. then things moved on...
Comment 19 Dmitry Selyutin 2023-06-02 14:23:38 BST
As I understand, we're inclined to keep vec2/vec3/vec4. Therefore this should be closed with EWONTFIX. Correct?
Comment 20 Luke Kenneth Casson Leighton 2023-06-02 14:45:43 BST
(In reply to Dmitry Selyutin from comment #19)
> As I understand, we're inclined to keep vec2/vec3/vec4. Therefore this
> should be closed with EWONTFIX. Correct?

correct, it was a horror-show :)
Comment 21 Jacob Lifshay 2023-06-02 18:50:28 BST
evaluating is a valid task even though we decided to not do anything