Bug 535 - setvl/setvli encoding & future reg file expansion
Summary: setvl/setvli encoding & future reg file expansion
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: All All
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2020-11-30 20:31 GMT by Jacob Lifshay
Modified: 2021-01-29 00:15 GMT

See Also:
NLnet milestone: NLNet.2019.10.Standards
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 213
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2020-11-30 20:31:52 GMT
https://libre-soc.org/openpower/sv/setvl/
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html

important part is reserving 2-3 bits to indicate that the program expects a certain reg file size > 128
Comment 1 Luke Kenneth Casson Leighton 2020-12-01 04:55:56 GMT
XFX-Form with a single XO, and RT allocated to the register to be set RT=min(MVL, min(RT, VL)), there are 10 remaining bits to use to select VL, MVL, and still have space for expansion.  this without compromising on encoding space for MVL, assuming it's ok to have a bit that specifies to set MVL and VL at the same time.  OE=1 (bit 31) can be used to indicate "future encoding"

   bits 0:5   major op
   bits 6:10  (RT|0)
   bits 11:16 immed
   bit  17    set MVL to immed
   bit  18    set VL to immed
   bits 19:20 reserved
   bits 21:30 XO
   bit  31    reserved

totals 3 spare bits for future.

interesting thing about when bit 17 and 18 are zero, RT is set to VL (actually min(min(RT, VL), MVL)) but without first setting VL to the immediate.

recommend also having (RT|0) to indicate that VL is to be set (and MVL) but RT is not.

quite a bit of pseudocode but point is there is neither pressure to compromise on encoding space or reserved space.
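
The field layout proposed above can be sketched as a small decoder. This is only an illustration of the proposal in this comment, not a final encoding: the function and key names are hypothetical, and it assumes the usual Power ISA MSB0 numbering, where bit 0 is the most significant bit of the 32-bit word.

```python
# Sketch only: field positions are copied from the proposal in comment 1.
# All function and dictionary key names are hypothetical.

def field(insn, start, end):
    """Extract bits start..end (inclusive, MSB0 numbering) of a 32-bit word."""
    width = end - start + 1
    shift = 31 - end
    return (insn >> shift) & ((1 << width) - 1)


def decode_setvl_proposal(insn):
    """Split a 32-bit word into the fields of the proposed layout."""
    return {
        "major_op": field(insn, 0, 5),
        "rt": field(insn, 6, 10),
        "immed": field(insn, 11, 16),
        "set_mvl": field(insn, 17, 17),
        "set_vl": field(insn, 18, 18),
        "reserved_19_20": field(insn, 19, 20),
        "xo": field(insn, 21, 30),
        "reserved_31": field(insn, 31, 31),
    }


def encode_setvl_proposal(major_op, rt, immed, set_mvl, set_vl, xo):
    """Pack the fields back into a word, leaving the reserved bits as zero."""
    return ((major_op << 26) | (rt << 21) | (immed << 15)
            | (set_mvl << 14) | (set_vl << 13) | (xo << 1))
```

The two reserved fields together account for the 3 spare bits (bits 19:20 plus bit 31).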
Comment 2 Luke Kenneth Casson Leighton 2020-12-01 14:14:01 GMT
i took another look at RVV setvl/i and some example loops: a critical aspect of setvl/i is that there is an RT and an RA.

* RA is the "number of times that we would *like* to run the hw-loop"
* RT contains a copy of VL that is set to the *actual* hw-loop quantity
* at the end of the loop, the actual amount can be subtracted.

loop:
    vsetvli a3, a0   #  update a3 with vl (# of elements this iteration)
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?

if we do not have both RT and RA the "lack" of the same will require an additional mr (register copy) instruction.
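
A rough Python model of that loop may help show why both registers matter. It assumes the hardware simply picks vl = min(requested, mvl), which is a simplification, and `strip_mine` and `mvl` are hypothetical names used only for this sketch.

```python
def strip_mine(n, mvl):
    """Return the VL chosen on each pass of the strip-mined loop above.

    Assumption for illustration: vl = min(requested, mvl) on every pass.
    """
    vls = []
    remaining = n                  # plays the role of a0 (the RA input)
    while True:
        vl = min(remaining, mvl)   # vsetvli a3, a0: a3 receives the actual VL
        vls.append(vl)
        remaining -= vl            # sub a0, a0, a3
        if remaining == 0:         # bnez a0, loop
            break
    return vls
```

For example, `strip_mine(10, 4)` gives three passes of 4, 4 and 2 elements. The requested count and the returned VL are distinct values on every pass, which is why a single register field would need an extra copy.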
Comment 3 Luke Kenneth Casson Leighton 2020-12-01 15:38:27 GMT
nuts.  with the need for both RA and RT the number of bits required is
increased by 5.  that needs to do the crops trick of taking up 5 bits of
XO.

so that'd be an entire column of 32 (XO[0:4]) for use as other fields,
whilst XO[5:9] would be a constant (selecting one of the columns, if
you look at v3.0B p1156 Table 20 Book III Appendix C)
Comment 4 Luke Kenneth Casson Leighton 2020-12-01 16:16:14 GMT
ha! totally cool! one of the things that can be done is, Rc=1 will set CR0 based on the new value of VL.

that means that no extra test of "VL >= 0" would be needed in some circumstances, meaning that an entire instruction could be saved in some types of loops.  cool!
Comment 5 Jacob Lifshay 2020-12-01 17:01:52 GMT
I would expect that the final encoding efficiently support the 2 most common usecases:
1. set VL to a constant
mvl = immed1
vl = immed1

2. set VL to a variable
mvl = immed1
vl = min(RA, mvl)
RT = vl

We don't generally need the option to not set mvl or not set vl, since that's only relevant in very obscure usecases (can't think of any at the moment). I'm assuming you can just use a mfspr to read vl. Honestly, if vl can only be set using setvl[i], we can just get rid of mvl entirely, since no other instructions can increase VL (some can decrease it). If mvl is not visible, then using `setvl 0, ra, 64` is sufficient to set vl to any supported value for context switching.

mtspr could be entirely equivalent to `setvl 0, ra, 64` or just not supported (always traps no matter privilege mode) for writing VL.
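
The two usecases above can be written out as a minimal Python model. This is a sketch of the semantics as described in this comment only; `SVState`, `setvli` and `setvl` are hypothetical names for illustration, not the spec's pseudocode.

```python
from dataclasses import dataclass


@dataclass
class SVState:
    """Hypothetical container for the two pieces of vector-length state."""
    mvl: int = 0
    vl: int = 0


def setvli(state, immed):
    """Usecase 1: set MVL and VL to the same constant."""
    state.mvl = immed
    state.vl = immed


def setvl(state, ra_value, immed):
    """Usecase 2: MVL from the immediate, VL clamped to MVL, result goes to RT."""
    state.mvl = immed
    state.vl = min(ra_value, state.mvl)
    return state.vl  # the value that would be written to RT
```

With this model, `setvl(state, 1000, 64)` returns 64, and `setvl(state, 40, 64)` returns 40.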
Comment 6 Luke Kenneth Casson Leighton 2020-12-01 17:23:10 GMT
(In reply to Jacob Lifshay from comment #5)
> I would expect that the final encoding efficiently support the 2 most common
> usecases:
> 1. set VL to a constant
> mvl = immed1
> vl = immed1
> 
> 2. set VL to a variable
> mvl = immed1
> vl = min(RA, mvl)
> RT = vl
> 
> We don't generally need the option to not set mvl or not set vl, since
> that's only relevant in very obscure usecases (can't think of any at the
> moment). 

setting VL but not MVL is needed to not disrupt a previously-set MVL

setting MVL but not VL is needed likewise to not disrupt (except by truncation VL=min(VL, MVL)) a previously-set VL.

> I'm assuming you can just use a mfspr to read vl.

by wasting one extra instruction, by that instruction being a 100% unavoidable mandatory part of absolutely all and any loops...

... yes.

[translation: not a good idea]

> Honestly, if vl
> can only be set using setvl[i], we can just get rid of mvl entirely, 

mvl defines the number of elements that the vector is to cover.

unlike in RVV it is not a hard architectural quantity.

it is therefore mandatory to have MVL be part of the setvli instruction, otherwise how can SV know where to stop overwriting the regfile?

> mtspr could be entirely equivalent to `setvl 0, ra, 64` or just not
> supported (always traps no matter privilege mode) for writing VL.

no, use of mtspr is definitely not equivalent because that cuts out the setting of RT.

the thing is that setvl/i is actually quite complex.  it took me several weeks to understand it fully in RVV, and then even longer to realise and accept that the capabilities of RVV setvl were needed (in full)... *and in addition* the ability to set MVL at the same time was required.

the critical, critical part of setvl is this:

     RT = VL = min(min(VL, MAXVL), RA)

if all those calculations are not included then SV is severely penalised because it requires, at the bare minimum, two to three mandatory *additional* instructions (in an inner loop!) and at least one other as fixed overhead (outside the loop) just to achieve the same effect.


suggestions to only use mtspr, mfspr, cut out mvl as an immediate are absolutely guaranteed to destroy the value of SV as an efficient instruction-compact Vectorisation system, or in the case of cutting out mvl destroy it entirely.

setvl needs extremely careful study to properly understand.  it's best to start from examples, such as the daxpy one at sigarch.  there are a few others (strncpy) but the daxpy one is the easiest to start from
Comment 7 Jacob Lifshay 2020-12-01 17:44:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > I would expect that the final encoding efficiently support the 2 most common
> > usecases:
> > 1. set VL to a constant
> > mvl = immed1
> > vl = immed1
> > 
> > 2. set VL to a variable
> > mvl = immed1
> > vl = min(RA, mvl)
> > RT = vl
> > 
> > We don't generally need the option to not set mvl or not set vl, since
> > that's only relevant in very obscure usecases (can't think of any at the
> > moment). 
> 
> setting VL but not MVL is needed to not disrupt a previously-set MVL
> 
> setting MVL but not VL is needed likewise to not disrupt (except by
> truncation VL=min(VL, MVL)) a previously-set VL.

Yes, but practically speaking, what's the usecase for setting VL or MVL separately from each other? The compiler knows MVL's constant value since it did register allocation, so can just supply it to all setvl[i] instructions. no mvl spr needed.

> > I'm assuming you can just use a mfspr to read vl.
> 
> by wasting one extra instruction, by that instruction being a 100%
> unavoidable mandatory part of absolutely all and any loops...

did you notice that setvl writes to an output register?

> ... yes.
> 
> [translation: not a good idea]
> 
> > Honestly, if vl
> > can only be set using setvl[i], we can just get rid of mvl entirely, 

I should clarify, I only meant removing the mvl register, not the mvl immediate field or the calculation using mvl.

> > mtspr could be entirely equivalent to `setvl 0, ra, 64` or just not
> > supported (always traps no matter privilege mode) for writing VL.
> 
> no, use of mtspr is definitely not equivalent because that cuts out the
> setting of RT.

I didn't say mtspr was equivalent to all setvl instructions, just that specific one which opts out of writing RT by setting the register field to 0.

> the thing is that setvl/i is actually quite complex.  it took me several
> weeks to understand it fully in RVV, and then even longer to realise and
> accept that the capabilities of RVV setvl were needed (in full)... *and in
> addition* the ability to set MVL at the same time was required.
> 
> the critical, critical part of setvl is this:
> 
>      RT = VL = min(min(VL, MAXVL), RA)

yup, that's completely retained. what isn't needed is keeping the maxvl value around in a spr for later, since the compiler knows its value as a compile-time constant (it has to since it allocated the registers) and can just put maxvl in the immediate of all relevant setvl[i] instructions.

We can even include setting CR0 (if Rc = 1) to allow jumps on VL == 0 immediately after setvl.

For setvli, since the value VL is set to is a constant, having a destination register is much less important.

> if all those calculations are not included then SV is severely penalised
> because it requires, at the bare minimum, two to three mandatory
> *additional* instructions (in an inner loop!) and at least one other as
> fixed overhead (outside the loop) just to achieve the same effect.
> 
> 
> suggestions to only use mtspr, mfspr, cut out mvl as an immediate are
> absolutely guaranteed to destroy the value of SV as an efficient
> instruction-compact Vectorisation system, or in the case of cutting out mvl
> destroy it entirely.

yup, hence why that's not what I'm proposing.
Comment 8 Luke Kenneth Casson Leighton 2020-12-01 18:37:20 GMT
whoops hit send

(In reply to Jacob Lifshay from comment #7)

> did you notice that setvl writes to an output register?

yes.  optionally set MVL, optionally set VL, optionally set RT.

i figured, "why the heck not".  

the write to the RT register, when sourced from RA, is how loops are efficiently encoded.

[at one point i advocated "stuff having a separate VL SPR, actually mark one of the INT regs *as* VL" however the complexity at the backend, with the so-marked GPR effectively becoming a Read Hazard to EVERY instruction... this was too much to think through.  it would be beautiful, though: setvl goes from the inner loop at the least].


> I should clarify, I only meant removing the mvl register, not the mvl
> immediate field or the calculation using mvl.

ah ok.  see below, i added c.setvl and c.setmvli

> > > mtspr could be entirely equivalent to `setvl 0, ra, 64` or just not
> > > supported (always traps no matter privilege mode) for writing VL.
> > 
> > no, use of mtspr is definitely not equivalent because that cuts out the
> > setting of RT.
> 
> I didn't say mtspr was equivalent to all setvl instructions, just that
> specific one which opts out of writing RT by setting the register field to 0.

that simply does not put VL into RT.  VL is still set... just not transferred into RT.

this is for circumstances where there is no loop, but a (single) sequence of Vector Ops would save a huge number of instructions.

common circumstances for this include entry points to functions, where a long contiguous run of registers needs to be stored on the stack.

the history of computing ISAs as you are no doubt aware is littered with Bad Examples Of How Not To Do That.  even OpenPOWER has had the good sense to retire load/store-multi, and ARM retired the same (all but the 2-reg variants)

there will be plenty more like that, including accidental occurrences of sequential use of registers in uniform-looking structs.

a loop would be inappropriate, a "normal" vector regfile wouldn't work, however SV by a complete coincidence has what's needed.

and for these one-offs, reading VL or RA, or writing VL to RT, these are not necessary.

all you want is:

    setvli VL=MVL=5
    v.ld ra, rb # load 5 regs from 5 adrs

or:

    setvli VL=MVL=5
    v.mv ra, rb # copy 5 regs


> > the thing is that setvl/i is actually quite complex.  it took me several
> > weeks to understand it fully in RVV, and then even longer to realise and
> > accept that the capabilities of RVV setvl were needed (in full)... *and in
> > addition* the ability to set MVL at the same time was required.
> > 
> > the critical, critical part of setvl is this:
> > 
> >      RT = VL = min(min(VL, MAXVL), RA)
> 
> yup, that's completely retained. what isn't needed is keeping the maxvl
> value around in a spr for later, 

i think i get what you are saying.  it is: if the only location where MVL is specified is in the setvli instruction, if the only place it is ever used is in this instruction, then it effectively becomes local state and need not be stored in an SPR.

the answer to that is in the form of the two Compressed instructions i added 1hr ago:

   * setmvli immed
   * setvl rt, ra

by a nice coincidence there happened to be a non-register immediate spare slot with 6 bits free for an immediate, and another spare slot in the 16 bit logical brownfield encoding.

this covers the majority use-cases: setting MVL outside the loop (16bit), setting VL inside the loop (16bit).

if MVL has been set to a fixed quantity for several loops (start of a function) 16 bits are saved by way of splitting MVL setting from VL setting.


> since the compiler knows its value as a
> compile-time constant (it has to since it allocated the registers) and can
> just put maxvl in the immediate of all relevant setvl[i] instructions.

and compressed.setmvli


> We can even include setting CR0 (if Rc = 1) to allow jumps on VL == 0
> immediately after setvl.

awesome, isn't it? :)  i love CRs.  see comment #4 i put Rc support in.  i mean, it's part of XO-form so why not.

> For setvli, since the value VL is set to is a constant, having a destination
> register is much less important.

no, it's critical.  without it, loops are forced to read VL after the setvl instruction.

this increases inner loop overhead by one instruction.  given that some vector loops will only be 5-6 ops this is a whopping 15-20% increase.
Comment 9 Luke Kenneth Casson Leighton 2020-12-01 18:43:15 GMT
whoops hit send by accident


loop:
      setvl VL=min(MVL,r5) # without vl
                           # into dest
      mfspr r3, VL # forced to add this
                   # as an extra op
      # ok now we know the amount
      # of vector elements that will be
      # done (in r3), r3 can be sub'd
      sub. r5, r5, r3 # copy of VL sub'd
      bne loop # r5 not zero, go again

unless VL is copied into RT, the mfspr is 100% mandatory and that's an entire instruction, mandatory overhead, in inner loops.
Comment 10 Luke Kenneth Casson Leighton 2020-12-01 18:48:35 GMT
(In reply to Luke Kenneth Casson Leighton from comment #8)

>     setvli VL=MVL=5
>     v.mv ra, rb # copy 5 regs

there's a whole stack of usecases like this, and it was deemed sufficiently valuable and useful for RT=0 to have made it into RVV.

likewise RA=0 was deemed valuable and useful (see wiki page with link to online RVV setvl)

what they couldn't think of a use for is RA=0,RT=0 however in our case this can be used to set MVL or VL.
Comment 11 Jacob Lifshay 2020-12-01 18:48:44 GMT
(In reply to Luke Kenneth Casson Leighton from comment #8)
> the answer to that is in the form of the two Compressed instructions i added
> 1hr ago:
> 
>    * setmvli immed
>    * setvl rt, ra

Now, that's a good reason to keep a mvl spr.

> > We can even include setting CR0 (if Rc = 1) to allow jumps on VL == 0
> > immediately after setvl.
> 
> awesome, isn't it? :)  i love CRs.  see comment #4 i put Rc support in.  i
> mean, it's part of XO-form so why not.

yup, that's what I was referencing.

> > For setvli, since the value VL is set to is a constant, having a destination
> > register is much less important.
> 
> no, it's critical.  without it, loops are forced to read VL after the setvl
> instruction.

you're talking about setvl, not setvli. setvli sets both VL and MVL to a known constant for use with fixed-length vectors, no need to loop.
Comment 12 Luke Kenneth Casson Leighton 2020-12-01 19:23:18 GMT
(In reply to Jacob Lifshay from comment #11)

> > > For setvli, since the value VL is set to is a constant, having a destination
> > > register is much less important.

> your talking about setvl, not setvli. setvli sets both VL and MVL to a known
> constant for use with fixed-length vectors, no need to loop.

ohh yepyep got it.

ok, i know what happened.  the instruction as designed (and pseudocode) is a hybrid all-in-one, where setvli effectively becomes a pseudo-op not a real op.

so i misunderstood, with setvl being synonymous with setvli.

my feeling is, it's not worth having separate setvl and setvli instructions, such that setvli not having a dest reg is moot.

in *compressed*, do we want a separate setvli? mmm... maayyybeee... although i don't think there's space.


(In reply to Jacob Lifshay from comment #11)

> > > We can even include setting CR0 (if Rc = 1) to allow jumps on VL == 0
> > > immediately after setvl.
> > 
> > awesome, isn't it? :)  i love CRs.  see comment #4 i put Rc support in.  i
> > mean, it's part of XO-form so why not.
> 
> yup, that's what I was referencing.

ok,ok, this is hilarious: if we allow setvl to be an SV-P48 prefixable instruction, it *might* be possible (stress: might) to get CR0 retargetted at an alternative CR.

one downside of Rc=1 is you can't specify an alternative CR, end result you have to move it to another CR then the bc can use that alternative target.

but... if SV-P48 can set the alternative to CR0 the extra instruction is saved.
Comment 13 Luke Kenneth Casson Leighton 2020-12-01 19:27:35 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)

> in *compressed*, do we want a separate setvli? mmm... maayyybeee... although
> i don't think there's space.

there is! wha-hey!
Comment 14 Jacob Lifshay 2020-12-01 19:31:21 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)
> in *compressed*, do we want a separate setvli? mmm... maayyybeee... although
> i don't think there's space.

sounds like the perfect place to do some immediate reencoding to just use common values, giving us 1 bit for setvli vs. setmvli

> 
> (In reply to Jacob Lifshay from comment #11)
> 
> > > > We can even include setting CR0 (if Rc = 1) to allow jumps on VL == 0
> > > > immediately after setvl.
> > > 
> > > awesome, isn't it? :)  i love CRs.  see comment #4 i put Rc support in.  i
> > > mean, it's part of XO-form so why not.
> > 
> > yup, that's what I was referencing.
> 
> ok,ok, this is hilarious: if we allow setvl to be an SV-P48 prefixable
> instruction, it *might* be possible (stress: might) to get CR0 retargetted
> at an alternative CR.
> 
> one downside of Rc=1 is you can't specify an alternative CR, end result you
> have to move it to another CR then the bc can use that alternative target.

can't bc just use cr0 as-is?
Comment 15 Luke Kenneth Casson Leighton 2020-12-01 19:39:32 GMT
(In reply to Jacob Lifshay from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > in *compressed*, do we want a separate setvli? mmm... maayyybeee... although
> > i don't think there's space.
> 
> sounds like the perfect place to do some immediate reencoding to just use
> common values, giving us 1 bit for setvli vs. setmvli

do have a quick look at the imm table, see if it's worthwhile.
 
> > ok,ok, this is hilarious: if we allow setvl to be an SV-P48 prefixable
> > instruction, it *might* be possible (stress: might) to get CR0 retargetted
> > at an alternative CR.
> > 
> > one downside of Rc=1 is you can't specify an alternative CR, end result you
> > have to move it to another CR then the bc can use that alternative target.
> 
> can't bc just use cr0 as-is?

yes, but think about it: intervening ops between the setvl and the branchpoint will likely have trashed cr0 (other Rc=1 ops).  if either the intervening ops can be retargetted or both the setvl and bc are retargetted..

.. of course, the irony is that if the overhead of using two SV-P48 prefixes comes to 32 bit, you might as well just eat the cr move.  unless you use the 16-bit variant.
Comment 16 Jacob Lifshay 2020-12-01 20:07:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
> > (In reply to Luke Kenneth Casson Leighton from comment #12)
> > > ok,ok, this is hilarious: if we allow setvl to be an SV-P48 prefixable
> > > instruction, it *might* be possible (stress: might) to get CR0 retargetted
> > > at an alternative CR.
> > > 
> > > one downside of Rc=1 is you can't specify an alternative CR, end result you
> > > have to move it to another CR then the bc can use that alternative target.
> > 
> > can't bc just use cr0 as-is?
> 
> yes, but think about it: intervening ops between the setvl and the
> branchpoint will likely have trashed cr0 (other Rc=1 ops).  if either the
> intervening ops can be retargetted or both the setvl and bc are retargetted..

Umm, wouldn't it be just:

my_fn:
li r3, 1000
setvl. r4, r3, 64
beq cr0, end
loop:
sub r3, r3, r4
...
setvl. r4, r3, 64
bne cr0, loop
end:
blr
Comment 17 Luke Kenneth Casson Leighton 2020-12-01 20:45:01 GMT
(In reply to Jacob Lifshay from comment #16)

> Umm, wouldn't it be just:
> 
> my_fn:
> li r3, 1000
> setvl. r4, r3, 64
> beq cr0, end
> loop:
> sub r3, r3, r4
> ...
> setvl. r4, r3, 64
> bne cr0, loop
> end:
> blr

err... yes! :)

or:

my_fn:
  li r3, 1000
  b test
loop:
  sub r3, r3, r4
  ...
test:
  setvl. r4, r3, 64
  bne cr0, loop
end:
  blr

which saves one instruction and, at the same time (just as you also wrote) avoids running empty vector instructions if VL=0 on the first iteration.

that was always an odd quirk of all RVV examples.

i don't know why, the use of CR0 just feels more natural than using a sub. on r3 although doing so is effectively exactly the same thing.

let's hope that setting VL does not involve huge OoO resets/delays when it comes to implementation.
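
As a sanity check, the control flow of the "b test" variant can be modelled in Python. This is a sketch under the assumption that `setvl.` computes VL = min(r3, 64) and that Rc=1 sets the CR0 eq bit when the new VL is zero; `run_loop` is a hypothetical name.

```python
def run_loop(n, mvl=64):
    """Model the 'b test' loop. Returns (body_runs, elements_processed)."""
    r3 = n                     # li r3, 1000
    body_runs = 0
    processed = 0
    while True:
        # test: setvl. r4, r3, 64   (Rc=1 updates CR0 from the new VL)
        r4 = min(r3, mvl)
        cr0_eq = (r4 == 0)
        if cr0_eq:             # bne cr0, loop falls through when VL == 0
            break
        # loop: sub r3, r3, r4, followed by the vector body
        r3 -= r4
        body_runs += 1
        processed += r4
    return body_runs, processed
```

Under these assumptions, 1000 elements take 16 passes (15 of 64 elements and one of 40), and a count of zero never enters the loop body at all.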
Comment 18 Luke Kenneth Casson Leighton 2020-12-23 19:33:39 GMT
jacob says VL should be allowed to be set to zero because
then you can have a CR branch-condition test.
Comment 19 Luke Kenneth Casson Leighton 2020-12-23 19:54:26 GMT
(In reply to Luke Kenneth Casson Leighton from comment #18)
> jacob says VL should be allowed to be set to zero because
> then you can have a CR branch-condition test.

right: ok, so this is possible: VL can indeed be set to zero.  if the source
register RA contains zero, VL will be set to zero.  then, when that happens (as long as you used "setvl." i.e. Rc=1 mode) the CR will indeed be set so that its eq-to-zero flag will be raised.

however, when setting from the *immediate*, it makes no sense to me for setvli to be able to set VL to zero.
Comment 20 Jacob Lifshay 2020-12-23 20:04:00 GMT
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > jacob says VL should be allowed to be set to zero because
> > then you can have a CR branch-condition test.
> 
> right: ok, so this is possible: VL can indeed be set to zero.  if the source
> register RA contains zero, VL will be set to zero.  then, when that happens
> (as long as you used "setvl." i.e. Rc=1 mode) the CR will indeed be set so
> that its eq-to-zero flag will be raised.

sounds good!

> 
> however, when setting from the *immediate*, it makes no sense to me for
> setvli to be able to set VL to zero.

yup, the field being zero can instead mean VL=64
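
A minimal sketch of that immediate mapping, assuming the 6-bit immediate field from the layout in comment 1 (the function names are hypothetical):

```python
def decode_vl_immediate(field):
    """Map a 6-bit immediate field to a VL value, with 0 standing for 64."""
    if not 0 <= field < 64:
        raise ValueError("field must fit in 6 bits")
    return 64 if field == 0 else field


def encode_vl_immediate(vl):
    """Inverse mapping: VL values 1..64 to the 6-bit field (64 encodes as 0)."""
    if not 1 <= vl <= 64:
        raise ValueError("VL immediate must be in 1..64")
    return vl % 64
```

This way the 6 bits cover 1..64 with no wasted encoding, and a VL of zero remains reachable only through the register (RA) form.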