Bug 1080 - allowing LD/ST-Update to select individual registers needed
Summary: allowing LD/ST-Update to select individual registers needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 1047 1079 1056
Reported: 2023-05-08 18:02 BST by Luke Kenneth Casson Leighton
Modified: 2023-09-03 10:54 BST
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
Description Luke Kenneth Casson Leighton 2023-05-08 18:02:32 BST
offsets by one are only possible with EXTRA3 (or scalar registers).
some thought is needed on how to turn several LDST instructions
from EXTRA2 to EXTRA3.
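
to illustrate why offsets by one need EXTRA3, here is a minimal sketch. it assumes the R*_EXTRA2 / R*_EXTRA3 register-number extension as laid out in the SVP64 spec tables (2-bit spec: vector registers start at r0 or r2 and step by 4; 3-bit spec: vector registers can start at any of r0..r3). the function names are hypothetical and the layout should be checked against the spec before relying on it.

```python
def decode_extra2(spec, reg5):
    """Assumed EXTRA2 decode. spec is the 2-bit extension, reg5 the 5-bit
    register field from the instruction. Returns (is_vector, regnum)."""
    is_vec = bool(spec & 0b10)
    lsb = spec & 0b01
    if is_vec:
        # vector: the 5-bit field becomes the upper bits, so only
        # r0, r2, r4, ... are reachable as a vector start
        return True, (reg5 << 2) | (lsb << 1)
    # scalar: the spare bit extends the field upwards (r0-r63)
    return False, (lsb << 5) | reg5


def decode_extra3(spec, reg5):
    """Assumed EXTRA3 decode. spec is the 3-bit extension."""
    is_vec = bool(spec & 0b100)
    low2 = spec & 0b011
    if is_vec:
        # vector: both spare bits are low-order, so any of r0-r127 works
        return True, (reg5 << 2) | low2
    return False, (low2 << 5) | reg5


def has_adjacent(regs):
    """True if some register n and its neighbour n+1 are both reachable."""
    return any(r + 1 in regs for r in regs)


extra2_vec = {decode_extra2(s, r)[1] for s in (0b10, 0b11) for r in range(32)}
extra3_vec = {decode_extra3(s, r)[1] for s in range(4, 8) for r in range(32)}

print("EXTRA2 vector starts adjacent?", has_adjacent(extra2_vec))
print("EXTRA3 vector starts adjacent?", has_adjacent(extra3_vec))
```

under these assumptions an EXTRA2 vector operand can only start on an even register, so a pair such as RT=*r5, RA=*r4 cannot be expressed, whereas EXTRA3 (or scalar operands) can name neighbouring registers.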

see SV CSVs https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isatables;hb=HEAD

stwu,LDST_IMM,,2P,EXTRA2,EN,d:RA,s:RS,s:RA,0,RA_OR_ZERO,0,RS,0,0,0,RA

this is easy, make RA-src same as RA-dest

lwzu,LDST_IMM,,2P,EXTRA2,EN,d:RT,d:RA,s:RA,0,RA_OR_ZERO,0,0,RT,0,0,RA

likewise.

lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA

these become harder as the encoding space is only 6 bits (and there
are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
of EXTRA

Field Name	Field bits	Description
Rdest_EXTRA2	10:11	extends Rdest (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2	12:13	extends Rsrc1 (R*_EXTRA2 Encoding)
Rsrc2_EXTRA2	14:15	extends Rsrc2 (R*_EXTRA2 Encoding)
MASK_SRC	16:18	Execution Mask for Source

Field Name	Field bits	Description
Rdest_EXTRA2	10:11	extends Rdest (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2	12:13	extends Rsrc1 (R*_EXTRA2 Encoding)
Rdest2_EXTRA2	14:15	extends Rdest2 (R*_EXTRA2 Encoding)
MASK_SRC	16:18	Execution Mask for Source

some analysis of options and consequences is needed.
Comment 1 Luke Kenneth Casson Leighton 2023-05-08 20:01:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)

> lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> 
> these become harder as the encoding space is only 6 bits (and there
> are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> of EXTRA

this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER

> MASK_SRC	16:18	Execution Mask for Source

so has to stay. that leaves just 6 bits to cover 3 registers.

here's the bits of RM:

Field Name	Field bits	Description
MASKMODE	0	Execution (predication) Mask Kind
MASK	1:3	Execution Mask
SUBVL	8:9	Sub-vector length
ELWIDTH	4:5	Element Width
ELWIDTH_SRC	6:7	Element Width for Source
EXTRA	10:18	Register Extra encoding
MODE	19:23	changes Vector behaviour

can't lose mask. can't lose SUBVL (priority for Pack/Unpack, already
discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
that have?

* Vector of RB offsets could no longer be compressed
* SEA becomes pointless

could ELWIDTH instead be considered, and the operation width
(ld lw lh lb) be used in its place?

* yes as long as losing saturation and sign-extending is ok.
  (setting a larger ELWIDTH than the operation is a way to
   do zero or sign extending without needing intermediary
   registers to perform the extsb/h/w. losing
   ELWIDTH would require the extra instruction
   and registers).
Comment 2 Jacob Lifshay 2023-05-08 21:51:05 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> 
> > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > 
> > these become harder as the encoding space is only 6 bits (and there
> > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > of EXTRA
> 
> this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER

please define VINDEX -- it is non-standard terminology -- do you mean load/store with index remap? that's basically gather/scatter but done using a different mechanism.

actually, assuming the above definition of VINDEX, none of splat/gather/scatter (also includes VINDEX since that's basically gather/scatter) need more than one predicate. They work just fine on other ISAs with at most one predicate (e.g. RVV and AVX2/AVX512 all have separate splat/scatter/gather/compress/expand instructions that only have 1 predicate). The only load/store ops that need more than one predicate are compress/expand load/store (since they are only expressible by twin-predication in SVP64 since there are no dedicated compress/expand instructions or SVP64 MODEs), which can easily be done using ld/std (and maybe the *u or *x versions, but not both) instead of ldux/stdux. 

iirc the plan was originally to have twin-predication only on 1-in/1-out operations, which ldux/stdux clearly are not.

> 
> > MASK_SRC	16:18	Execution Mask for Source
> 
> so has to stay. that leaves just 6 bits to cover 3 registers.
> 
> here's the bits of RM:
> 
> Field Name	Field bits	Description
> MASKMODE	0	Execution (predication) Mask Kind
> MASK	1:3	Execution Mask
> SUBVL	8:9	Sub-vector length
> ELWIDTH	4:5	Element Width
> ELWIDTH_SRC	6:7	Element Width for Source
> EXTRA	10:18	Register Extra encoding
> MODE	19:23	changes Vector behaviour
> 
> can't lose mask. can't lose SUBVL (priority for Pack/Unpack, already
> discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
> that have?
> 
> * Vector of RB offsets could no longer be compressed
> * SEA becomes pointless
> 
> could ELWIDTH instead be considered, and the operation width
> (ld lw lh lb) be used in its place?
> 
> * yes as long as losing saturation and sign-extending is ok.

simple -- just set ELWIDTH larger than the load op and the load op intrinsically will do the sign/zero extend, no need for SVP64 to add sign/zero extension on top of that. (with the sole exception of signed bytes, thanks PowerISA for being non-orthogonal)

saturation can still be done -- saturating from the load's type to the dest type (ELWIDTH + saturation's unsigned/signed bit).

so this removes any need for ELWIDTH_SRC on any load/store ops afaict.
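
a minimal sketch of the idea above: the load instruction itself decides how the loaded value is extended (lhz vs lha, lwz vs lwa; Power ISA has lbz but no sign-extending byte load), and ELWIDTH plus the saturation sign bit decide how the result lands in the destination element. the function names are hypothetical, and saturation is assumed here to be a plain clamp to the destination range.

```python
def extend(raw, op_bits, signed):
    """Interpret the low op_bits of raw the way the load op would."""
    raw &= (1 << op_bits) - 1
    if signed and raw >> (op_bits - 1):
        raw -= 1 << op_bits          # two's-complement sign extension
    return raw


def saturate(value, dest_bits, signed):
    """Clamp value into the range of a dest_bits-wide element."""
    if signed:
        lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_bits) - 1
    return max(lo, min(hi, value))


def load_element(raw, op_bits, op_signed, elwidth, sat_signed=None):
    """Load at the op's width, then fit into an elwidth-wide destination.
    sat_signed=None means no saturation (the value simply wraps)."""
    value = extend(raw, op_bits, op_signed)
    if sat_signed is None:
        return value & ((1 << elwidth) - 1)
    return saturate(value, elwidth, sat_signed)


# lha-style halfword into a 32-bit element: sign extension comes for free
print(hex(load_element(0x8000, 16, True, 32)))
# lhz-style halfword saturated into an unsigned 8-bit element
print(load_element(0x1234, 16, False, 8, sat_signed=False))
```

in this model no separate source element width is consulted anywhere: the op width plays that role.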
Comment 3 Luke Kenneth Casson Leighton 2023-05-08 23:22:43 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > 
> > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > 
> > > these become harder as the encoding space is only 6 bits (and there
> > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > of EXTRA
> > 
> > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> 
> please define VINDEX 

sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
name for it.


>-- it is non-standard terminology -- do you mean
> load/store with index remap?

no. i would have said Indexed REMAP.

> predicate). The only load/store ops that need more than one predicate are
> compress/expand load/store

i.e. all of them (as far as the actual scalar ld/sts are concerned)

> iirc the plan was originally to have twin-predication only on 1-in/1-out
> operations, which ldux/stdux clearly are not.

the address (EA) is considered to be "1" in this case.
 
> > could ELWIDTH instead be considered, and the operation width
> > (ld lw lh lb) be used in its place?
> > 
> > * yes as long as losing saturation and sign-extending is ok.
> 
> simple -- just set ELWIDTH larger than the load op and the load op
> intrinsically will do the sign/zero extend, no need for SVP64 to add
> sign/zero extension on top of that. (with the sole exception of signed
> bytes, thanks PowerISA for being non-orthogonal)

deep joy.

and it isn't _particularly_ useful to do shorter (load then truncate,
that's just dumb).

> saturation can still be done -- saturating from the load's type to the dest
> type (ELWIDTH + saturation's unsigned/signed bit).
> 
> so this removes any need for ELWIDTH_SRC on any load/store ops afaict.

okaay. now we are cooking with gas.

next stage, given two free bits, is to work out what regs can be
expanded from EXTRA2 to EXTRA3.

* lwzux RT,RA,RB

if vectorised and used for list-pointer-chaining, it is RT and RA that
must be allowed to be one-different.  RB, because it is not updated,
need not be EXTRA3.

* stdux RS,RA,RB

likewise.

aieee this is going to be fun.
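
the bit budget can be enumerated with a short sketch. it assumes (as proposed above, not yet decided) that the two ELWIDTH_SRC bits are handed over to register extension, that MASK_SRC keeps its 3 bits, and that each register gets either a 2-bit (EXTRA2) or 3-bit (EXTRA3) spec. the names are illustrative only.

```python
from itertools import product

EXTRA_BITS = 9          # RM bits 10:18
MASK_SRC_BITS = 3       # twin-predication source mask, must stay
ELWIDTH_SRC_BITS = 2    # candidate bits to reuse (assumption)
REGS = ("RT/RS", "RA", "RB")


def allocations(budget):
    """Every way to give each register 2 or 3 extension bits within budget."""
    fits = []
    for widths in product((2, 3), repeat=len(REGS)):
        if sum(widths) <= budget:
            fits.append(dict(zip(REGS, widths)))
    return fits


today = EXTRA_BITS - MASK_SRC_BITS        # 6 bits for 3 registers
proposed = today + ELWIDTH_SRC_BITS       # 8 bits for 3 registers

print("today:", allocations(today))
print("proposed:", allocations(proposed))
```

with 6 bits only all-EXTRA2 fits. with 8 bits, RT/RS and RA can both be EXTRA3 while RB stays EXTRA2 (3+3+2), which matches the list-pointer-chaining requirement above. all three at EXTRA3 would need 9 bits and still does not fit.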
Comment 4 Jacob Lifshay 2023-05-08 23:49:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > 
> > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > > 
> > > > these become harder as the encoding space is only 6 bits (and there
> > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > of EXTRA
> > > 
> > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> > 
> > please define VINDEX 
> 
> sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> name for it.

the standard name is extractelement or extract https://llvm.org/docs/LangRef.html#extractelement-instruction

imho it may be more efficient to simply add r3 to the load address and perform a scalar load (optionally SVP64 prefixed) rather than setting sm=1<<r3, since that's much simpler and simple hardware then won't issue VL load ops for only one of them to succeed.

extractelement is only really useful when extracting from a vector already in registers, since you can't always just add to the address for that.
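
a rough model of the comparison, assuming naive hardware that steps through all VL elements even when the predicate has a single bit set. memory is a plain dict and the function names are hypothetical. note that the scalar form needs the index pre-scaled to a byte offset, since ldx adds RB to the address unscaled.

```python
def predicated_vector_load(mem, base, vl, mask, width=8):
    """Vector load under predicate mask. Returns (loaded values, element steps)."""
    loaded = []
    element_steps = 0
    for i in range(vl):
        element_steps += 1                  # naive hardware visits every element
        if (mask >> i) & 1:
            loaded.append(mem[base + i * width])
    return loaded, element_steps


def scalar_indexed_load(mem, base, byte_offset):
    """Single load at base + offset. Returns (value, element steps)."""
    return mem[base + byte_offset], 1


mem = {0x1000 + i * 8: 100 + i for i in range(8)}   # eight 64-bit elements
r3 = 5

vec_result, vec_steps = predicated_vector_load(mem, 0x1000, vl=8, mask=1 << r3)
scalar_result, scalar_steps = scalar_indexed_load(mem, 0x1000, r3 * 8)

print(vec_result, vec_steps)
print(scalar_result, scalar_steps)
```

both forms fetch the same element, but the one-hot predicate version costs VL element steps on hardware that does not rewrite it.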
Comment 5 Jacob Lifshay 2023-05-08 23:50:34 BST
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > (In reply to Jacob Lifshay from comment #2)
> > > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > > 
> > > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > > > 
> > > > > these become harder as the encoding space is only 6 bits (and there
> > > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > > of EXTRA
> > > > 
> > > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> > > 
> > > please define VINDEX 
> > 
> > sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> > name for it.
> 
> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
> 
> imho it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.
> 
> extractelement is only really useful when extracting from a vector already
> in registers, since you can't always just add to the address for that.

oh, and insert or insertelement for the other way: https://llvm.org/docs/LangRef.html#insertelement-instruction
Comment 6 Luke Kenneth Casson Leighton 2023-05-14 17:05:54 BST
(In reply to Jacob Lifshay from comment #4)

> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
> 
> imho

(please do drop that, it's an affectation that gets tiring. we don't need
to know that your opinion is "humble" - here we just need to know what your
[valued] insights are, as third-person-objective constructive input.
also please trim)

> it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.

ta-daaa, now you're getting it.  and that's an optimisation that would
be performed by hardware that chose to implement micro-coding (which does
*not* mean "like intel does it", it just means "some form of rewriting"
rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)
Comment 7 Jacob Lifshay 2023-05-14 17:26:09 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #4)
> > it may be more efficient to simply add r3 to the load address and
> > perform a scalar load (optionally SVP64 prefixed) rather than setting
> > sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> > load ops for only one of them to succeed.
> 
> ta-daaa, now you're getting it.  and that's an optimisation that would
> be performed by hardware that chose to implement micro-coding (which does
> *not* mean "like intel does it", it just means "some form of rewriting"
> rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)

umm, you seem to have missed my point which is that programmers should write a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4, 0(r5) since simple cpus won't perform that optimization since that's more complex to do.
Comment 8 Jacob Lifshay 2023-05-14 17:27:59 BST
(In reply to Jacob Lifshay from comment #7)
> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(r5) since simple cpus won't perform that optimization since that's more
> complex to do.

whoops, that should have been sv.ld/sm=1<<r3 r4, 0(*r5)
Comment 9 Luke Kenneth Casson Leighton 2023-05-27 13:39:26 BST
(In reply to Jacob Lifshay from comment #7)

> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(*r5) since simple cpus won't perform that optimization since that's more
> complex to do.

blech, costs an extra register (RB=r3) but it is the same thing... or is it?
ermermerm... oh! it isn't! not quite - it's a multiply/shift on r3.  and
needs a vector source.

no you can use /els then the immediate becomes a multiplier... let me
check, i can never remember

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

and then:

        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST

and... oh hang on if you really want r3 as an index, you can do
element-strided on RB:

        if els and not RA.isvec and not RB.isvec:
            svctx.ldstmode = elementstride

        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided


so the syntax for that is:

   sv.ldx/els  *RT, RA, RB  # yes, just scalar on RB.
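
the pseudocode excerpts above can be put together into a runnable sketch. it is only a sketch: the excerpt does not show what happens when els is set and the immediate is zero, so that case is returned as "unspecified" here, and the immediate-form EA is assumed to be srcbase + offs. function names are hypothetical.

```python
def ldst_imm_mode(ra_isvec, els, immediate):
    """Mode selection for the immediate form, following the quoted excerpt."""
    if ra_isvec:
        return "indexed"
    if not els:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    return "unspecified"        # not covered by the excerpt (assumption)


def ldst_idx_is_elementstride(els, ra_isvec, rb_isvec):
    """Indexed form: element-strided only when els is set and RA, RB are scalar."""
    return bool(els) and not ra_isvec and not rb_isvec


def elementstride_imm_ea(ra_val, immed, i):
    """Immediate form: the immediate acts as a per-element multiplier."""
    return ra_val + i * immed


def elementstride_idx_ea(ra_val, rb_val, j):
    """Indexed form: scalar RB acts as the per-element stride."""
    return ra_val + rb_val * j


# sv.ldx/els *RT, RA, RB with RA=0x1000, RB=24, VL=4
if ldst_idx_is_elementstride(els=True, ra_isvec=False, rb_isvec=False):
    addrs = [elementstride_idx_ea(0x1000, 24, j) for j in range(4)]
    print([hex(a) for a in addrs])
```

so with scalar RA and scalar RB each element is fetched from RA + RB*j, which gives register-controlled striding without needing a vector of offsets.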