offsets by one are only possible with EXTRA3 (or scalar registers).
some thought is needed on how to turn several LDST instructions
from EXTRA2 to EXTRA3. see SV CSVs:

https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isatables;hb=HEAD

stwu,LDST_IMM,,2P,EXTRA2,EN,d:RA,s:RS,s:RA,0,RA_OR_ZERO,0,RS,0,0,0,RA

this is easy, make RA-src same as RA-dest

lwzu,LDST_IMM,,2P,EXTRA2,EN,d:RT,d:RA,s:RA,0,RA_OR_ZERO,0,0,RT,0,0,RA

likewise.

lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA

these become harder as the encoding space is only 6 bits (and there
are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
of EXTRA

Field Name     Field bits  Description
Rdest_EXTRA2   10:11       extends Rdest (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2   12:13       extends Rsrc1 (R*_EXTRA2 Encoding)
Rsrc2_EXTRA2   14:15       extends Rsrc2 (R*_EXTRA2 Encoding)
MASK_SRC       16:18       Execution Mask for Source

Field Name     Field bits  Description
Rdest_EXTRA2   10:11       extends Rdest (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2   12:13       extends Rsrc1 (R*_EXTRA2 Encoding)
Rdest2_EXTRA2  14:15       extends Rdest2 (R*_EXTRA2 Encoding)
MASK_SRC       16:18       Execution Mask for Source

some analysis of options and consequences is needed.
(In reply to Luke Kenneth Casson Leighton from comment #0)
> lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
>
> these become harder as the encoding space is only 6 bits (and there
> are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> of EXTRA

this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER

> MASK_SRC 16:18 Execution Mask for Source

so has to stay. that leaves just 6 bits to cover 3 registers.

here's the bits of RM:

Field Name   Field bits  Description
MASKMODE     0           Execution (predication) Mask Kind
MASK         1:3         Execution Mask
SUBVL        8:9         Sub-vector length
ELWIDTH      4:5         Element Width
ELWIDTH_SRC  6:7         Element Width for Source
EXTRA        10:18       Register Extra encoding
MODE         19:23       changes Vector behaviour

can't lose mask. can't lose SUBVL (priority for Pack/Unpack, already
discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
that have?

* Vector of RB offsets could no longer be compressed
* SEA becomes pointless

could ELWIDTH instead be considered, and the operation width
(ld lw lh lb) be used in its place?

* yes as long as losing saturation and sign-extending is ok.
  (setting a larger ELWIDTH than the operation is a way to do
  zero or sign extending without needing intermediary registers
  to perform the extsb/h/w. losing ELWIDTH would require the
  extra instruction and registers).
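The RM table in comment #1 and the twin-predicated EXTRA2 layout in comment #0 can be sketched as a small field extractor, which also reproduces the "just 6 bits" count. This is only an illustration: it assumes RM is 24 bits wide with MSB0 numbering (bit 0 is the most significant bit), and the names get_field, decode_rm and register_extra_bits are made up for this sketch, not taken from the openpower-isa sources.

```python
# Field positions copied from the RM table, as (first_bit, last_bit).
# ASSUMPTION: MSB0 bit numbering over a 24-bit RM (bit 0 = most significant).
RM_WIDTH = 24
RM_FIELDS = {
    "MASKMODE": (0, 0),
    "MASK": (1, 3),
    "ELWIDTH": (4, 5),
    "ELWIDTH_SRC": (6, 7),
    "SUBVL": (8, 9),
    "EXTRA": (10, 18),
    "MODE": (19, 23),
}
# Sub-fields of EXTRA for the twin-predicated (2P) EXTRA2 layout.
EXTRA2_2P_FIELDS = {
    "Rdest_EXTRA2": (10, 11),
    "Rsrc1_EXTRA2": (12, 13),
    "Rsrc2_EXTRA2": (14, 15),
    "MASK_SRC": (16, 18),
}


def get_field(rm, first, last, width=RM_WIDTH):
    """Extract bits first..last (inclusive, MSB0 numbering) from rm."""
    nbits = last - first + 1
    shift = width - 1 - last
    return (rm >> shift) & ((1 << nbits) - 1)


def decode_rm(rm):
    """Return every named field (hypothetical helper, illustration only)."""
    fields = dict(RM_FIELDS)
    fields.update(EXTRA2_2P_FIELDS)
    return {name: get_field(rm, f, l) for name, (f, l) in fields.items()}


def register_extra_bits():
    """EXTRA bits left for register extension once MASK_SRC is taken out."""
    extra_first, extra_last = RM_FIELDS["EXTRA"]
    src_first, src_last = EXTRA2_2P_FIELDS["MASK_SRC"]
    return (extra_last - extra_first + 1) - (src_last - src_first + 1)
```

With these definitions register_extra_bits() gives 6, i.e. 2 bits each for the three registers, which is the constraint being discussed.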
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
>
> > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> >
> > these become harder as the encoding space is only 6 bits (and there
> > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > of EXTRA
>
> this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER

please define VINDEX -- it is non-standard terminology -- do you mean
load/store with index remap? that's basically gather/scatter but done
using a different mechanism.

actually, assuming the above definition of VINDEX, none of
splat/gather/scatter (also includes VINDEX since that's basically
gather/scatter) need more than one predicate. They work just fine on
other ISAs with at most one predicate (e.g. RVV and AVX2/AVX512 all have
separate splat/scatter/gather/compress/expand instructions that only
have 1 predicate). The only load/store ops that need more than one
predicate are compress/expand load/store (since they are only
expressible by twin-predication in SVP64 since there are no dedicated
compress/expand instructions or SVP64 MODEs), which can easily be done
using ld/std (and maybe the *u or *x versions, but not both) instead of
ldux/stdux.

iirc the plan was originally to have twin-predication only on
1-in/1-out operations, which ldux/stdux clearly are not.

>
> > MASK_SRC 16:18 Execution Mask for Source
>
> so has to stay. that leaves just 6 bits to cover 3 registers.
>
> here's the bits of RM:
>
> Field Name   Field bits  Description
> MASKMODE     0           Execution (predication) Mask Kind
> MASK         1:3         Execution Mask
> SUBVL        8:9         Sub-vector length
> ELWIDTH      4:5         Element Width
> ELWIDTH_SRC  6:7         Element Width for Source
> EXTRA        10:18       Register Extra encoding
> MODE         19:23       changes Vector behaviour
>
> can't lose mask. can't lose SUBVL (priority for Pack/Unpack, already
> discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
> that have?
>
> * Vector of RB offsets could no longer be compressed
> * SEA becomes pointless
>
> could ELWIDTH instead be considered, and the operation width
> (ld lw lh lb) be used in its place?
>
> * yes as long as losing saturation and sign-extending is ok.

simple -- just set ELWIDTH larger than the load op and the load op
intrinsically will do the sign/zero extend, no need for SVP64 to add
sign/zero extension on top of that. (with the sole exception of signed
bytes, thanks PowerISA for being non-orthogonal)

saturation can still be done -- saturating from the load's type to the
dest type (ELWIDTH + saturation's unsigned/signed bit).

so this removes any need for ELWIDTH_SRC on any load/store ops afaict.
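A rough model of the point made in comment #2, that the load op's own width and signedness can stand in for ELWIDTH_SRC: the load interprets the memory value at its own width, and the result is then either extended or saturated into the destination element width. The helper names and the clamp-style saturation are assumptions made for illustration, not a statement of the exact SVP64 semantics.

```python
def load_extend(raw, op_bits, signed):
    """Interpret the low op_bits of raw the way the load op itself would:
    zero-extending (e.g. lhz) or sign-extending (e.g. lha)."""
    raw &= (1 << op_bits) - 1
    if signed and raw >> (op_bits - 1):
        raw -= 1 << op_bits
    return raw


def saturate(value, elwidth_bits, signed):
    """Clamp value into the range of the destination element width.
    ASSUMPTION: plain clamping to the signed/unsigned range."""
    if signed:
        lo = -(1 << (elwidth_bits - 1))
        hi = (1 << (elwidth_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << elwidth_bits) - 1
    return max(lo, min(hi, value))


def to_dest_bits(value, elwidth_bits):
    """Raw bit pattern as it would sit in a destination element."""
    return value & ((1 << elwidth_bits) - 1)


# 16-bit signed load into a 32-bit element: the load op does the extend.
widened = to_dest_bits(load_extend(0xFFFF, 16, signed=True), 32)

# 32-bit unsigned load saturated into a 16-bit unsigned element.
narrowed = saturate(load_extend(0x12345, 32, signed=False), 16, signed=False)
```

In this model no separate source width is consulted anywhere: only the load's width and signedness plus the destination ELWIDTH are needed, which is the claim being made.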
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > (In reply to Luke Kenneth Casson Leighton from comment #0)
> >
> > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > >
> > > these become harder as the encoding space is only 6 bits (and there
> > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > of EXTRA
> >
> > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
>
> please define VINDEX

sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
name for it.

> -- it is non-standard terminology -- do you mean
> load/store with index remap?

no. i would have said Indexed REMAP.

> predicate). The only load/store ops that need more than one predicate are
> compress/expand load/store

i.e. all of them (as far as the actual scalar ld/sts are concerned)

> iirc the plan was originally to have twin-predication only on 1-in/1-out
> operations, which ldux/stdux clearly are not.

the address (EA) is considered to be "1" in this case.

> > could ELWIDTH instead be considered, and the operation width
> > (ld lw lh lb) be used in its place?
> >
> > * yes as long as losing saturation and sign-extending is ok.
>
> simple -- just set ELWIDTH larger than the load op and the load op
> intrinsically will do the sign/zero extend, no need for SVP64 to add
> sign/zero extension on top of that. (with the sole exception of signed
> bytes, thanks PowerISA for being non-orthogonal)

deep joy. and it isn't _particularly_ useful to do shorter
(load then truncate, that's just dumb).

> saturation can still be done -- saturating from the load's type to the dest
> type (ELWIDTH + saturation's unsigned/signed bit).
>
> so this removes any need for ELWIDTH_SRC on any load/store ops afaict.

okaay. now we are cooking with gas.

next stage, given two free bits, is to work out what regs can
be expanded from EXTRA2 to EXTRA3.

* lwzux RT,RA,RB
  if vectorised and used for list-pointer-chaining, it is RT and RA
  that must be allowed to be one-different. RB, because it is not
  updated, need not be EXTRA3.
* stdux RS,RA,RB
  likewise.

aieee this is going to be fun.
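The bit budget behind "given two free bits" can be checked with simple arithmetic. The sketch below only counts bits: it assumes an EXTRA3 extension costs 3 bits per register (as the name suggests) and says nothing about where the reclaimed ELWIDTH_SRC bits would actually sit in the encoding. All names are hypothetical.

```python
EXTRA_FIELD_BITS = 9   # RM bits 10:18
MASK_SRC_BITS = 3      # RM bits 16:18, needed for twin-predication
ELWIDTH_SRC_BITS = 2   # RM bits 6:7, candidate for reuse
# ASSUMPTION: EXTRA3 costs 3 bits per register, EXTRA2 costs 2.
BITS_PER_REG = {"EXTRA2": 2, "EXTRA3": 3}


def bits_needed(assignment):
    """Total extension bits for a {register: 'EXTRA2'|'EXTRA3'} mapping."""
    return sum(BITS_PER_REG[kind] for kind in assignment.values())


def available_bits(reclaim_elwidth_src):
    """Bits usable for register extension, optionally reusing ELWIDTH_SRC."""
    bits = EXTRA_FIELD_BITS - MASK_SRC_BITS
    if reclaim_elwidth_src:
        bits += ELWIDTH_SRC_BITS
    return bits


def fits(assignment, reclaim_elwidth_src):
    return bits_needed(assignment) <= available_bits(reclaim_elwidth_src)


# lwzux RT,RA,RB as it stands today, and with RT and RA widened to EXTRA3.
CURRENT = {"RT": "EXTRA2", "RA": "EXTRA2", "RB": "EXTRA2"}
PROPOSED = {"RT": "EXTRA3", "RA": "EXTRA3", "RB": "EXTRA2"}
ALL_EXTRA3 = {"RT": "EXTRA3", "RA": "EXTRA3", "RB": "EXTRA3"}
```

Under these assumptions PROPOSED needs 3+3+2 = 8 bits, which fits only if the two ELWIDTH_SRC bits are reclaimed, while ALL_EXTRA3 (9 bits) does not fit either way. That is consistent with leaving RB at EXTRA2.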
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > >
> > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > >
> > > > these become harder as the encoding space is only 6 bits (and there
> > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > of EXTRA
> > >
> > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> >
> > please define VINDEX
>
> sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> name for it.

the standard name is extractelement or extract
https://llvm.org/docs/LangRef.html#extractelement-instruction

imho it may be more efficient to simply add r3 to the load address and
perform a scalar load (optionally SVP64 prefixed) rather than setting
sm=1<<r3, since that's much simpler and simple hardware then won't issue
VL load ops for only one of them to succeed.

extractelement is only really useful when extracting from a vector
already in registers, since you can't always just add to the address
for that.
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > (In reply to Jacob Lifshay from comment #2)
> > > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > >
> > > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > > >
> > > > > these become harder as the encoding space is only 6 bits (and there
> > > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > > of EXTRA
> > > >
> > > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> > >
> > > please define VINDEX
> >
> > sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> > name for it.
>
> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
>
> imho it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.
>
> extractelement is only really useful when extracting from a vector already
> in registers, since you can't always just add to the address for that.

oh, and insert or insertelement for the other way:
https://llvm.org/docs/LangRef.html#insertelement-instruction
(In reply to Jacob Lifshay from comment #4)
> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
>
> imho

(please do drop that, it's an affectation that gets tiring.
 we don't need to know that your opinion is "humble" - here
 we just need to know what your [valued] insights are, as
 third-person-objective constructive input. also please trim)

> it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.

ta-daaa, now you're getting it. and that's an optimisation that would
be performed by hardware that chose to implement micro-coding (which does
*not* mean "like intel does it", it just means "some form of rewriting"
rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #4)
> > it may be more efficient to simply add r3 to the load address and
> > perform a scalar load (optionally SVP64 prefixed) rather than setting
> > sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> > load ops for only one of them to succeed.
>
> ta-daaa, now you're getting it. and that's an optimisation that would
> be performed by hardware that chose to implement micro-coding (which does
> *not* mean "like intel does it", it just means "some form of rewriting"
> rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)

umm, you seem to have missed my point which is that programmers should
write a scalar load instruction (sv.ldx r4, r5, r3) rather than
sv.ld/sm=1<<r3 r4, 0(r5) since simple cpus won't perform that
optimization since that's more complex to do.
(In reply to Jacob Lifshay from comment #7)
> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(r5) since simple cpus won't perform that optimization since that's more
> complex to do.

whoops, that should have been sv.ld/sm=1<<r3 r4, 0(*r5)
(In reply to Jacob Lifshay from comment #7)
> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(*r5) since simple cpus won't perform that optimization since that's more
> complex to do.

blech, costs an extra register (RB=r3) but it is the same thing...
or is it? ermermerm... oh! it isn't! not quite - it's a multiply/shift
on r3. and needs a vector source.

no you can use /els then the immediate becomes a multiplier...
let me check, i can never remember

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

and then:

    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed  # j*immed for a ST

and... oh hang on if you really want r3 as an index, you can do
element-strided on RB:

    if els and not RA.isvec and not RB.isvec:
        svctx.ldstmode = elementstride

    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j  # register-strided

so the syntax for that is:

    sv.ldx/els *RT, RA, RB  # yes, just scalar on RB.
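For reference, the pseudocode recalled in comment #9 can be turned into a small runnable model of the two element-strided address calculations. This is a sketch built only from what is quoted there (which the comment itself says is from memory), not from the specification text. The function names are invented, ireg is modelled as a plain dict, and the case of els set with a zero immediate is left as "unspecified" because the quoted pseudocode does not cover it.

```python
def imm_ldst_mode(ra_isvec, els, immediate):
    """Mode selection for the immediate (D-form) case, as recalled above."""
    if ra_isvec:
        return "indexed"
    elif els == 0:
        return "unitstride"
    elif immediate != 0:
        return "elementstride"
    return "unspecified"  # not covered by the quoted pseudocode


def imm_elementstride_eas(ireg, ra, immed, vl):
    """sv.ld/els with scalar RA: the immediate acts as a multiplier."""
    srcbase = ireg[ra]
    return [srcbase + i * immed for i in range(vl)]


def reg_elementstride_eas(ireg, ra, rb, vl):
    """sv.ldx/els with scalar RA and scalar RB: register-strided,
    EA = ireg[RA] + ireg[RB]*j for element j."""
    return [ireg[ra] + ireg[rb] * j for j in range(vl)]


# Hypothetical register contents: r5 holds a base address, r3 a stride.
regs = {5: 0x1000, 3: 8}
eas = reg_elementstride_eas(regs, ra=5, rb=3, vl=4)
```

Here eas comes out as 0x1000, 0x1008, 0x1010, 0x1018: one scalar RB value acts as a per-element stride, which is why no vector source is needed for RB in this mode.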