we need indexed move to have *defined* behavior, imho out-of-bounds being undefined behavior is unacceptable: https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap.mdwn;h=f553aaf2e219a8d0851870b30e443b1d517d0d22;hb=HEAD#l194 > Additionally, > no register used as an Index may exceed MAXVL. > > Failure to observe > these conditions results in `UNDEFINED` behaviour. Also, for consistency, we need the check to be against VL, not MAXVL. I suggest all out-of-bound writes are ignored, and all out-of-bound reads give zero. This matches with the dynamic shuffle instruction in wasm ("i8x16.swizzle"), which has the read bytes be dynamically selected, so it only depends on out-of-bound reads being zero: https://webassembly.github.io/spec/core/exec/instructions.html#mathsf-i8x16-xref-syntax-instructions-syntax-instr-vec-mathsf-swizzle Alternative suggestions could be to trap on out of bound writes with zeros on reads (also matches wasm since wasm only uses dynamic read indexes for registers), or to trap on all out-of-bounds accesses (terrible for wasm performance).
(In reply to Jacob Lifshay from comment #0) > we need indexed move to have *defined* behavior, imho out-of-bounds being > undefined behavior is unacceptable: briefly, more tomorrow i lost a reply: yes it is. the cost in hardware is too great. Indexed REMAP goes literally in the Issue Phase. any gates added there can damage performance. i specifically designed it so that the implementor may cache the Indices then completely ignore normal Register Hazards, safe that the programmer knows what to expect and what not to do. wasm swizxle, good for them, not relevant. swizzle is static, this is dynamic. pwrfectly valid for programmer to set very first Index to MAXVL-1. loop goes round, last one has VL<MAXVL, urk, the damn thing breaks expectations, the indices get silently maxed?? no. Traps are the worst thing to do as well, those require Shadow Matrix Entries. a hot-loop triggers Traps? no. this is a high cost area that has to prioritise Hardware expediency over "niceness of the API". sometimes it is just too costly to do "nice" it is very late here: a useful exercise which you could do would be to thimk through how to mitigate VL being set to a value less than the programmer expects the Indices to contain, perhaps by creating a predicate mask from sv.cmp on VL against the indices. such thought experiments would form the basis of application notes as well as ISA WG justification
(In reply to Luke Kenneth Casson Leighton from comment #1) > (In reply to Jacob Lifshay from comment #0) > > we need indexed move to have *defined* behavior, imho out-of-bounds being > > undefined behavior is unacceptable: > > briefly, more tomorrow i lost a reply: yes it is. > the cost in hardware is too great. > > Indexed REMAP goes literally in the Issue Phase. > any gates added there can damage performance. you don't have to do the comparisons in instruction issue... for reads, they can be done in the alu or register read or operand forwarding stages, all you need is to and the value with 0/-1 as appropriate for the index<VL comparison. for writes, they can be merged into the predicate mask at any stage before forwarding outputs to succeeding operations or writing to output registers. also, even if you did do the comparisons in instruction issue, the comparisons can be done simultaneously to instruction decode/issue, the results aren't needed to decode the instruction. the comparison results can be ignored if it's not an indexed op, so the comparison logic can be built to assume it is an indexed op and the invalid results will be ignored when the assumption is wrong. > i specifically designed it so that the implementor > may cache the Indices then completely ignore normal > Register Hazards, safe that the programmer knows > what to expect and what not to do. > > wasm swizxle, good for them, not relevant. swizzle > is static, this is dynamic. you missed what I stated, wasm's i8x16.swizzle *is* a *dynamic* shuffle and requires out-of-bounds accesses to return zero. i8x16.swizzle is how they spell dynamic 8-bit element-sized shuffle. they're very likely to not be the only place dynamic shuffles are required to have defined behavior... > pwrfectly valid for programmer to set very first > Index to MAXVL-1. loop goes round, last one has > VL<MAXVL, urk, the damn thing breaks expectations, > the indices get silently maxed?? > > no. no, the out-of-bound index change the register read to explicitly return zero. > > Traps are the worst thing to do as well, those require > Shadow Matrix Entries. a hot-loop triggers Traps? yes, hence why I left that as an alternative rather than my preferred choice.
the (In reply to Jacob Lifshay from comment #2) > (In reply to Luke Kenneth Casson Leighton from comment #1) > you don't have to do the comparisons in instruction issue... then that forces a Hazard Dependency which is exactly what is to be avoided. > for reads, they can be done in the alu or register read or operand > forwarding stages, no chance. i already said that that goes into a Shadow Cancellation Matrix > all you need is to and the value with 0/-1 as appropriate > for the index<VL comparison. "all"... going into an unnecesaary Shadoe Matrix. > for writes, they can be merged into the predicate mask at any stage before > forwarding outputs to succeeding operations or writing to output registers. predicate masks require Shadow Matrices. you are asking for ywt another Shadow Matrix to be added. the answer is, has been said once already, and is now onto the second repetition: no. > also, even if you did do the comparisons in instruction issue, the > comparisons can be done simultaneously to instruction decode/issue, second reptition: too coatly. i said no to this once already. tbe answer doea not change: no > the > results aren't needed to decode the instruction. the comparison results can > be ignored if it's not an indexed op, so the comparison logic can be built > to assume it is an indexed op and the invalid results will be ignored when > the assumption is wrong. that is the definition of requiring a Shadow Matrix. > > i specifically designed it so that the implementor > > may cache the Indices then completely ignore normal > > Register Hazards, safe that the programmer knows > > what to expect and what not to do. > > > > wasm swizxle, good for them, not relevant. swizzle > > is static, this is dynamic. > > you missed what I stated, wasm's i8x16.swizzle *is* a *dynamic* shuffle and > requires out-of-bounds accesses to return zero. that's great for them. it can be done with a sv.cmp creating a predicate mask. > no, the out-of-bound index change the register read to explicitly return > zero. not happening. > > > > Traps are the worst thing to do as well, those require > > Shadow Matrix Entries. a hot-loop triggers Traps? > > yes, hence why I left that as an alternative rather than my preferred choice. issue is where the hardware for loop lives. it interacts directly with the Register Hazard Allocation, even as far back as Decode. the only reason to be able to get away with the hardware forloops prior to Indexing bwing added was because they are 100% deterministic and rely on SPRs not regs (change MAXVL etc and it is normal to have some latency). IBM's hardware engineers will grumble about that but it is reasonably explainable and reasonable to have 8-way multi-issue Deterministic computation of QTY 8 sets of Register Hazards at once (even with Matrix REMAP). throw in Indexed REMAP where it becomes necessary to look up EIGHT simultaneous Indices to create 8 simultaneous Register Hazards and they are going to get very concerned until the rules "No Hazards" on the Indices are explained, at which point i expect them to calm doen to then try adding comparisons into that either in Decode or Issue they are going to blow a fuse and with good reason. the answer is no, and will remain no. a useful task that you could do is create a small substitution table cipher (a toy one) e.g. a substitution table of 16 entries and show how sv.cmp would work. if you are ok to do that or come uo with other algorithms that fit Indexed REMAP *as designed* i will keep the bugreport open, otherwise i will close it as invalid.
(In reply to Luke Kenneth Casson Leighton from comment #3) > the (In reply to Jacob Lifshay from comment #2) > > (In reply to Luke Kenneth Casson Leighton from comment #1) > > > you don't have to do the comparisons in instruction issue... > > then that forces a Hazard Dependency which is exactly what is to be > avoided. > > > for reads, they can be done in the alu or register read or operand > > forwarding stages, > > no chance. > > i already said that that goes into a Shadow Cancellation > Matrix Trying to use shadow cancellation at all for predicting the shuffle or predicate for dynamic shuffles is imho a bad idea, they're dynamic specifically because they're likely change all the time. The huge number of pipeline flushes caused by mispredictions would kill performance (think 10x as slow). The solution is to instead have the output be always produced, and if that predicate bit is zero, mux in the old output element. This is how predication normally works on GPUs or CPUs (with the optimization of using the register write-enable to do the muxing for some microarchitectures). Likewise for inputs, the dynamic shuffle waits till all input elements are available, and only then, reads them based on the indexes. if there's an out-of-bound input, all you need is to replace whatever junk you read with a zero, you don't need to respect dependencies for that since it'll just be zeroed.
(In reply to Jacob Lifshay from comment #4) > Trying to use shadow cancellation at all for predicting the shuffle or > predicate for dynamic shuffles is imho a bad idea, they're dynamic > specifically because they're likely change all the time. The huge number of > pipeline flushes caused by mispredictions would kill performance (think 10x > as slow). dynamic shuffle isn't expected to be super fast (unlike static shuffles), it *is* expected to be faster than a pipeline flush due to predicate/shuffle misprediction every other element!
(In reply to Jacob Lifshay from comment #4) > Likewise for inputs, the dynamic shuffle waits till all input elements are > available, and only then, reads them based on the indexes. if there's an > out-of-bound input, all you need is to replace whatever junk you read with a > zero, you don't need to respect dependencies for that since it'll just be > zeroed. i said no. repeating the same thing does not change the answer "no", which has now been repeated three times. that's the limit of the number of times i'm prepared to spend saying "no". as you did not respond to what i said would be a useful task, which is to write some assembler under the existing conditions, i'm assuming that you do not wish to help out here. that's enough to warrant closing this bugreport as invalid.
vsx has precedent for zeroing elements with out-of-bounds indexes: the xxpermx instruction basically does: v <- VA || VB for i in 0 to 15 if VC.byte[i].bit[0:2] != UIM then VT.byte[i] <- 0 else VT.byte[i] <- v.byte[VC.byte[i].bit[3:7]]
(In reply to Jacob Lifshay from comment #7) > vsx has precedent for zeroing elements with out-of-bounds indexes: the > xxpermx instruction basically does: we are not copying the behaviour of VSX which has a fixed power2 size and hardcoded register sizes.