Bug 908 - indexed remap needs defined behavior for out-of-bounds indexes
Summary: indexed remap needs defined behavior for out-of-bounds indexes
Status: RESOLVED INVALID
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- major
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2022-08-14 23:17 BST by Jacob Lifshay
Modified: 2022-09-29 20:50 BST (History)
1 user (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jacob Lifshay 2022-08-14 23:17:19 BST
we need indexed move to have *defined* behavior, imho out-of-bounds being undefined behavior is unacceptable:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap.mdwn;h=f553aaf2e219a8d0851870b30e443b1d517d0d22;hb=HEAD#l194
> Additionally,
> no register used as an Index may exceed MAXVL.
> 
> Failure to observe
> these conditions results in `UNDEFINED` behaviour.

Also, for consistency, we need the check to be against VL, not MAXVL.

I suggest all out-of-bound writes are ignored, and all out-of-bound reads give zero.
This matches with the dynamic shuffle instruction in wasm ("i8x16.swizzle"), which has the read bytes be dynamically selected, so it only depends on out-of-bound reads being zero:
https://webassembly.github.io/spec/core/exec/instructions.html#mathsf-i8x16-xref-syntax-instructions-syntax-instr-vec-mathsf-swizzle

Alternative suggestions could be to trap on out of bound writes with zeros on reads (also matches wasm since wasm only uses dynamic read indexes for registers), or to trap on all out-of-bounds accesses (terrible for wasm performance).
Comment 1 Luke Kenneth Casson Leighton 2022-08-15 00:15:25 BST
(In reply to Jacob Lifshay from comment #0)
> we need indexed move to have *defined* behavior, imho out-of-bounds being
> undefined behavior is unacceptable:

briefly, more tomorrow i lost a reply: yes it is.
the cost in hardware is too great.

Indexed REMAP goes literally in the Issue Phase.
any gates added there can damage performance.
i specifically designed it so that the implementor
may cache the Indices then completely ignore normal
Register Hazards, safe that the programmer knows
what to expect and what not to do.

wasm swizxle, good for them, not relevant. swizzle
is static, this is dynamic.

pwrfectly valid for programmer to set very first
Index to MAXVL-1.  loop goes round, last one has
VL<MAXVL, urk, the damn thing breaks expectations,
the indices get silently maxed??

no.

Traps are the worst thing to do as well, those require
Shadow Matrix Entries.  a hot-loop triggers Traps?

no.

this is a high cost area that has to prioritise
Hardware expediency over "niceness of the API".
sometimes it is just too costly to do "nice"

it is very late here: a useful exercise which you
could do would be to thimk through how to mitigate
VL being set to a value less than the programmer
expects the Indices to contain, perhaps by creating
a predicate mask from sv.cmp on VL against the
indices.

such thought experiments would form the basis of
application notes as well as ISA WG justification
Comment 2 Jacob Lifshay 2022-08-15 00:50:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
> > we need indexed move to have *defined* behavior, imho out-of-bounds being
> > undefined behavior is unacceptable:
> 
> briefly, more tomorrow i lost a reply: yes it is.
> the cost in hardware is too great.
> 
> Indexed REMAP goes literally in the Issue Phase.
> any gates added there can damage performance.

you don't have to do the comparisons in instruction issue...
for reads, they can be done in the alu or register read or operand forwarding stages, all you need is to and the value with 0/-1 as appropriate for the index<VL comparison.
for writes, they can be merged into the predicate mask at any stage before forwarding outputs to succeeding operations or writing to output registers.

also, even if you did do the comparisons in instruction issue, the comparisons can be done simultaneously to instruction decode/issue, the results aren't needed to decode the instruction. the comparison results can be ignored if it's not an indexed op, so the comparison logic can be built to assume it is an indexed op and the invalid results will be ignored when the assumption is wrong.

> i specifically designed it so that the implementor
> may cache the Indices then completely ignore normal
> Register Hazards, safe that the programmer knows
> what to expect and what not to do.
> 
> wasm swizxle, good for them, not relevant. swizzle
> is static, this is dynamic.

you missed what I stated, wasm's i8x16.swizzle *is* a *dynamic* shuffle and requires out-of-bounds accesses to return zero. i8x16.swizzle is how they spell dynamic 8-bit element-sized shuffle.

they're very likely to not be the only place dynamic shuffles are required to have defined behavior...

> pwrfectly valid for programmer to set very first
> Index to MAXVL-1.  loop goes round, last one has
> VL<MAXVL, urk, the damn thing breaks expectations,
> the indices get silently maxed??
> 
> no.

no, the out-of-bound index change the register read to explicitly return zero.

> 
> Traps are the worst thing to do as well, those require
> Shadow Matrix Entries.  a hot-loop triggers Traps?

yes, hence why I left that as an alternative rather than my preferred choice.
Comment 3 Luke Kenneth Casson Leighton 2022-08-15 02:59:09 BST
the (In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)

> you don't have to do the comparisons in instruction issue...

then that forces a Hazard Dependency which is exactly what is to be
avoided.

> for reads, they can be done in the alu or register read or operand
> forwarding stages, 

no chance.

i already said that that goes into a Shadow Cancellation
Matrix

> all you need is to and the value with 0/-1 as appropriate
> for the index<VL comparison.

"all"... going into an unnecesaary Shadoe Matrix.

> for writes, they can be merged into the predicate mask at any stage before
> forwarding outputs to succeeding operations or writing to output registers.

predicate masks require Shadow Matrices.

you are asking for ywt another Shadow Matrix to be added.

the answer is, has been said once already, and is now onto the
second repetition:

no.


 
> also, even if you did do the comparisons in instruction issue, the
> comparisons can be done simultaneously to instruction decode/issue, 

second reptition: too coatly.

i said no to this once already.

tbe answer doea not change: no


> the
> results aren't needed to decode the instruction. the comparison results can
> be ignored if it's not an indexed op, so the comparison logic can be built
> to assume it is an indexed op and the invalid results will be ignored when
> the assumption is wrong.

that is the definition of requiring a Shadow Matrix.
 
> > i specifically designed it so that the implementor
> > may cache the Indices then completely ignore normal
> > Register Hazards, safe that the programmer knows
> > what to expect and what not to do.
> > 
> > wasm swizxle, good for them, not relevant. swizzle
> > is static, this is dynamic.
> 
> you missed what I stated, wasm's i8x16.swizzle *is* a *dynamic* shuffle and
> requires out-of-bounds accesses to return zero.

that's great for them.  it can be done with a sv.cmp creating
a predicate mask.


> no, the out-of-bound index change the register read to explicitly return
> zero.

not happening.

> > 
> > Traps are the worst thing to do as well, those require
> > Shadow Matrix Entries.  a hot-loop triggers Traps?
> 
> yes, hence why I left that as an alternative rather than my preferred choice.

issue is where the hardware for loop lives.  it interacts directly with
the Register Hazard Allocation, even as far back as Decode.

the only reason to be able to get away with the hardware forloops prior to
Indexing bwing added was because they are 100% deterministic and rely on
SPRs not regs (change MAXVL etc and it is normal to have some latency).

IBM's hardware engineers will grumble about that but it is reasonably
explainable and reasonable to have 8-way multi-issue Deterministic
computation of QTY 8 sets of Register Hazards at once (even with Matrix REMAP).

throw in Indexed REMAP where it becomes necessary to look up EIGHT
simultaneous Indices to create 8 simultaneous Register Hazards and
they are going to get very concerned until the rules "No Hazards" on
the Indices are explained, at which point i expect them to calm doen 

to then try adding comparisons into that either in Decode or Issue
they are going to blow a fuse and with good reason.

the answer is no, and will remain no.

a useful task that you could do is create a small substitution table
cipher (a toy one) e.g. a substitution table of 16 entries and show
how sv.cmp would work.

if you are ok to do that or come uo with other algorithms that fit
Indexed REMAP *as designed* i will keep the bugreport open, otherwise
i will close it as invalid.
Comment 4 Jacob Lifshay 2022-08-16 08:01:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> the (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> 
> > you don't have to do the comparisons in instruction issue...
> 
> then that forces a Hazard Dependency which is exactly what is to be
> avoided.
> 
> > for reads, they can be done in the alu or register read or operand
> > forwarding stages, 
> 
> no chance.
> 
> i already said that that goes into a Shadow Cancellation
> Matrix

Trying to use shadow cancellation at all for predicting the shuffle or predicate for dynamic shuffles is imho a bad idea, they're dynamic specifically because they're likely change all the time. The huge number of pipeline flushes caused by mispredictions would kill performance (think 10x as slow).

The solution is to instead have the output be always produced, and if that predicate bit is zero, mux in the old output element. This is how predication normally works on GPUs or CPUs (with the optimization of using the register write-enable to do the muxing for some microarchitectures).

Likewise for inputs, the dynamic shuffle waits till all input elements are available, and only then, reads them based on the indexes. if there's an out-of-bound input, all you need is to replace whatever junk you read with a zero, you don't need to respect dependencies for that since it'll just be zeroed.
Comment 5 Jacob Lifshay 2022-08-16 08:04:22 BST
(In reply to Jacob Lifshay from comment #4)
> Trying to use shadow cancellation at all for predicting the shuffle or
> predicate for dynamic shuffles is imho a bad idea, they're dynamic
> specifically because they're likely change all the time. The huge number of
> pipeline flushes caused by mispredictions would kill performance (think 10x
> as slow).

dynamic shuffle isn't expected to be super fast (unlike static shuffles), it *is* expected to be faster than a pipeline flush due to predicate/shuffle misprediction every other element!
Comment 6 Luke Kenneth Casson Leighton 2022-08-16 10:52:41 BST
(In reply to Jacob Lifshay from comment #4)

> Likewise for inputs, the dynamic shuffle waits till all input elements are
> available, and only then, reads them based on the indexes. if there's an
> out-of-bound input, all you need is to replace whatever junk you read with a
> zero, you don't need to respect dependencies for that since it'll just be
> zeroed.

i said no.  repeating the same thing does not change the answer "no",
which has now been repeated three times.  that's the limit of the number
of times i'm prepared to spend saying "no".

as you did not respond to what i said would be a useful task, which is to
write some assembler under the existing conditions, i'm assuming that you
do not wish to help out here.

that's enough to warrant closing this bugreport as invalid.
Comment 7 Jacob Lifshay 2022-09-29 20:45:58 BST
vsx has precedent for zeroing elements with out-of-bounds indexes: the xxpermx instruction basically does:
v <- VA || VB
for i in 0 to 15
    if VC.byte[i].bit[0:2] != UIM then
        VT.byte[i] <- 0
    else
        VT.byte[i] <- v.byte[VC.byte[i].bit[3:7]]
Comment 8 Luke Kenneth Casson Leighton 2022-09-29 20:50:57 BST
(In reply to Jacob Lifshay from comment #7)
> vsx has precedent for zeroing elements with out-of-bounds indexes: the
> xxpermx instruction basically does:

we are not copying the behaviour of VSX which has a fixed power2 size and
hardcoded register sizes.