Bug 559 - analyse implications of automatic detection of changing VL loop direction
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2020-12-30 03:22 GMT by Luke Kenneth Casson Leighton
Modified: 2020-12-30 19:20 GMT (History)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
Description Luke Kenneth Casson Leighton 2020-12-30 03:22:06 GMT
in bug #558 the idea was proposed to autodetect the direction of overlapping registers in SV loops.
to be discussed
Comment 1 Jacob Lifshay 2020-12-30 03:38:34 GMT
changing direction won't make it act as a parallel vector op (write outputs only after fully reading all inputs) in all cases, since neither incrementing nor decrementing indexes will work here:

vl = 8

add r8.v, r4.v, r12.v

incrementing version expands to:
add r8, r4, r12
add r9, r5, r13
add r10, r6, r14
add r11, r7, r15
add r12, r8, r16 // reads r8, which the first add already overwrote
add r13, r9, r17
add r14, r10, r18
add r15, r11, r19

decrementing version:
add r15, r11, r19
add r14, r10, r18
add r13, r9, r17
add r12, r8, r16
add r11, r7, r15 // reads r15, which the first add already overwrote
add r10, r6, r14
add r9, r5, r13
add r8, r4, r12
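
The two expansions above can be checked with a small simulation. This is only a sketch under stated assumptions: one element per register, no predication, and hypothetical initial register contents (rN holds 100*N). The helper names are made up for illustration and are not part of SV.

```python
def initial_regs():
    # Hypothetical starting state: register rN holds 100*N.
    return {n: 100 * n for n in range(32)}

def sequential_add(dest, src1, src2, vl, reverse=False):
    """Issue vl scalar adds one after another, honouring register hazards."""
    regs = initial_regs()
    order = reversed(range(vl)) if reverse else range(vl)
    for i in order:
        regs[dest + i] = regs[src1 + i] + regs[src2 + i]
    return [regs[dest + i] for i in range(vl)]

def parallel_add(dest, src1, src2, vl):
    """Read every input first, then write every output."""
    regs = initial_regs()
    return [regs[src1 + i] + regs[src2 + i] for i in range(vl)]

def wrong_elements(got, want):
    """Indexes of the elements where the two results differ."""
    return [i for i, (g, w) in enumerate(zip(got, want)) if g != w]

# add r8.v, r4.v, r12.v with vl = 8
want = parallel_add(8, 4, 12, 8)
inc_wrong = wrong_elements(sequential_add(8, 4, 12, 8), want)
dec_wrong = wrong_elements(sequential_add(8, 4, 12, 8, reverse=True), want)
print("incrementing: wrong elements", inc_wrong)
print("decrementing: wrong elements", dec_wrong)
```

Under these assumptions the incrementing loop gets elements 4 to 7 wrong and the decrementing loop gets elements 0 to 3 wrong, so neither direction reproduces the parallel result.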
Comment 2 Luke Kenneth Casson Leighton 2020-12-30 03:54:37 GMT
the original idea was simply to treat the semantics of SV's hardware for-loop concept quite literally: multiple independent instructions are issued and the register hazards are fully respected.

this would, with careful overlap design, result in useful mapreduce patterns under the control of the developer.

up for discussion is the autodetection of direction and inversion of the same in order to explicitly avoid overlap.
Comment 3 Luke Kenneth Casson Leighton 2020-12-30 03:57:34 GMT
(In reply to Jacob Lifshay from comment #1)
> changing direction won't make it act as a parallel vector op (write outputs
> only after fully reading all inputs) in all cases, since neither
> incrementing nor decrementing indexes will work here:

yes, for 1-src 1-dest instructions it makes sense.

likewise for 2-src 1-dest where one source avoids overlap with both the other src and also the dest.

but anything else is hosed.

question is: what to do in each case?
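
The cases above can be sketched as a check on register offsets. It assumes one element per register, no predication, and that each source is read before the destination is written within a single element; `safe_directions` is a hypothetical helper for illustration, not something defined by SV.

```python
def safe_directions(dest, srcs, vl):
    """Return the loop directions in which sequential issue gives the same
    result as a fully parallel vector op (all reads before any write)."""
    safe = {"inc", "dec"}
    for src in srcs:
        offset = dest - src
        if 0 < offset < vl:
            # Incrementing: an earlier element overwrites a register
            # that a later element still has to read.
            safe.discard("inc")
        if -vl < offset < 0:
            # Decrementing: the same problem, mirrored.
            safe.discard("dec")
    return safe

# 1-src 1-dest, dest one register above src: reversing the loop helps.
one_src = safe_directions(5, [4], 8)
# 2-src 1-dest where the second source is far away from everything else.
one_far = safe_directions(8, [4, 32], 8)
# The example from comment 1: dest sits between the two sources.
both = safe_directions(8, [4, 12], 8)
print(one_src, one_far, both)
```

In this model the first two cases leave the decrementing direction usable, while the comment 1 example leaves no safe direction at all.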
Comment 4 Luke Kenneth Casson Leighton 2020-12-30 15:13:06 GMT
the original question was to do with whether gcc should rely on hardware reversing the order of the VL loop so that register allocation need not be concerned about the consequences of using overlapping ranges of registers.

the case where dest overlaps either *or both* of src1 and src2 demonstrates that overlap avoidance is going to be necessary, not just "nice to have".

i'm inclined to close this one as invalid.
Comment 5 Jacob Lifshay 2020-12-30 17:17:32 GMT
(In reply to Luke Kenneth Casson Leighton from comment #4)
> i'm inclined to close this one as invalid.

I'd instead close it as completed, since we did analyze the implications of automatic detection of changing VL loop direction as the title says.
Comment 6 Luke Kenneth Casson Leighton 2020-12-30 17:41:43 GMT
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > i'm inclined to close this one as invalid.
> 
> I'd instead close it as completed, since we did analyze the implications of
> automatic detection of changing VL loop direction as the title says.

good point :)

still for due diligence there are a few things left to cover.

VSLIDE is the usual instruction which moves registers inside a vector up and down.  In SV you simply leave the elements in place and issue an instruction that starts at a different offset.
Comment 7 Jacob Lifshay 2020-12-30 17:57:38 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> VSLIDE is the usual instruction which moves registers inside a vector up and
> down.  In SV you simply leave the elements in place and issue an instruction
> that starts at a different offset.

we still might need vslide, since registers != elements: elements don't have to be 64-bit.

I guess twin predication with a dest_mask of src_mask << slide_by will work, but it'll be 2 or 3 instructions rather than 1. Also, some implementations may be able to implement vslide more efficiently.
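
A rough model of the twin-predication idea, for illustration only. It assumes that twin predication pairs the n-th set bit of the source mask with the n-th set bit of the destination mask, and that source and destination are separate vectors so the overlap question does not arise. `twin_predicated_mv` is a hypothetical name, and the lists stand in for elements rather than registers.

```python
def twin_predicated_mv(src, dest, src_mask, dest_mask, vl):
    """Copy the src elements selected by src_mask into the dest slots
    selected by dest_mask, pairing set bits in ascending order."""
    out = list(dest)
    src_idx = [i for i in range(vl) if (src_mask >> i) & 1]
    dest_idx = [i for i in range(vl) if (dest_mask >> i) & 1]
    for s, d in zip(src_idx, dest_idx):
        out[d] = src[s]
    return out

vl = 8
slide_by = 3
src_mask = (1 << (vl - slide_by)) - 1  # selects elements 0..4
dest_mask = src_mask << slide_by       # selects elements 3..7
src = [10, 11, 12, 13, 14, 15, 16, 17]
dest = [0] * vl
result = twin_predicated_mv(src, dest, src_mask, dest_mask, vl)
print(result)  # [0, 0, 0, 10, 11, 12, 13, 14]
```

Building the two masks is where the extra instructions would go; the move itself is a single twin-predicated operation in this model.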
Comment 8 Alexandre Oliva 2020-12-30 19:20:39 GMT
Indeed, automatic detection and reversal of direction won't do in the general case.

We could still state that the insn operands must be such that there aren't overlaps between inputs and outputs that could lead sequential operation to behave differently from fully parallel operation, leaving those cases reserved (meant to be unused) rather than defined in a way that is at odds with the behavior and expectations of every other vector/simd processor out there.  (hyperbole alert; I don't know them all ;-)