in bug #558 the idea was proposed to autodetect the direction of overlapping registers in SV loops. to be discussed
changing direction won't make it act as a parallel vector op (write outputs only after fully reading all inputs) in all cases, since neither incrementing nor decrementing indexes will work here:

vl = 8
add r8.v, r4.v, r12.v

incrementing version expands to:
add r8, r4, r12
add r9, r5, r13
add r10, r6, r14
add r11, r7, r15
add r12, r8, r16 // r8 reads wrong value
add r13, r9, r17
add r14, r10, r18
add r15, r11, r19

decrementing version:
add r15, r11, r19
add r14, r10, r18
add r13, r9, r17
add r12, r8, r16
add r11, r7, r15 // r15 reads wrong value
add r10, r6, r14
add r9, r5, r13
add r8, r4, r12
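The expansion above can be checked with a small simulation. This is only a sketch: the function names and the initial register values are made up for illustration, and it assumes one element per register with no predication.

```python
def run_sequential(regs, dest, src1, src2, vl, reverse=False):
    """Element-by-element loop: each add sees the writes of earlier adds."""
    regs = list(regs)
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        regs[dest + i] = regs[src1 + i] + regs[src2 + i]
    return regs


def run_parallel(regs, dest, src1, src2, vl):
    """Parallel vector semantics: read every input, then write every output."""
    results = [regs[src1 + i] + regs[src2 + i] for i in range(vl)]
    regs = list(regs)
    regs[dest:dest + vl] = results
    return regs


# Arbitrary distinct initial values (an assumption) so stale reads show up.
regs = [100 * n + 1 for n in range(32)]

# vl = 8; add r8.v, r4.v, r12.v
expected = run_parallel(regs, 8, 4, 12, 8)
inc = run_sequential(regs, 8, 4, 12, 8)
dec = run_sequential(regs, 8, 4, 12, 8, reverse=True)

# Registers in r8..r15 where each ordering disagrees with the parallel result.
inc_wrong = [r for r in range(8, 16) if inc[r] != expected[r]]
dec_wrong = [r for r in range(8, 16) if dec[r] != expected[r]]
print("incrementing differs at:", inc_wrong)
print("decrementing differs at:", dec_wrong)
```

In this model the incrementing loop first goes wrong at r12 (it reads the already-overwritten r8) and the decrementing loop first goes wrong at r11 (it reads the already-overwritten r15), matching the two comments in the listing.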
the original idea was simply to treat the semantics of SV's hardware forloop concept quite literally: multiple independent instructions are issued and the register hazards fully respected. this would, with careful overlap design, result in useful mapreduce patterns under the control of the developer. up for discussion is the autodetection of direction and inversion of the same in order to explicitly avoid overlap.
(In reply to Jacob Lifshay from comment #1)
> changing direction won't make it act as a parallel vector op (write outputs
> only after fully reading all inputs) in all cases, since neither
> incrementing or decrementing indexes will work here:

yes, for 1-src 1-dest instructions it makes sense. likewise for 2-src 1-dest where one source avoids overlap with both the other src and also the dest. but anything else is hosed.

question is: what to do in each case?
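One way to make the case split concrete is a helper that reports which loop directions, if any, give the same result as a parallel vector op. This is a sketch under stated assumptions, not a proposal for what hardware should do: `safe_directions` is a hypothetical name, all operands are vectors of the same length, and there is one element per register with no predication.

```python
def safe_directions(dest, srcs, vl):
    """Return (incrementing_ok, decrementing_ok) for a vector op.

    dest and srcs are starting register numbers; vl is the vector length.
    An incrementing loop reads a stale value when a source starts below the
    destination by less than vl (0 < dest - src < vl); a decrementing loop
    does so when a source starts above it by less than vl.
    A source equal to dest is safe in both directions, because each element
    is read before that same element is written.
    """
    inc_ok = all(not (0 < dest - s < vl) for s in srcs)
    dec_ok = all(not (0 < s - dest < vl) for s in srcs)
    return inc_ok, dec_ok


# 1-src 1-dest: one direction always works.
print(safe_directions(8, [4], 8))      # source below dest
print(safe_directions(8, [12], 8))     # source above dest
# 2-src 1-dest where one source stays clear of dest: still fixable.
print(safe_directions(8, [4, 20], 8))
# The example from comment #1: neither direction works.
print(safe_directions(8, [4, 12], 8))
```

When the helper returns `(False, False)` no reordering of a simple up or down loop reproduces the parallel result, which is the "anything else is hosed" case.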
the original question was to do with whether gcc should rely on hardware reversing the order of the VL loop so that register allocation need not be concerned about the consequences of using overlapping ranges of registers. the case where dest overlaps either *or both* src1 and src2 demonstrates that overlap avoidance is going to be necessary, not just "nice to have". i'm inclined to close this one as invalid.
(In reply to Luke Kenneth Casson Leighton from comment #4)
> i'm inclined to close this one as invalid.

I'd instead close it as completed, since we did analyze the implications of automatic detection of changing VL loop direction as the title says.
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > i'm inclined to close this one as invalid.
>
> I'd instead close it as completed, since we did analyze the implications of
> automatic detection of changing VL loop direction as the title says.

good point :)

still for due diligence there are a few things left to cover.

VSLIDE is the usual instruction which moves registers inside a vector up and down. in SV you simply leave the elements in place and issue an instruction that starts at a different offset.
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > (In reply to Luke Kenneth Casson Leighton from comment #4)
> > > i'm inclined to close this one as invalid.
> >
> > I'd instead close it as completed, since we did analyze the implications of
> > automatic detection of changing VL loop direction as the title says.
>
> good point :)
>
> still for due diligence there are a few things left to cover.
>
> VSLIDE is the usual instruction which moves registers inside a vector up and
> down. SV you simply leave the elements in place and issue an instruction
> that starts at a different offset.

we still might need vslide, since registers != elements: elements don't have to be 64-bit. I guess twin predication with a dest_mask of src_mask << slide_by will work, but it'll be 2 or 3 instructions rather than 1. Also, some implementations may be able to implement vslide more efficiently.
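As a rough illustration of the twin-predication idea, here is a simplified model in which the source and destination element indexes step independently, each skipping its own masked-out elements. The function name and the pairing rule as written are assumptions made for this sketch (it is not taken from the spec), and it uses separate source and destination vectors so that the overlap hazards discussed earlier do not come into play.

```python
def twin_predicated_move(src, dest, src_mask, dest_mask, vl):
    """Copy the n-th enabled source element to the n-th enabled dest element."""
    dest = list(dest)
    i = j = 0
    while True:
        # Skip source elements whose predicate bit is clear.
        while i < vl and not (src_mask >> i) & 1:
            i += 1
        # Skip destination elements whose predicate bit is clear.
        while j < vl and not (dest_mask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        dest[j] = src[i]
        i += 1
        j += 1
    return dest


vl = 8
slide_by = 2
src = [10, 11, 12, 13, 14, 15, 16, 17]
dest = [0] * vl

src_mask = 0b00111111            # elements 0..5 take part
dest_mask = src_mask << slide_by  # same pattern, shifted up by slide_by

slid = twin_predicated_move(src, dest, src_mask, dest_mask, vl)
print(slid)
```

In this model the enabled source elements 0..5 land in destination elements 2..7, which is a slide up by `slide_by`. Setting up the two masks is what accounts for the extra instructions compared with a dedicated vslide.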
Indeed, automatic detection and reversal of direction won't do in the general case. We could still state that the insn operands must be such that there are no overlaps between inputs and outputs that could cause sequential operation to behave differently from fully parallel operation, leaving those cases reserved (meant to be unused) rather than defined in a way that is at odds with the behavior and expectations of every other vector/simd processor out there. (hyperbole alert; I don't know them all ;-)