See: http://bugs.libre-riscv.org/show_bug.cgi?id=186#c16 and
http://bugs.libre-riscv.org/show_bug.cgi?id=186#c14

(edit: add motivation)

It could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops (due to the compiler not needing to copy the scalar inputs out of the way first).
this is a non-bug, jacob, because the SV for-loop sends individual instructions to the DMs, as if that for-loop (loop2) had been macro-unrolled in the assembly code as scalar instructions. those scalar instructions are *going* to have the correct read and write hazards established (by the DMs)... *just as they would if they were a sequence of scalar instructions*.

consequently, as this behaviour of SV is well-defined and not subject to change, this is a non-bug. we don't have a non-bug category; the closest is "invalid".
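as a sketch only of what "macro-unrolled" means here (C pseudocode in the style of the load examples below; regs[] and VL are illustrative, and an add is shown, but every op follows the same rule):

    static int regs[128];

    // the SV rule as pseudocode: a vector add with VL=N behaves
    // exactly as N scalar adds issued in program order, each element
    // op re-reading its sources and getting its own read and write
    // hazards in the Dependency Matrices.
    void sv_add(int rd, int ra, int rb, int VL) {
        for (int i = 0; i < VL; i++)
            regs[rd + i] = regs[ra + i] + regs[rb + i];
    }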
This bug is for discussing and changing the SimpleV spec, not for the implementation. Reopening.
If we later decide we don't want to change the SimpleV spec as suggested (which will help compiler backends), we can close this bug then.
(In reply to Jacob Lifshay from comment #2)
> This bug is for discussing and changing the SimpleV spec, not for the
> implementation.
>
> Reopening.

ok good call.
(In reply to Jacob Lifshay from comment #3)
> If we later decide we don't want to change the SimpleV spec as suggested
> (which will help compiler backends), we can close this bug then.

ok let's discuss how in hardware this would be implemented:

    void load1(int address_reg, int dest_reg, int N)
    {
        int address = regs[address_reg];
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = *(int *)(address + sizeof(int) * i);
    }

first, it breaks the design of SV, entirely. SV is no longer described as "a macro-unrolling for-loop at the hardware level": it would have to be described as "a hardware for-loop, except for LD, which is now complicated because it reads the address of the first register and uses it as a base address". so that is an immediate red flag.

the second factor is that the base address has to be obtained in one of two ways. the first is to read it in the decode phase, which requires a full stall of the entire engine, waiting until the source register has no dependencies: clearly that is unacceptable. the second method is to have special shadow logic, similar to predication, which is in a similar position. the shadow is thrown across all LDs, waiting for the address reg to be read, at which point that data is broadcast to all waiting LD Function Units. once received, the units may drop the address-waiting shadow (but not the invalid-memory-exception shadow, which must also exist and must separately still be held).

the problem comes when VL is greater than the number of LD/ST Function Units (no, we are not doing 64 LD/ST FUs: the practical limit is 4 LDs, 4 STs). at that point, if an exception occurs halfway through VL=16, then that address value absolutely must be stored on the stack, as part of context-switch state. failure to do so, given that the loop may not be backed out of (because you only gave 4 LD FUs and consequently can only shadow and recall *4* writes... and there are up to 64 to perform...), results in memory corruption.

this is the point at which i say "that is a bad idea", not least because in the current design it is completely unnecessary; it also complicates the design *and* increases context-switch latency *and* increases the number of CSRs.

so, to recap:

* SV as a concept: the simplicity is destroyed.
* context-switch latency is increased.
* hardware complexity is increased.
* the number of CSRs - a precious resource - is increased.

there are no upsides, in other words.
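to make the two semantics concrete, here is a minimal self-contained sketch (the word-indexed memory model and register numbers are illustrative, not from the spec) showing where they diverge when the dest vector overlaps the address register:

    #include <stdio.h>

    // minimal model: regs[] is the register file, mem[] a
    // word-addressed memory, addresses are plain word indices.
    static int regs[8];
    static int mem[16] = {5, 6, 7, 8, 9, 10, 11, 12,
                          13, 14, 15, 16, 17, 18, 19, 20};

    // proposed semantics (load1): base address captured once, up front
    static void load1(int address_reg, int dest_reg, int N) {
        int address = regs[address_reg];
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = mem[address + i];
    }

    // current SV semantics (load2): macro-unrolled scalar loads,
    // re-reading the base register every element
    static void load2(int address_reg, int dest_reg, int N) {
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = mem[regs[address_reg] + i];
    }

    int main(void) {
        // dest vector r2..r5 overlaps the address register r3
        regs[3] = 0;
        load1(3, 2, 4);
        printf("load1: %d %d %d %d\n", regs[2], regs[3], regs[4], regs[5]);
        // prints: load1: 5 6 7 8   (the captured base survives)

        regs[3] = 0;
        load2(3, 2, 4);
        printf("load2: %d %d %d %d\n", regs[2], regs[3], regs[4], regs[5]);
        // prints: load2: 5 6 13 14 (element 1 overwrote the base;
        // elements 2 and 3 load relative to the *loaded data*, exactly
        // as a sequence of scalar instructions would)
        return 0;
    }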
First, the reason to change the semantics: it could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops (due to the compiler not needing to copy the scalar inputs out of the way first).

(In reply to Luke Kenneth Casson Leighton from comment #5)
> first, it breaks the design of SV, entirely. SV is no longer described as
> "a macro-unrolling for-loop at the hardware level"
>
> it would have to be described as "a hardware for-loop, except for LD,
> which is now complicated because it reads the address of the first
> register and uses it as a base address".

The reading of scalar/subvector (henceforth just called scalar) inputs before writing any outputs would be applied to all instructions uniformly, not just LD.

You can consider this conceptually as if all scalar inputs are copied to the context-switching state CSRs by an additional operation which executes before the hardware for-loop: conceptually, a new operation inserted into the issue queue before the vector element operations. The vector element operations then conceptually read all the scalar inputs from the context-switching state CSRs written by the newly-inserted operation, instead of directly from the input registers.

This can be implemented in HW without any additional delay by sending the value of the scalar input registers to the CSRs and to the element operations simultaneously; they don't have to wait for the CSR write to complete.

When resuming from a context-switch with a partially-executed SimpleV instruction (vstart != 0), the copy to the CSRs is omitted, and the element operations just read the scalar inputs from the CSRs.

> this is the point at which i say "that is a bad idea", not least because in
> the current design it is completely unnecessary; it also complicates the
> design *and* increases context-switch latency *and* increases the number
> of CSRs.
>
> so, to recap:
>
> * SV as a concept: the simplicity is destroyed.

It's much less additional conceptual complexity than the register renaming tables you had wanted to add.

> * context-switch latency is increased.

Yes, but it could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops. I'm sure that's an acceptable tradeoff for increasing context-switch state by 512 bits (worst-case estimate).

> * hardware complexity is increased.

Not by much: a mux to allow substituting the CSR values at the ALU (and other ops') inputs, plus the logic to write to the CSRs.

> * the number of CSRs - a precious resource - is increased.

Power still has plenty of CSR space left, so that's not as large of a concern as for RISC-V.
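As pseudocode (a sketch only: the CSR name, the vstart handling, and the word-indexed memory model are illustrative, not proposed spec text), the semantics for a vector LD with a scalar base would be:

    // proposed semantics: scalar inputs are snapshotted into
    // context-switch CSRs by a conceptual extra operation issued
    // ahead of the element operations; the element operations then
    // read the snapshot, never the live register.
    static int regs[128], mem[1 << 16];
    static int scalar_input_csr; // new context-switch state (illustrative)
    static int vstart;           // element index to resume from

    static void sv_ld_proposed(int address_reg, int dest_reg, int VL) {
        if (vstart == 0)
            // the "inserted operation": runs once, before the loop.
            // in HW the value is forwarded to the element ops at the
            // same time as it is written to the CSR, so no extra delay.
            scalar_input_csr = regs[address_reg];
        // on resume after a context-switch (vstart != 0) the snapshot
        // is *not* retaken: the CSR already holds the scalar input.
        for (int i = vstart; i < VL; i++) {
            regs[dest_reg + i] = mem[scalar_input_csr + i];
            vstart = i + 1; // precise restart point if interrupted here
        }
        vstart = 0; // instruction complete
    }

The key property is that the element operations never read the live scalar register, so overwriting it mid-loop (dest overlapping src) no longer matters.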
(In reply to Jacob Lifshay from comment #6)
> First, the reason to change the semantics: it could increase performance a
> lot (at least several %) by reducing compiler-generated copies and reducing
> register pressure in hot loops (due to the compiler not needing to copy the
> scalar inputs out of the way first).

generally i have found that when suggestions like this come up, the suggestion does not topologically change or alter the overall complexity of the State Machine that encompasses the hardware-software combination. what it does instead is *move* some of the resources required to maintain that FSM from hardware into software, or vice versa.

the fact that the scalar inputs need copying should tell you that this complex, resource-intensive task is being requested to be moved - *in full* - into hardware. and that is not a good idea: not least in a key area that is already complex, when we are under time pressure to meet the Oct 2020 deadline.

> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > first, it breaks the design of SV, entirely. SV is no longer described as
> > "a macro-unrolling for-loop at the hardware level"
> >
> > it would have to be described as "a hardware for-loop, except for LD,
> > which is now complicated because it reads the address of the first
> > register and uses it as a base address".
>
> The reading of scalar/subvector (henceforth just called scalar) inputs
> before writing any outputs would be applied to all instructions uniformly,
> not just LD.

if this had been raised even three months ago there would have been plenty of time to discuss such a drastic change to SimpleV, leaving me with the months of time needed to think through the implications properly and fully.

> You can consider this conceptually as if all scalar inputs are copied to
> the context-switching state CSRs by an additional operation which executes
> before the hardware for-loop [...] The vector element operations then
> conceptually read all the scalar inputs from the context-switching state
> CSRs written by the newly-inserted operation, instead of directly from the
> input registers.

that it is multiple CSRs, not just the one, has me even more concerned. like i said: the overall state remains the same. pressure on the regfile to save state becomes pressure on CSRs to save state, with the detriment that the hardware is made far more complex.

> This can be implemented in HW without any additional delay by sending the
> value of the scalar input registers to the CSRs and to the element
> operations simultaneously; they don't have to wait for the CSR write to
> complete.

which is yet more hardware complexity.

> When resuming from a context-switch with a partially-executed SimpleV
> instruction (vstart != 0), the copy to the CSRs is omitted, and the element
> operations just read the scalar inputs from the CSRs.

which is even more complexity in the data routing paths, when we are already under pressure to implement the first prototype.

> > this is the point at which i say "that is a bad idea", not least because
> > in the current design it is completely unnecessary; it also complicates
> > the design *and* increases context-switch latency *and* increases the
> > number of CSRs.
> >
> > so, to recap:
> >
> > * SV as a concept: the simplicity is destroyed.
>
> It's much less additional conceptual complexity than the register renaming
> tables you had wanted to add.

that turns out to be a misunderstanding, given that we were unaware that the 6600's FUs *automatically* provide register renaming. if i had known that, i wouldn't have gone "we need reg renaming!" :)

> > * context-switch latency is increased.
>
> Yes, but it could increase performance a lot (at least several %) by
> reducing compiler-generated copies and reducing register pressure in hot
> loops. I'm sure that's an acceptable tradeoff for increasing context-switch
> state by 512 bits (worst-case estimate).
>
> > * hardware complexity is increased.
>
> Not by much: a mux to allow substituting the CSR values at the ALU (and
> other ops') inputs, plus the logic to write to the CSRs.

too much. as in: it's just too much for me to integrate into the design at this very late stage, doing what is effectively a full redesign and review of something that took me nearly eight months to understand and which, from the ISA perspective, took us 18 months to discuss and review.

if we try to implement this, right now, it will set us back about five months. no, really, that is not an underestimate:

* it would be about 2 weeks of discussion, minimum, for us, doing a full review.
* i will need about a month to think it through, to fully understand it in my head.
* i will also need to write out proper diagrams (floor plans) and study them: two weeks, minimum.
* it will take about four to five weeks just to go through the spec, reviewing each of the types of instructions to add the new proposed stage to. probably longer.
* then the simulator would need updating (3 weeks), and unit tests written (3 weeks).

after all that, which is about four months, i will have a clear idea of how it works and all the implications. *then* i will be able to spend time on the hardware, which will be about a month.

so: five months, because it is such a massive conceptual core redesign, adding an entirely new layer to SV (the capture and storage of scalars in CSRs). none of the above steps looks unreasonably long.

bottom line: we simply have not got the time, at this late phase, to do such a massive, intrusive redesign.
so, apologies: does that make any sense, jacob? basically i'm having an extremely tough time keeping track of the design in my head, which took an exceptionally long time to get straight, in amongst a stack of management, admin *and* other technical tasks; and changing it, particularly to add complexity without a compelling (disaster-avoiding) reason, is just... not a good idea.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> so, apologies: does that make any sense, jacob? basically i'm having
> an extremely tough time keeping track of the design in my head, which
> took an exceptionally long time to get straight, in amongst a stack
> of management, admin *and* other technical tasks; and changing it,
> particularly to add complexity without a compelling (disaster-avoiding)
> reason, is just... not a good idea.

Yeah, that all makes sense. I still think it's a good idea; we just ran out of time for Libre-SOC v1. I was hoping we could change it before we're stuck due to backward-compatibility constraints. I'll leave the bug open to defer it to Libre-SOC v2.
(In reply to Jacob Lifshay from comment #9)
> I was hoping we could change it before we're stuck due to
> backward-compatibility constraints.

fortunately we have ISAMUX/NS.

> I'll leave the bug open to defer it to Libre-SOC v2.

good idea.
read this one again, and reviewed the load1 / load2 pseudocode: the rule that SV is a Sub-PC for-loop has to remain inviolate and fundamental. attempts to break that inviolate rule result in complications when explaining the concept, weakening the simplicity. it is nowhere near worth throwing that away.

additionally, when any given implementation looks at adding SV, the simplicity of it sitting in between issue and decode is now also violated: implementors would be forced to modify the behaviour of their designs to comply with the "exceptional violating behaviour".
that said: if the implementor can work out that it is ABSOLUTELY GUARANTEED that the inviolate rule of Program Order is preserved even in Sub-PC loops - i.e. that it can be respected while reading a scalar register only once - that's great. LOAD, where the src overlaps with the dest, is not one of those cases.
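as a sketch of the check an implementor would need before legally reading the scalar base only once (unit-stride, unpredicated, register numbers illustrative):

    // reading the base once is indistinguishable from per-element
    // re-reads *only* when the dest vector cannot write the base reg.
    int read_once_is_safe(int address_reg, int dest_reg, int VL) {
        return address_reg < dest_reg || address_reg >= dest_reg + VL;
    }
    // read_once_is_safe(3, 2, 4) == 0: r3 lies inside r2..r5; element 1
    //   writes the base, so Program Order demands per-element re-reads.
    // read_once_is_safe(3, 8, 4) == 1: r8..r11 is disjoint from r3; a
    //   single read preserves Program Order exactly.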