See: http://bugs.libre-riscv.org/show_bug.cgi?id=186#c16 and
http://bugs.libre-riscv.org/show_bug.cgi?id=186#c14

(edit: add motivation)

It could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops (due to the compiler not needing to copy the scalar inputs out of the way first).
this is a non-bug, jacob, because the SV for-loop sends individual instructions to the DMs, as if that for-loop (loop2) had been macro-unrolled in the assembly code as scalar instructions. those scalar instructions are *going* to have the correct read and write hazards established (by the DMs)... *just as they would if they were a sequence of scalar instructions*.

consequently, as this behaviour of SV is well-defined and not subject to change, this is a non-bug. we don't have a non-bug category; the closest is "invalid".
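as a sketch only of what "macro-unrolled" means here (C pseudocode in the style of the load examples below; regs[] and VL are illustrative, and an add is shown, but every op follows the same rule):

    static int regs[128];

    // the SV rule as pseudocode: a vector add with VL=N behaves
    // exactly as N scalar adds issued in program order, each element
    // op re-reading its sources and getting its own read and write
    // hazards in the Dependency Matrices.
    void sv_add(int rd, int ra, int rb, int VL) {
        for (int i = 0; i < VL; i++)
            regs[rd + i] = regs[ra + i] + regs[rb + i];
    }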
This bug is for discussing and changing the SimpleV spec, not for the implementation. Reopening.
If we later decide we don't want to change the SimpleV spec as suggested (which will help compiler backends), we can close this bug then.
(In reply to Jacob Lifshay from comment #2)
> This bug is for discussing and changing the SimpleV spec, not for the
> implementation.
>
> Reopening.

ok good call.
(In reply to Jacob Lifshay from comment #3)
> If we later decide we don't want to change the SimpleV spec as suggested
> (which will help compiler backends), we can close this bug then.

ok let's discuss how in hardware this would be implemented:

    void load1(int address_reg, int dest_reg, int N)
    {
        int address = regs[address_reg];
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = *(int *)(address + sizeof(int) * i);
    }

first, it breaks the design of SV, entirely. SV is no longer described as "a macro-unrolling for-loop at the hardware level": it would have to be described as "a hardware for-loop, except for LD, which is now complicated because it reads the address of the first register and uses it as a base address". so that is an immediate red flag.

the second factor is that the base address has to be obtained in one of two ways. the first is to read it in the decode phase, which requires a full stall of the entire engine, waiting until the source register has no dependencies: clearly that is unacceptable. the second method is to have special shadow logic, similar to predication, which is in a similar position. the shadow is thrown across all LDs, waiting for the address reg to be read, at which point that data is broadcast to all waiting LD Function Units. once received, the units may drop the address-waiting shadow (but not the invalid-memory-exception shadow, which must also exist and must separately still be held).

the problem comes when VL is greater than the number of LD/ST Function Units (no, we are not doing 64 LD/ST FUs: the practical limit is 4 LDs, 4 STs). at that point, if an exception occurs halfway through VL=16, then that address value absolutely must be stored on the stack, as part of context-switch state. failure to do so, given that the loop may not be backed out of (because you only gave 4 LD FUs and consequently can only shadow and recall *4* writes... and there are up to 64 to perform...), results in memory corruption.

this is the point at which i say "that is a bad idea", not least because in the current design it is completely unnecessary; it also complicates the design *and* increases context-switch latency *and* increases the number of CSRs.

so, to recap:

* SV as a concept: the simplicity is destroyed.
* context-switch latency is increased.
* hardware complexity is increased.
* the number of CSRs - a precious resource - is increased.

there are no upsides, in other words.
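to make the two semantics concrete, here is a minimal self-contained sketch (the word-indexed memory model and register numbers are illustrative, not from the spec) showing where they diverge when the dest vector overlaps the address register:

    #include <stdio.h>

    // minimal model: regs[] is the register file, mem[] a
    // word-addressed memory, addresses are plain word indices.
    static int regs[8];
    static int mem[16] = {5, 6, 7, 8, 9, 10, 11, 12,
                          13, 14, 15, 16, 17, 18, 19, 20};

    // proposed semantics (load1): base address captured once, up front
    static void load1(int address_reg, int dest_reg, int N) {
        int address = regs[address_reg];
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = mem[address + i];
    }

    // current SV semantics (load2): macro-unrolled scalar loads,
    // re-reading the base register every element
    static void load2(int address_reg, int dest_reg, int N) {
        for (int i = 0; i < N; i++)
            regs[dest_reg + i] = mem[regs[address_reg] + i];
    }

    int main(void) {
        // dest vector r2..r5 overlaps the address register r3
        regs[3] = 0;
        load1(3, 2, 4);
        printf("load1: %d %d %d %d\n", regs[2], regs[3], regs[4], regs[5]);
        // prints: load1: 5 6 7 8   (the captured base survives)

        regs[3] = 0;
        load2(3, 2, 4);
        printf("load2: %d %d %d %d\n", regs[2], regs[3], regs[4], regs[5]);
        // prints: load2: 5 6 13 14 (element 1 overwrote the base;
        // elements 2 and 3 load relative to the *loaded data*, exactly
        // as a sequence of scalar instructions would)
        return 0;
    }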
First, the reason to change the semantics: it could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops (due to the compiler not needing to copy the scalar inputs out of the way first).

(In reply to Luke Kenneth Casson Leighton from comment #5)
> first, it breaks the design of SV, entirely. SV is no longer described as
> "a macro-unrolling for-loop at the hardware level"
>
> it would have to be described as "a hardware for-loop, except for LD,
> which is now complicated because it reads the address of the first
> register and uses it as a base address".

The reading of scalar/subvector (henceforth just called scalar) inputs before writing any outputs would be applied to all instructions uniformly, not just LD.

You can consider this conceptually as if all scalar inputs are copied to the context-switching state CSRs by an additional operation which executes before the hardware for-loop: conceptually, a new operation inserted into the issue queue before the vector element operations. The vector element operations then conceptually read all the scalar inputs from the context-switching state CSRs written by the newly-inserted operation, instead of directly from the input registers.

This can be implemented in HW without any additional delay by sending the value of the scalar input registers to the CSRs and to the element operations simultaneously; they don't have to wait for the CSR write to complete.

When resuming from a context-switch with a partially-executed SimpleV instruction (vstart != 0), the copy to the CSRs is omitted, and the element operations just read the scalar inputs from the CSRs.

> this is the point at which i say "that is a bad idea", not least because in
> the current design it is completely unnecessary; it also complicates the
> design *and* increases context-switch latency *and* increases the number
> of CSRs.
>
> so, to recap:
>
> * SV as a concept: the simplicity is destroyed.

It's much less additional conceptual complexity than the register renaming tables you had wanted to add.

> * context-switch latency is increased.

Yes, but it could increase performance a lot (at least several %) by reducing compiler-generated copies and reducing register pressure in hot loops. I'm sure that's an acceptable tradeoff for increasing context-switch state by 512 bits (worst-case estimate).

> * hardware complexity is increased.

Not by much: a mux to allow substituting the CSR values at the ALU (and other ops') inputs, plus the logic to write to the CSRs.

> * the number of CSRs - a precious resource - is increased.

Power still has plenty of CSR space left, so that's not as large of a concern as for RISC-V.
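As pseudocode (a sketch only: the CSR name, the vstart handling, and the word-indexed memory model are illustrative, not proposed spec text), the semantics for a vector LD with a scalar base would be:

    // proposed semantics: scalar inputs are snapshotted into
    // context-switch CSRs by a conceptual extra operation issued
    // ahead of the element operations; the element operations then
    // read the snapshot, never the live register.
    static int regs[128], mem[1 << 16];
    static int scalar_input_csr; // new context-switch state (illustrative)
    static int vstart;           // element index to resume from

    static void sv_ld_proposed(int address_reg, int dest_reg, int VL) {
        if (vstart == 0)
            // the "inserted operation": runs once, before the loop.
            // in HW the value is forwarded to the element ops at the
            // same time as it is written to the CSR, so no extra delay.
            scalar_input_csr = regs[address_reg];
        // on resume after a context-switch (vstart != 0) the snapshot
        // is *not* retaken: the CSR already holds the scalar input.
        for (int i = vstart; i < VL; i++) {
            regs[dest_reg + i] = mem[scalar_input_csr + i];
            vstart = i + 1; // precise restart point if interrupted here
        }
        vstart = 0; // instruction complete
    }

The key property is that the element operations never read the live scalar register, so overwriting it mid-loop (dest overlapping src) no longer matters.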
(In reply to Jacob Lifshay from comment #6)
> First, the reason to change the semantics: it could increase performance a
> lot (at least several %) by reducing compiler-generated copies and reducing
> register pressure in hot loops (due to the compiler not needing to copy the
> scalar inputs out of the way first).

generally i have found that when suggestions like this come up, the suggestion does not topologically change or alter the overall complexity of the State Machine that encompasses the hardware-software combination. what it does instead is *move* some of the resources required to maintain that FSM from hardware into software, or vice versa.

the fact that the scalar inputs need copying should tell you that this complex, resource-intensive task is being requested to be moved - *in full* - into hardware. and that is not a good idea: not least in a key area that is already complex, when we are under time pressure to meet the Oct 2020 deadline.

> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > first, it breaks the design of SV, entirely. SV is no longer described as
> > "a macro-unrolling for-loop at the hardware level"
> >
> > it would have to be described as "a hardware for-loop, except for LD,
> > which is now complicated because it reads the address of the first
> > register and uses it as a base address".
>
> The reading of scalar/subvector (henceforth just called scalar) inputs
> before writing any outputs would be applied to all instructions uniformly,
> not just LD.

if this had been raised even three months ago there would have been plenty of time to discuss such a drastic change to SimpleV, leaving me with the months of time needed to think through the implications properly and fully.

> You can consider this conceptually as if all scalar inputs are copied to
> the context-switching state CSRs by an additional operation which executes
> before the hardware for-loop [...] The vector element operations then
> conceptually read all the scalar inputs from the context-switching state
> CSRs written by the newly-inserted operation, instead of directly from the
> input registers.

that it is multiple CSRs, not just the one, has me even more concerned. like i said: the overall state remains the same. pressure on the regfile to save state becomes pressure on CSRs to save state, with the detriment that the hardware is made far more complex.

> This can be implemented in HW without any additional delay by sending the
> value of the scalar input registers to the CSRs and to the element
> operations simultaneously; they don't have to wait for the CSR write to
> complete.

which is yet more hardware complexity.

> When resuming from a context-switch with a partially-executed SimpleV
> instruction (vstart != 0), the copy to the CSRs is omitted, and the element
> operations just read the scalar inputs from the CSRs.

which is even more complexity in the data routing paths, when we are already under pressure to implement the first prototype.

> > this is the point at which i say "that is a bad idea", not least because
> > in the current design it is completely unnecessary; it also complicates
> > the design *and* increases context-switch latency *and* increases the
> > number of CSRs.
> >
> > so, to recap:
> >
> > * SV as a concept: the simplicity is destroyed.
>
> It's much less additional conceptual complexity than the register renaming
> tables you had wanted to add.

that turns out to be a misunderstanding, given that we were unaware that the 6600's FUs *automatically* provide register renaming. if i had known that, i wouldn't have gone "we need reg renaming!" :)

> > * context-switch latency is increased.
>
> Yes, but it could increase performance a lot (at least several %) by
> reducing compiler-generated copies and reducing register pressure in hot
> loops. I'm sure that's an acceptable tradeoff for increasing context-switch
> state by 512 bits (worst-case estimate).
>
> > * hardware complexity is increased.
>
> Not by much: a mux to allow substituting the CSR values at the ALU (and
> other ops') inputs, plus the logic to write to the CSRs.

too much. as in: it's just too much for me to integrate into the design at this very late stage, doing what is effectively a full redesign and review of something that took me nearly eight months to understand and which, from the ISA perspective, took us 18 months to discuss and review.

if we try to implement this, right now, it will set us back about five months. no, really, that is not an underestimate:

* it would be about 2 weeks of discussion, minimum, for us, doing a full review.
* i will need about a month to think it through, to fully understand it in my head.
* i will also need to write out proper diagrams (floor plans) and study them: two weeks, minimum.
* it will take about four to five weeks just to go through the spec, reviewing each of the types of instructions to add the new proposed stage to. probably longer.
* then the simulator would need updating (3 weeks), and unit tests written (3 weeks).

after all that, which is about four months, i will have a clear idea of how it works and all the implications. *then* i will be able to spend time on the hardware, which will be about a month.

so: five months, because it is such a massive conceptual core redesign, adding an entirely new layer to SV (the capture and storage of scalars in CSRs). none of the above steps looks unreasonably long.

bottom line: we simply have not got the time, at this late phase, to do such a massive, intrusive redesign.
so, apologies: does that make any sense, jacob? basically i'm having an extremely tough time keeping track of the design in my head, which took an exceptionally long time to get straight, in amongst a stack of management, admin *and* other technical tasks; and changing it, particularly to add complexity without a compelling (disaster-avoiding) reason, is just... not a good idea.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> so, apologies: does that make any sense, jacob? basically i'm having
> an extremely tough time keeping track of the design in my head, which
> took an exceptionally long time to get straight, in amongst a stack
> of management, admin *and* other technical tasks; and changing it,
> particularly to add complexity without a compelling (disaster-avoiding)
> reason, is just... not a good idea.

Yeah, that all makes sense. I still think it's a good idea; we just ran out of time for Libre-SOC v1. I was hoping we could change it before we're stuck due to backward-compatibility constraints. I'll leave the bug open to defer it to Libre-SOC v2.
(In reply to Jacob Lifshay from comment #9)
> I was hoping we could change it before we're stuck due to
> backward-compatibility constraints.

fortunately we have ISAMUX/NS.

> I'll leave the bug open to defer it to Libre-SOC v2.

good idea.
read this one again, and reviewed the load1 / load2 pseudocode: the rule that SV is a Sub-PC for-loop has to remain inviolate and fundamental. attempts to break that inviolate rule result in complications when explaining the concept, weakening the simplicity. it is nowhere near worth throwing that away.

additionally, when any given implementation looks at adding SV, the simplicity of it sitting in between issue and decode is now also violated: implementors would be forced to modify the behaviour of their designs to comply with the "exceptional violating behaviour".
that said: if the implementor can work out that it is ABSOLUTELY GUARANTEED that the inviolate rule of Program Order is preserved even in Sub-PC loops - i.e. that it can be respected while reading a scalar register only once - that's great. LOAD, where the src overlaps with the dest, is not one of those cases.
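as a sketch of the check an implementor would need before legally reading the scalar base only once (unit-stride, unpredicated, register numbers illustrative):

    // reading the base once is indistinguishable from per-element
    // re-reads *only* when the dest vector cannot write the base reg.
    int read_once_is_safe(int address_reg, int dest_reg, int VL) {
        return address_reg < dest_reg || address_reg >= dest_reg + VL;
    }
    // read_once_is_safe(3, 2, 4) == 0: r3 lies inside r2..r5; element 1
    //   writes the base, so Program Order demands per-element re-reads.
    // read_once_is_safe(3, 8, 4) == 1: r8..r11 is disjoint from r3; a
    //   single read preserves Program Order exactly.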