101 – IEEE754 pipeline "go_die" (Computation Unit Cancellation) needed

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	ALU (including IEEE754 16/32/64-bit FPU)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:

Depends on:
Blocks:	115 116 48

Reported:	2019-06-28 07:23 BST by Luke Kenneth Casson Leighton
Modified:	2022-06-18 19:54 BST
CC List:	2 users

See Also:
NLnet milestone:	NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Description Luke Kenneth Casson Leighton 2019-06-28 07:23:39 BST

scoreboard-style pipelines do not "stall", however they do need cancellation.
this really needs to be a global signal that goes to all stages, telling
the pipeline that it should no longer be sending data with a certain muxid
further down the pipe(s).  this is probably best as unary because, clearly,
multiple cancellations may occur in any given clock.

Comment 1 Jacob Lifshay 2019-06-28 07:32:40 BST

all that's needed is a flip-flop chain that marks if the matching pipeline stage has valid data or not. No need to modify the data in the pipeline, since that takes up more gates and that pipeline slot can't be used again anyway.

Comment 2 Luke Kenneth Casson Leighton 2019-06-28 07:54:26 BST

(In reply to Jacob Lifshay from comment #1)
> all that's needed is a flip-flop chain that marks if the matching pipeline
> stage has valid data or not. No need to modify the data in the pipeline,
> since that takes up more gates and that pipeline slot can't be used again
> anyway.

the data's not important: the muxid is.

what you suggest has the unfortunate unintended side-effect of making
the muxid impossible to re-use (immediately, either in the same or a
subsequent cycle).

to work around that, it becomes necessary to add extra fields to muxid
to say "hey i am a valid muxid where the previous one wasn't", errrr...

or, you simply extend the number of Reservation Stations (and FUs)
to *DOUBLE* the length of the pipeline, such that the "cancelled"
items don't block things up.

that in turn creates far more gates than adding "cancellation of muxid"
ever would.

for FPDIV we're already going to need a *massive* number of Function Units,
otherwise we get a processing backlog at the FU / Reservation Stage.

in a Concurrent Computation Unit design (pipelines with fan-in and
fan-out, basically) if the number of FUs / Reservation Stations is
less than the pipeline length, there are not enough inputs/results
"storage latches" to match "data in the pipeline".

* 3-wide fan-in, fan-out means 3 FUs
* 4-long pipeline can process 4 sets of operands and produce only
  one result per cycle
* 5 pieces of data need processing
* the first 3 sets of operands go into the 3 FU "Reservation Stations",
  no problem.
* the 4th *STALLS* the *ENTIRE* engine - and i do mean the *ENTIRE*
  instruction issue stage, freezing *ALL* further processing - for
  one clock cycle, waiting for one of the 3 FU "RSs" to become free.
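
The stall in the list above can be reproduced with a small timing model. This is only a sketch under stated assumptions: one set of operands is issued per cycle, and a Reservation Station stays busy for exactly pipe_len cycles after issue. simulate_issue and its parameter names are hypothetical, not part of the soc or ieee754fpu code.

```python
def simulate_issue(num_rs, pipe_len, num_items):
    """Return the cycle at which each item is issued.

    Assumptions (hypothetical model): one issue attempt per cycle, and an
    RS is released exactly pipe_len cycles after its operands were issued.
    """
    free_at = [0] * num_rs            # cycle at which each RS is free again
    issue_cycles = []
    cycle = 0
    for _ in range(num_items):
        # pick the RS that becomes free soonest
        rs = min(range(num_rs), key=lambda i: free_at[i])
        # if no RS is free yet, the whole issue stage waits
        cycle = max(cycle, free_at[rs])
        issue_cycles.append(cycle)
        free_at[rs] = cycle + pipe_len
        cycle += 1
    return issue_cycles


three_rs = simulate_issue(num_rs=3, pipe_len=4, num_items=5)
four_rs = simulate_issue(num_rs=4, pipe_len=4, num_items=5)
```

under these assumptions the 4th item issues one cycle late with 3 RSs, and there is no stall once the number of RSs matches the pipeline length.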

so if a muxid is not cancelled, you have to *wait* for that MUXID to
pop out the result end, at which point you go, "oh!  this result
was actually cancelled a few clocks ago, but thank you for letting
us know: *now* we can drop the busy_o signal on this Computation Unit":

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/experiment/compalu.py;h=7da6b5cf9fa06b9bfa766f769ac19f4f49caf90b;hb=734d6ca4e4a4f6ea3d6038a54e50de5d76d9618b#l64

and *only* when busy_o is dropped can the engine continue.
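
to put rough numbers on that wait, here is a tiny sketch comparing when an RS could be released with and without immediate cancellation. rs_free_cycle is a hypothetical helper, and the assumption that an immediately-cancelled muxid is usable on the clock after the cancel is a modelling choice, not something taken from compalu.py.

```python
def rs_free_cycle(issue_cycle, pipe_len, cancel_cycle=None, immediate=False):
    """Cycle at which busy_o could drop for one Reservation Station.

    Hypothetical model: without immediate cancellation the RS waits for the
    (cancelled) result to leave the pipeline; with it, the RS is released on
    the clock after the cancel.
    """
    result_cycle = issue_cycle + pipe_len
    if cancel_cycle is not None and immediate:
        return min(cancel_cycle + 1, result_cycle)
    return result_cycle


waits_for_result = rs_free_cycle(0, 4, cancel_cycle=1, immediate=False)
freed_early = rs_free_cycle(0, 4, cancel_cycle=1, immediate=True)
```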

it's just one of those things about 6600-style scoreboards: the tomasulo
algorithm works around this by making the Reservation Stations multi-entry,
however the penalty for doing so is that the entire scoreboard (now a ROB
in Tomasulo) has to use a CAM, in order to differentiate these multiple
entries.

bottom line is: being able to do global (immediate) cancellation of
the muxid is really quite important.  the *data* is actually not
as important, because without the muxid the data isn't routed at
the Fan-Out stage in the ReservationStation class.

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpdiv/pipeline.py;h=7c130fae0539a48c235b331f70b8e3f451da0eb6;hb=e47839fad581624f248e13b05b2ca9d9975d2f95#l36

Comment 3 Luke Kenneth Casson Leighton 2019-07-30 00:42:27 BST

http://bugs.libre-riscv.org/show_bug.cgi?id=126#c2

A new class, to replace use of SimpleHandshake, is needed. Its logic:

* a unary muxid is unconditionally propagated down the pipeline (sync).
* where the muxid is nonzero, the registers from the output are propagated (conditionally) to the next pipe stage.
* a global unary "cancel" signal is accessible by all pipe stages.
* if a bit in the "cancel" matches in the unary muxid, the output is NOT propagated to the next stage, and neither is the muxid.

This gives data "freezing" characteristics, reducing power consumption when a muxid is zero.

The cancel mask is global and is NOT a pipeline-staged register: it is a FULLY GLOBAL signal that results in IMMEDIATE cancellation, across all stages, of any and all muxids whose bits are set in it.

Thus any cancelled muxids may immediately be used on the next clock cycle.
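
The rules listed above can be sketched as a toy cycle-level model in plain Python. This is an illustration only, not the nmutil implementation: CancellablePipe, tick and the tuple layout are hypothetical names, and "freezing" is modelled by leaving a stage's data slot unchanged when no valid muxid arrives.

```python
class CancellablePipe:
    """Toy model of a pipeline with unary muxids and a global cancel mask."""

    def __init__(self, num_stages):
        # each stage holds (unary_muxid, data); muxid 0 means "empty"
        self.stages = [(0, None)] * num_stages

    def tick(self, in_muxid=0, in_data=None, cancel=0):
        """Advance one clock; `cancel` is a unary mask seen by every stage."""
        out_muxid, out_data = self.stages[-1]
        if out_muxid == 0 or (out_muxid & cancel):
            out_muxid, out_data = 0, None      # nothing valid leaves the pipe
        new_stages = []
        incoming = (in_muxid, in_data)
        for held in self.stages:
            muxid, data = incoming
            if muxid == 0 or (muxid & cancel):
                # empty or cancelled: muxid cleared, data register is frozen
                new_stages.append((0, held[1]))
            else:
                new_stages.append((muxid, data))
            incoming = held
        self.stages = new_stages
        return out_muxid, out_data


pipe = CancellablePipe(3)
pipe.tick(0b001, "a")
pipe.tick(0b010, "b")
pipe.tick(cancel=0b001)        # cancel muxid 0b001 while "a" is in flight
pipe.tick(0b001, "c")          # muxid 0b001 is re-used on the next clock
results = [pipe.tick() for _ in range(3)]
```

In this model "a" never reaches the output, while "b" and the re-issued "c" both arrive with their muxids intact.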

It is potentially possible, with a lot of messing about, that the cancelled muxid could be re-used on the same cycle that it is cancelled.

The unary muxids are unique to each ReservationStation and must NOT occur twice in any pipeline.

This is absolutely essential. No matter how large or how many pipes there are connected to ReservationStations, each muxid MUST occur once and only once across all stages.

This is for two reasons.

1. So that global cancellation works

2. So that on exit from the pipeline (no matter how complex the stage routing) it is GUARANTEED that the data associated with the (unique) muxid WILL get back to the correct ReservationStation result latch.
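
The "once and only once" rule lends itself to a simple check, for example in a simulation testbench. A minimal sketch, assuming the unary muxids of all stages have been collected as plain integers; muxids_unique is a hypothetical helper name.

```python
def muxids_unique(stage_muxids):
    """True if no unary muxid bit is set in more than one pipeline stage."""
    seen = 0
    for muxid in stage_muxids:
        if seen & muxid:          # this bit is already in flight elsewhere
            return False
        seen |= muxid
    return True


ok = muxids_unique([0b001, 0b100, 0b000])
clash = muxids_unique([0b001, 0b011])
```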

Comment 4 Luke Kenneth Casson Leighton 2019-07-30 00:48:47 BST

Also: I'd like to use dynamic classes, here. If people want to use SimpleHandshake, or something else, the code should enable them to do that.

It means using 3 arg type() and __new__.

https://www.tutorialdocs.com/article/python-class-dynamically.html
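
A minimal sketch of the 3-argument type() approach, assuming the pipeline base class is simply passed in as a parameter. The stub classes and make_pipe_class are hypothetical stand-ins, not the ieee754fpu API, and the __new__ part is not shown.

```python
class SimpleHandshakeStub:
    """Stand-in for a real pipeline-building base class (hypothetical)."""
    kind = "handshake"


class OtherPipeStub:
    """A second, interchangeable stand-in base class (hypothetical)."""
    kind = "other"


def make_pipe_class(name, base):
    """Create a stage class whose base class is chosen at run time."""
    def describe(self):
        return f"{type(self).__name__} built on {self.kind}"
    # type(name, bases, namespace) builds a new class object dynamically
    return type(name, (base,), {"describe": describe})


PipeA = make_pipe_class("FPDivStage", SimpleHandshakeStub)
PipeB = make_pipe_class("FPDivStage", OtherPipeStub)
```

The same stage definition then works on top of either base without being rewritten.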

Comment 5 Luke Kenneth Casson Leighton 2019-08-01 17:22:30 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> Also: I'd like to use dynamic classes, here. If people want to use
> SimpleHandshake, or something else, the code should enable them to do that.
> 
> It means using 3 arg type() and __new__.
> 
> https://www.tutorialdocs.com/article/python-class-dynamically.html

that's done - it works.  i've not tested an alternative class, yet,
however i have confidence that a class can be dropped in place.
what it means is: for users of the IEEE754FPU API who want to do
single-issue, they can use whatever pipeline-building class they like.

for "Cancellation", i looked at the existing fan-in and fan-out classes,
and realised that changing them from the existing binary "muxid" is
not only quite a lot of work, it's also counterproductive.

so instead, "Cancellation" should be viewed effectively as "speculative
predication", and a predication "mask", followed up by a *cancellation*
request, should be *added alongside* the use of muxid rather than
*replacing* it.

the OoO design requires this "speculative predication" anyway.  it works
by sending computations to [SIMD] elements *before* the actual predicate
is known, doing a (separate) read on the register containing the predicate,
working out which elements are to be "cancelled" and which not to be,
then following up with a "cancellation" mask, broadcast to all elements
currently in the middle of performing actual work.
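
as a sketch of that last step: once the predicate has been read, the cancellation mask is just the set of elements that were issued speculatively but whose predicate bit turned out to be zero. cancel_mask is a hypothetical helper, with one bit per element assumed.

```python
def cancel_mask(issued, predicate):
    """Elements issued speculatively whose predicate bit is 0 (one bit each)."""
    return issued & ~predicate


# 4 elements issued; the predicate later enables only elements 0 and 2
mask = cancel_mask(issued=0b1111, predicate=0b0101)
```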

for Vectorisation, for using SIMD back-ends, we can simply "pre-mask"
the (non-matching) elements at the end of the vector.  so if there are
only 3 elements to be calculated, and the SIMD back-end requires 4,
the last element is "masked out".
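
a minimal sketch of that pre-masking, assuming one mask bit per SIMD lane with lane 0 in the least-significant bit. lane_masks is a hypothetical helper name.

```python
def lane_masks(vector_len, simd_width):
    """One unary mask per SIMD operation; unused tail lanes are left at 0."""
    masks = []
    for start in range(0, vector_len, simd_width):
        active = min(simd_width, vector_len - start)
        masks.append((1 << active) - 1)
    return masks


short_vector = lane_masks(vector_len=3, simd_width=4)
longer_vector = lane_masks(vector_len=6, simd_width=4)
```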

this technique would use *exactly* the same "mask" system as is required
to perform "cancellation", so adding "Cancellation capability" effectively
is the fundamental basis for both predication and Vectorisation.

Comment 6 Luke Kenneth Casson Leighton 2019-08-02 08:26:11 BST

damn. first experiment to blur the lines between data and control: fail.

data is contained in the "Stage" API, and is passed through with a
call to nmoperator.eq, and is otherwise untouched and completely opaque
to the "pipeline" part that "handles" control and routing.

i intended to add cancel and mask to the *data*, and to break the
above rule by allowing the *control* side to inspect the data.

the problem comes when trying to stop CMOS gate-flipping, as we talked
about a few days ago, jacob.  with the data being both passed across
through registers *and* now effectively containing a "control" signal
(the mask), the only way to get the data to discern when data should
NOT be passed from pipeline register to pipeline register is to MODIFY
that data, EXCLUDING sending the "control" signal (the mask).

that would cause such a coding mess that i don't even want to try it,
especially when doing so breaks the separation between data and
control/routing in the first place.

so instead i will look at how to add the "mask" and "cancel" signals
to the *handling* side (PrevControl and NextControl), where, really,
it should have been in the first place.

Comment 7 Jacob Lifshay 2019-08-02 08:28:28 BST

(In reply to Luke Kenneth Casson Leighton from comment #6)
> so instead i will look at how to add the "mask" and "cancel" signals
> to the *handling* side (PrevControl and NextControl), where, really,
> it should have been in the first place.

Sounds good to me.

Comment 8 Luke Kenneth Casson Leighton 2019-08-03 23:40:12 BST

(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > so instead i will look at how to add the "mask" and "cancel" signals
> > to the *handling* side (PrevControl and NextControl), where, really,
> > it should have been in the first place.
> 
> Sounds good to me.

okaay, it's a bit of a mess (the code in nmutil/iocontrol.py),
however in the latest unit test, cancellation seems to work.

i broke the valid/ready rules, somewhat: the "mask", if non-zero,
basically takes over.  in effect, it turns the pipeline into
one of dan gisselquist's simpler types: the "global CE" one.
so it's critically important *not* to try to set the "recv not ready"
signal, because the data *will* be sent, regardless.

the masks (basically predication bits) are "Cat'd" together as
they are funneled into the Multi-In Mux pipeline, so that none of
them are lost.  they're then funneled *back out* again at the
other end (Multi-Out fan-out stage).
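
a plain-integer sketch of that funnelling, assuming every input port has a mask of the same width and that port 0 lands in the least-significant bits (as with nmigen's Cat). cat_masks and split_masks are hypothetical helpers, not the nmutil code.

```python
def cat_masks(masks, width):
    """Concatenate per-port masks; port 0 occupies the low bits."""
    out = 0
    for port, mask in enumerate(masks):
        out |= (mask & ((1 << width) - 1)) << (port * width)
    return out


def split_masks(cat, num_ports, width):
    """Inverse of cat_masks: recover each port's mask at the fan-out end."""
    lane = (1 << width) - 1
    return [(cat >> (port * width)) & lane for port in range(num_ports)]


combined = cat_masks([0b01, 0b11, 0b10], width=2)
recovered = split_masks(combined, num_ports=3, width=2)
```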

i allowed the option in the API to specify multi-bit masks, so
that we can have SIMD element cancellation as well.  hmm, just
fixed that: i'd missed that if something is cancelled it shouldn't
continue to be passed down the chain! whoops.

Comment 9 Luke Kenneth Casson Leighton 2019-08-06 12:46:49 BST

https://git.libre-riscv.org/?p=ieee754fpu.git;a=commit;h=65e4f1069622b79a5782ef0754610c8d309439f0

unit test success on FPDIV!  so that's a cancellable mask-capable pipeline.

Comment 10 Luke Kenneth Casson Leighton 2019-08-07 01:27:53 BST

i just realised that if using MaskCancellable for early-in, early-out
and pipe-feedback, it could stall, as there would be circumstances
where incoming data needs priority-routing (muxing) with the feedback
data.

that in turn means that MaskCancellable has to properly respect the
ready/valid Data IO Handling API.

drat.

so i've changed MaskCancellable to be based fully on SimpleHandshake.
it's necessary to be careful there because it's based on a
combinatorial chain of ready/valid signalling.

unit tests pass.