scoreboard-style pipelines do not "stall", however they do need cancellation. this really needs to be a global signal that goes to all stages, telling the pipeline that it should no longer be sending data with a certain muxid further down the pipe(s). this is probably best as unary because, clearly, multiple cancellations may occur in any given clock.
all that's needed is a flip-flop chain that marks if the matching pipeline stage has valid data or not. No need to modify the data in the pipeline, since that takes up more gates and that pipeline slot can't be used again anyway.
(In reply to Jacob Lifshay from comment #1)
> all that's needed is a flip-flop chain that marks if the matching pipeline
> stage has valid data or not. No need to modify the data in the pipeline,
> since that takes up more gates and that pipeline slot can't be used again
> anyway.

the data's not important: the muxid is. what you suggest has the unfortunate unintended side-effect of making the muxid impossible to use (immediately, either in the same or a subsequent cycle).

to work around that, it becomes necessary to add extra fields to muxid to say "hey i am a valid muxid where the previous one wasn't", errrr... or, you simply extend the number of Reservation Stations (and FUs) to *DOUBLE* the length of the pipeline, such that the "cancelled" items don't block things up. that in turn creates far more gates than adding "cancellation of muxid" ever would.

for FPDIV we're already going to need a *massive* number of Function Units, otherwise we get a processing backlog at the FU / Reservation Station.

in a Concurrent Computation Unit design (pipelines with fan-in and fan-out, basically) if the number of FUs / Reservation Stations is less than the pipeline length, there are not enough inputs/results "storage latches" to match "data in the pipeline".

* 3-wide fan-in, fan-out means 3 FUs
* 4-long pipeline can process 4 sets of operands and produce only one result per cycle
* 5 pieces of data need processing
* the first 3 sets of operands go into the 3 FU "Reservation Stations", no problem.
* the 4th *STALLS* the *ENTIRE* engine - and i do mean the *ENTIRE* instruction issue stage, freezing *ALL* further processing - for one clock cycle, waiting for one of the 3 FU "RSs" to become free.

so if a muxid is not cancelled, you have to *wait* for that MUXID to pop out the result end, at which point you go, "oh! this result was actually cancelled a few clocks ago, but thank you for letting us know: *now* we can drop the "busy_o" signal on this Computation Unit":

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/experiment/compalu.py;h=7da6b5cf9fa06b9bfa766f769ac19f4f49caf90b;hb=734d6ca4e4a4f6ea3d6038a54e50de5d76d9618b#l64

and *only* when busy_o is dropped can the engine continue.

it's just one of those things about 6600-style scoreboards: the tomasulo algorithm works around this by making the Reservation Stations multi-entry, however the penalty for doing so is that the entire scoreboard (now a ROB in Tomasulo) has to use a CAM, in order to differentiate these multiple entries.

bottom line is: being able to do global (immediate) cancellation of the muxid is really quite important. the *data* is actually not as important, because without the muxid the data isn't routed at the Fan-Out stage in the ReservationStation class.

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpdiv/pipeline.py;h=7c130fae0539a48c235b331f70b8e3f451da0eb6;hb=e47839fad581624f248e13b05b2ca9d9975d2f95#l36
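The 3-FU, 4-stage arithmetic in this comment can be checked with a small plain-Python model. This is only a sketch under stated assumptions: `count_issue_stalls` is a hypothetical helper, one operand set is issued per cycle, and a Reservation Station is held from its issue cycle until its result emerges `pipe_len` cycles later. None of this is taken from the soc or ieee754fpu code.

```python
def count_issue_stalls(num_rs, pipe_len, num_ops):
    """Count cycles in which issue must stall because no RS is free.

    Assumed model (hypothetical, not the real design): one operand set
    is issued per cycle, and an RS stays busy from its issue cycle
    until its result emerges pipe_len cycles later.
    """
    free_at = [0] * num_rs   # cycle at which each RS can next accept operands
    cycle = 0
    stalls = 0
    for _ in range(num_ops):
        earliest = min(free_at)
        if earliest > cycle:
            # every RS is still waiting for its result: issue stalls
            stalls += earliest - cycle
            cycle = earliest
        rs = free_at.index(earliest)
        free_at[rs] = cycle + pipe_len
        cycle += 1
    return stalls


if __name__ == "__main__":
    print(count_issue_stalls(num_rs=3, pipe_len=4, num_ops=5))
    print(count_issue_stalls(num_rs=4, pipe_len=4, num_ops=5))
```

Under these assumptions, having fewer Reservation Stations than pipeline stages forces a stall, and matching the two removes it. A cancelled-but-not-freed muxid behaves like a Reservation Station that stays busy for the full pipeline length even though its result is no longer wanted.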
http://bugs.libre-riscv.org/show_bug.cgi?id=126#c2

A new class, to replace use of SimpleHandshake, is needed. Its logic:

* a unary muxid is unconditionally propagated down the pipeline (sync).
* where the muxid is nonzero, the registers from the output are propagated (conditionally) to the next pipe stage.
* a global unary "cancel" signal is accessible by all pipe stages.
* if a bit in the "cancel" signal matches a bit in the unary muxid, the output is NOT propagated to the next stage, and neither is the muxid.

This gives data "freezing" characteristics, reducing power consumption when a muxid is zero.

The cancel mask is global and is NOT a pipeline staged register: it is a FULLY GLOBAL signal that results in IMMEDIATE cancellation, on a global basis, of any and all muxids set. Thus any cancelled muxids may immediately be used on the next clock cycle. It is potentially possible, with a lot of messing about, that the cancelled muxid could be re-used on the same cycle that it is cancelled.

The unary muxids are unique to each ReservationStation and must NOT occur twice in any pipeline. This is absolutely essential. No matter how large the pipes connected to the ReservationStations are, or how many of them there are, each muxid MUST occur once and only once across all stages. This is for two reasons:

1. So that global cancellation works.
2. So that on exit from the pipeline (no matter how complex the stage routing) it is GUARANTEED that the data associated with the (unique) muxid WILL get back to the correct ReservationStation result latch.
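The rules above can be sketched as a behavioural model in plain Python (not nmigen). `CancellablePipe` and its `tick` method are hypothetical names for illustration only. The model assumes the cancel mask is applied to every stage, and to the incoming muxid, in the same clock, so it does not attempt the same-cycle re-use case.

```python
class CancellablePipe:
    """Behavioural sketch: each stage holds a (unary muxid, data) pair."""

    def __init__(self, num_stages):
        self.muxid = [0] * num_stages      # 0 means "slot empty"
        self.data = [None] * num_stages

    def tick(self, in_muxid=0, in_data=None, cancel=0):
        """Advance one clock; return (muxid, data) leaving the last stage."""
        # what each stage sees at its input: the new item, then the
        # current contents of every stage register
        src_mux = [in_muxid] + self.muxid
        src_dat = [in_data] + self.data
        # the global cancel mask gates every muxid at once (not staged)
        live = [m & ~cancel for m in src_mux]
        out = (live[-1], src_dat[-1] if live[-1] else None)
        for i in range(len(self.muxid)):
            self.muxid[i] = live[i]        # muxid always moves (or clears)
            if live[i]:
                self.data[i] = src_dat[i]  # data moves only with a live muxid
            # otherwise the data register is left "frozen"
        return out


if __name__ == "__main__":
    pipe = CancellablePipe(2)
    pipe.tick(0b001, "a")
    pipe.tick(0b010, "b")
    pipe.tick(cancel=0b001)                # cancel "a" everywhere at once
    print(pipe.tick(0b001, "c"))           # muxid 0b001 re-used next clock
```

Because the muxid is unary and unique, the cancel check is a single AND per stage, and several muxids can be cancelled in the same clock by setting several bits of `cancel`.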
Also: I'd like to use dynamic classes, here. If people want to use SimpleHandshake, or something else, the code should enable them to do that. It means using 3 arg type() and __new__. https://www.tutorialdocs.com/article/python-class-dynamically.html
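As a rough illustration of the 3-arg type() plus __new__ idea, here is a minimal sketch. All class names here (`DynamicPipe`, `SimpleHandshakeStub`, `GlobalEnableStub`) are hypothetical stand-ins and are not the classes in nmutil or ieee754fpu. The control base class is assumed to be a plain Python class that needs no constructor arguments.

```python
class SimpleHandshakeStub:
    """Hypothetical stand-in for a ready/valid pipeline control class."""
    def describe(self):
        return "handshake"


class GlobalEnableStub:
    """Hypothetical stand-in for an alternative pipeline control class."""
    def describe(self):
        return "global-ce"


class DynamicPipe:
    """Picks its pipeline-control base class at construction time."""

    def __new__(cls, pipekls, *args, **kwargs):
        # type(name, bases, namespace) builds a brand new class that
        # inherits from both this class and the requested control class
        newcls = type(cls.__name__, (cls, pipekls), {})
        return object.__new__(newcls)

    def __init__(self, pipekls, width):
        # pipekls was already consumed by __new__; only width is stored
        self.width = width


if __name__ == "__main__":
    p = DynamicPipe(SimpleHandshakeStub, 8)
    q = DynamicPipe(GlobalEnableStub, 8)
    print(p.describe(), q.describe())
```

The point of doing it this way is that the stage code is written once, and the choice of handshake style becomes an argument rather than a hard-coded base class.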
(In reply to Luke Kenneth Casson Leighton from comment #4)
> Also: I'd like to use dynamic classes, here. If people want to use
> SimpleHandshake, or something else, the code should enable them to do that.
>
> It means using 3 arg type() and __new__.
>
> https://www.tutorialdocs.com/article/python-class-dynamically.html

that's done - it works. i've not tested an alternative class, yet, however i have confidence that a class can be dropped in place.

what it means is: for users of the IEEE754FPU API who want to do single-issue, they can use whatever pipeline-building class they like.

for "Cancellation", i looked at the existing fan-in and fan-out classes, and realised that changing them from the existing binary "muxid" is not only quite a lot of work, it's also counterproductive.

so instead, "Cancellation" should be viewed effectively as "speculative predication", and a predication "mask", followed up by a *cancellation* request, should be *added to* rather than *replace* the use of muxid.

the OoO design requires this "speculative predication" anyway. it works by sending computations to [SIMD] elements *before* the actual predicate is known, doing a (separate) read on the register containing the predicate, working out which elements are to be "cancelled" and which are not, then following up with a "cancellation" mask, broadcast to all elements currently in the middle of performing actual work.

for Vectorisation, for using SIMD back-ends, we can simply "pre-mask" the (non-matching) elements at the end of the vector. so if there are only 3 elements to be calculated, and the SIMD back-end requires 4, the last element is "masked out".

this technique would use *exactly* the same "mask" system as is required to perform "cancellation", so adding "Cancellation capability" effectively is the fundamental basis for both predication and Vectorisation.
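The mask arithmetic described here is small enough to sketch with plain Python integers. The helper names (`tail_mask`, `predicate_to_cancel`, `apply_cancel`) are hypothetical. The assumption is one mask bit per SIMD element, with element 0 in bit 0 and a set bit meaning "this element does real work".

```python
def tail_mask(vector_len, simd_width):
    """Pre-mask: enable only the first vector_len lanes of the SIMD unit."""
    return (1 << min(vector_len, simd_width)) - 1


def predicate_to_cancel(predicate, simd_width):
    """Once the predicate is known, lanes with a 0 bit must be cancelled."""
    return ~predicate & ((1 << simd_width) - 1)


def apply_cancel(active_mask, cancel_mask):
    """Clear every active lane whose bit is set in the cancel mask."""
    return active_mask & ~cancel_mask


if __name__ == "__main__":
    # 3 elements on a 4-wide back-end: the last lane is masked out
    active = tail_mask(3, 4)
    # speculative work was issued; the predicate read later says 0b0101
    cancel = predicate_to_cancel(0b0101, 4)
    print(bin(active), bin(cancel), bin(apply_cancel(active, cancel)))
```

Both uses go through the same AND-with-inverted-mask step, which is why one mask mechanism can serve predication, cancellation and vector tail handling.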
damn. first experiment to blur the lines between data and control: fail.

data is contained in the "Stage" API, and is passed through with a call to nmoperator.eq, and is otherwise untouched and completely opaque to the "pipeline" part that "handles" control and routing.

i intended to add cancel and mask to the *data*, and to break the above rule by allowing the *control* side to inspect the data.

the problem comes when trying to stop CMOS gate-flipping, as we talked about a few days ago, jacob. with the data being both passed across through registers *and* now effectively containing a "control" signal (the mask), the only way to get the data to discern when data should NOT be passed from pipeline register to pipeline register is to MODIFY that data, EXCLUDING sending the "control" signal (the mask).

that would cause such a coding mess that i don't even want to try it, especially when doing so breaks the separation between data and control/routing in the first place.

so instead i will look at how to add the "mask" and "cancel" signals to the *handling* side (PrevControl and NextControl), where, really, it should have been in the first place.
(In reply to Luke Kenneth Casson Leighton from comment #6)
> so instead i will look at how to add the "mask" and "cancel" signals
> to the *handling* side (PrevControl and NextControl), where, really,
> it should have been in the first place.

Sounds good to me.
(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > so instead i will look at how to add the "mask" and "cancel" signals
> > to the *handling* side (PrevControl and NextControl), where, really,
> > it should have been in the first place.
>
> Sounds good to me.

okaay, it's a bit of a mess (the code in nmutil/iocontrol.py), however in the latest unit test, cancellation seems to work.

i broke the valid/ready rules, somewhat: the "mask", if non-zero, basically takes over. in effect, it turns the pipeline into one of dan gisselquist's simpler types: the "global CE" one. so it's critically important *not* to try to set the "recv not ready" signal, because the data *will* be sent, regardless.

the masks (basically predication bits) are "Cat'd" together as they are funneled into the Multi-In Mux pipeline, so that none of them are lost. they're then funneled *back out* again at the other end (Multi-Out fan-out stage).

i allowed the option in the API to specify multi-bit masks, so that we can have SIMD element cancellation as well.

hmm, just fixed that: i'd missed that if something is cancelled it shouldn't continue to be passed down the chain! whoops.
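The "Cat'd together, then funneled back out" step can be sketched with plain Python integers. `cat_masks` and `split_masks` are hypothetical helpers, not functions from nmutil. The sketch assumes every input port has the same mask width and that port 0 lands in the least significant bits, which is how nmigen's `Cat` orders its arguments.

```python
def cat_masks(masks, width):
    """Fan-in: pack per-port masks into one wide mask, port 0 in the LSBs."""
    field = (1 << width) - 1
    combined = 0
    for port, mask in enumerate(masks):
        combined |= (mask & field) << (port * width)
    return combined


def split_masks(combined, num_ports, width):
    """Fan-out: recover each port's mask from the wide mask."""
    field = (1 << width) - 1
    return [(combined >> (port * width)) & field for port in range(num_ports)]


if __name__ == "__main__":
    # three ports, 2-bit masks each (multi-bit masks allow per-element
    # SIMD cancellation within one port)
    per_port = [0b01, 0b11, 0b10]
    wide = cat_masks(per_port, width=2)
    print(bin(wide), split_masks(wide, num_ports=3, width=2))
```

Since each port owns a fixed bit-field of the wide mask, no predication bit is lost on the way in, and the fan-out stage can hand each port back exactly the bits it contributed.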
https://git.libre-riscv.org/?p=ieee754fpu.git;a=commit;h=65e4f1069622b79a5782ef0754610c8d309439f0 unit test success on FPDIV! so that's a cancellable mask-capable pipeline.
i just realised that if using MaskCancellable for early-in, early-out and pipe-feedback, it could stall, as there would be circumstances where incoming data needs priority-routing (muxing) with the feedback data. that in turn means that MaskCancellable has to properly respect the ready/valid Data IO Handling API. drat. so i've changed MaskCancellable to be based fully on SimpleHandshake. it's necessary to be careful there because it's based on a combinatorial chain of ready/valid signalling. unit tests pass.