a straight in-order core is needed, with standard Power ISA 3.0 support, single-issue. very simple: no register renaming, no optimisations, no run-ahead speculation, no unwinding or cancellation. if an instruction cannot complete 100% then the pipeline stalls.

three? four? stages:

* fetch
  Stage API input: start (DMI), stop (DMI), step (DMI), pc (DMI)
* decode
  Stage API input: CoreState, raw_insn_i, bigendian_i
* issue
  Stage API input, for core:

    self.maindecoder: Decode2ToExecute1Type("core", opkls=IssuerDecode2ToOperand)
    self.state = CoreState("core")
    self.raw_insn_i = Signal(32)  # raw instruction
    self.bigendian_i = Signal()   # bigendian - TODO, set by MSR.BE

testing:

* python3 simple/test/test_issuer.py nosvp64 --allow-overlap hazard
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD

core.py can likely be used as-is with no modifications, as long as each instruction is stalled at the issue phase itself until the register "write outstanding" bitvector has all the relevant bits clear.

process:

* decode detects the register numbers for write and read, and creates a unary bitvector mask for each.
  - add r5, r7, r3 would create:
    - read vector of  0b0000000010001000 (r7, r3)
    - write vector of 0b0000000000100000 (r5)
* the GLOBAL write vector is ANDed with the instruction's READ vector
  - this detects any RaW (Read after Write) hazards
  - if the AND has any bits set, the entire processor MUST stall
    - this instruction and all others must NOT be issued
  - if it is clear (on this or a future cycle) the instruction may proceed
* when the instruction proceeds, its WRITE vector must be ORed into the GLOBAL write vector
* when an instruction completes, the bits that it set in the GLOBAL write vector must be cleared.
  - this can be done by hooking into (snooping) the write port of the regfile (PC update is done this way)

two things to note:

* be careful when an instruction both reads and writes the same reg. the regfiles all have "operand forwarding" so it is technically possible to achieve, but watch out for the bitvector: the instruction must not stall on its own write.
* cancellation (exceptions) has to be thought through: anything still in the pipeline that was issued BEFORE the exception point has to be allowed to finish. strictly speaking, if an exception could occur then no further instruction should be issued until the possibility of an exception has passed (i.e. stall).
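the hazard-vector process above, as a minimal behavioural sketch in plain Python (not nmigen). reg_mask and HazardVector are hypothetical names for illustration only; the real thing would be Signals inside core.py / Regfiles.

```python
# Behavioural sketch (plain Python, not nmigen) of the global
# "write outstanding" hazard vector. Names are hypothetical.

def reg_mask(*regs):
    """Convert binary register numbers into a unary bitvector mask."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


class HazardVector:
    def __init__(self):
        self.global_wr = 0  # one bit per register with a write outstanding

    def can_issue(self, rd_mask):
        # RaW check: is any register this instruction reads still pending?
        return (self.global_wr & rd_mask) == 0

    def issue(self, wr_mask):
        # instruction proceeds: OR its write vector into the global vector
        self.global_wr |= wr_mask

    def complete(self, wr_mask):
        # writeback snooped on the regfile write port: clear the bits it set
        self.global_wr &= ~wr_mask


# add r5, r7, r3: writes r5, reads r7 and r3
rd = reg_mask(7, 3)
wr = reg_mask(5)
print(f"read vector  0b{rd:016b}")  # 0b0000000010001000
print(f"write vector 0b{wr:016b}")  # 0b0000000000100000

hv = HazardVector()
hv.issue(wr)
# a following "add r6, r5, r4" reads r5, so it must stall
print(hv.can_issue(reg_mask(5, 4)))  # False
hv.complete(wr)
print(hv.can_issue(reg_mask(5, 4)))  # True
```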
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD

regfiles are created either binary-addressed or unary-addressed, core.py shows how:

    # select the required read port.  these are pre-defined sizes
    rfile = regs.rf[regfile.lower()]
    ...
    if rfile.unary:
        rens.append(addr_en)
    else:
        addrs.append(addr_en)
        rens.append(rp)

therefore, for creating a bitvector class, the Regfiles.rf dict can be enumerated, and a series of setter/getter ports added. unary regfile numbers (CRs) can be used as-is, but binary ones (RA, RB) must be converted to unary: 1<<RA.

funnily enough, a good class to use for that would be... a unary-addressable regfile! probably RegFileArray(1,32): 32 regs but 1 bit per reg.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfile.py;h=c3f33393bde72951b27aa72664795c572913a7d0;hb=HEAD#l103

there will only be one bit per "reg", but enough ports need to be added so that issue can read all it needs and write all bits from all operands simultaneously. by that i mean:

* for LD with update there are 2 read and 2 write regs (GPRs)
* ST with update: 3 read, 1 write
* mul-accumulate: 3 read, 1 write

therefore the INT reg bitvector "regfile" will need to be 2W but only 1R, because although you want to be able to write 2 bits simultaneously (LD-with-update), only the Issuer will be reading the global vector. similar analysis applies for the CR file, etc. etc.

there may be "better" ways to do this, but this is reasonably quick and easy.
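a sketch of the 1-bit-per-register "regfile" idea, 2W/1R for the INT file, in plain Python. BitVectorRegfile and binary_to_unary are hypothetical stand-ins (the real one would be a RegFileArray in nmigen); the point is that both LD-with-update write bits land in the same cycle, and the binary-to-unary conversion is just 1<<RA.

```python
# Sketch: a 1-bit-per-register "regfile" used as a hazard bitvector.
# Hypothetical names; not the soc RegFileArray API.

class BitVectorRegfile:
    def __init__(self, depth, n_write_ports):
        self.depth = depth
        self.n_write_ports = n_write_ports
        self.bits = 0

    def write(self, ports):
        """ports: list of (unary_mask, value) pairs, one per write
        port, all applied in the same cycle."""
        assert len(ports) <= self.n_write_ports, "not enough write ports"
        for mask, value in ports:
            assert mask < (1 << self.depth)
            if value:
                self.bits |= mask
            else:
                self.bits &= ~mask

    def read(self):
        # single read port: only the Issuer reads the whole vector
        return self.bits


def binary_to_unary(regnum):
    """binary-addressed register numbers (RA, RB, RT) become 1 << regnum"""
    return 1 << regnum


# INT hazard vector: 32 regs, 2W (LD-with-update writes RT and RA), 1R
intvec = BitVectorRegfile(depth=32, n_write_ports=2)
# ldu r3, 8(r4): writes RT=3 and RA=4 in the same cycle
intvec.write([(binary_to_unary(3), 1), (binary_to_unary(4), 1)])
print(bin(intvec.read()))  # 0b11000

# CR field masks are already unary: used as-is (one write port here)
crvec = BitVectorRegfile(depth=8, n_write_ports=1)
crvec.write([(0b00000001, 1)])
```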
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_outmux_pipe.py;h=d94b6394b55519d3a1f8383f24d3abde4c6e9539;hb=HEAD#l125

here is an example of how to use MultiOutPipe, although it may not necessarily be appropriate because of the different data sets (different subsets of decoding for different pipelines). it *might* still work though. the muxid would be set to the fan-out pipeline number, where a map would be needed: muxid 0 => ALU, muxid 1 => Logical, etc.

although... looking now at Core connect_instruction, i think it takes care of the fan-out perfectly well without needing MultiOutPipe:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD#l193

so i think it is mostly fine instead to have a simple linear pipeline:

* fetch
* decode
* issue to Core, which handles the fan-out

the stall condition would be that the issue to core would not send "ready_o" if it was detected that the Global Hazard Vector had a bit set.

each pipeline stage can be set up as a StageAPI instance, but due to the stalling you have to have functions which say whether the data is ready to be accepted:

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_buf_pipe.py;h=e738657a08c20fce0b97f8d87be3c21b22fa88be;hb=HEAD#l721

    @property
    def d_ready(self):
        """ data is ready to be accepted when this is true
        """
        return (self.count == 1) # | (self.count == 3)
        # return Const(1)  # alternative: always ready

    def d_valid(self, i_ready):
        """ data is valid at output when this is true
        """
        return self.count == self.valid_trigger
        # return Const(1)  # alternative: always valid
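a minimal cycle-level sketch of the simple linear fetch -> decode -> issue pipeline, in plain Python rather than nmigen. run, hazard_at and the one-instruction-per-stage latches are hypothetical. issue withholds its ready_o while a hazard is flagged, and the stall back-propagates because each stage only accepts new data when it is empty or its contents move on in the same cycle.

```python
# Cycle-level sketch of fetch -> decode -> issue with ready/valid
# back-pressure. Plain Python, hypothetical names throughout.

def run(program, hazard_at, cycles):
    """program: list of instruction names.
    hazard_at: dict {cycle: True} for cycles where issue must stall.
    Returns list of (cycle, insn) for each instruction issued to core."""
    fetch_q = list(program)
    decode_reg = None  # instruction latched in the decode stage
    issue_reg = None   # instruction latched in the issue stage
    issued = []
    for cycle in range(cycles):
        # issue -> core: ready_o is withheld while the hazard is flagged
        issue_ready_o = not hazard_at.get(cycle, False)
        if issue_reg is not None and issue_ready_o:
            issued.append((cycle, issue_reg))
            issue_reg = None
        # decode -> issue: only transfers if issue is (now) empty
        if decode_reg is not None and issue_reg is None:
            issue_reg, decode_reg = decode_reg, None
        # fetch -> decode: only transfers if decode is (now) empty
        if fetch_q and decode_reg is None:
            decode_reg = fetch_q.pop(0)
    return issued


print(run(["add", "mul", "sub"], {}, 6))
# [(2, 'add'), (3, 'mul'), (4, 'sub')]
print(run(["add", "mul", "sub"], {3: True}, 6))
# [(2, 'add'), (4, 'mul'), (5, 'sub')]
```

the one-cycle hazard at cycle 3 delays mul, and sub behind it, by exactly one cycle: nothing is cancelled, everything simply waits.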
continuing the investigation: the MultiCompUnits in soc/fu/compunits.py need to be concurrent, using ReservationStations.

* ReservationStations needs to be enhanced so that "fake" ALU instances are created, which connect self.n[i] and self.p[i] into self.fakealus[i]
* the "fake" ALUs each need to be passed to a MultiCompUnit, in order to present multiple access points to the same underlying pipeline
* also their index (muxid) has to go into the fake ALU
* back in compunits.py, each "fake" ALU is given a name (alu0, alu1, mul0, mul1, mul2 etc.) to give the impression of having multiple concurrent units
* in core.py the selector (enable signal) is masked out by the FU busy signals, to ensure that already-allocated CompUnits do not try to allocate twice.

therefore there is:

* one actual (real) ALU
* one ReservationStations instance per ALU
* multiple named "fake" ALUs (one per ReservationStations row)
* the appearance of more FunctionUnits than actually exist
* but all results are "managed".
    class ReservationStations(Elaboratable):
        ...
        def __init__(self, num_rows, maskwid=0, feedback_width=None):
            self.num_rows = nr = num_rows
            self.feedback_width = feedback_width
            self.inpipe = InMuxPipe(nr, self.i_specfn, maskwid)   # fan-in
            self.outpipe = MuxOutPipe(nr, self.o_specfn, maskwid) # fan-out

            self.p = self.inpipe.p  # kinda annoying,
            self.n = self.outpipe.n # use pipe in/out as this class in/out

        def set_alu(self, alu):
            self.alu, self.pseudoalus = alu, []
            for i in range(self.num_rows):
                self.pseudoalus.append(PseudoALU(alu, self.p[i], self.n[i]))

    class PseudoALU:
        def __init__(self, alu, p, n):
            self.alu = alu
            self.p = p
            self.n = n

here, now, when a ReservationStations class is used, the actual pipelined ALU will be presented as if it has simultaneous (concurrent) access ports, but in reality the fan-in / fan-out of the RS class only allows one request to be picked at a time. importantly, then, all requests are "managed" (monitored) from start to finish. if num_rows=1 this is the "Single" situation right now with TestIssuer.

the length of each pipeline must be known, because there must be at least as many RS rows as there are pipeline stages. if this is too many then FSMs must be used instead.
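a plain-Python sketch (not the nmutil classes) of what the RS fan-in / fan-out does: several PseudoALU rows share one real pipelined ALU, the fan-in picks only one pending request per clock, and the muxid carried through the pipeline routes each result back to its own row. ReservationStationsModel, the lowest-muxid-wins pick order and the deque-based pipeline are all assumptions for illustration.

```python
from collections import deque


class PseudoALU:
    """One "fake" ALU access point (RS row); holds at most one request."""
    def __init__(self, muxid):
        self.muxid = muxid
        self.pending = None  # operand waiting to enter the real ALU
        self.busy = False    # accepted, result not yet returned
        self.result = None


class ReservationStationsModel:
    def __init__(self, num_rows, alu_fn, depth):
        self.rows = [PseudoALU(i) for i in range(num_rows)]
        self.alu_fn = alu_fn
        # the ONE real pipelined ALU: `depth` stages of (muxid, value)/None
        self.pipe = deque([None] * depth, maxlen=depth)

    def submit(self, muxid, operand):
        row = self.rows[muxid]
        assert not row.busy, "row already allocated"
        row.pending, row.busy = operand, True

    def tick(self):
        # fan-out: the result leaving the last stage goes back to its row
        out = self.pipe[-1]
        if out is not None:
            muxid, value = out
            self.rows[muxid].result = value
            self.rows[muxid].busy = False
        # fan-in: only ONE pending request enters per clock
        picked = None
        for row in self.rows:
            if row.pending is not None:
                picked = (row.muxid, self.alu_fn(row.pending))
                row.pending = None
                break
        # maxlen makes appendleft drop the stage that was just emitted
        self.pipe.appendleft(picked)


# num_rows == pipeline depth: all three requests are in flight at once
rs = ReservationStationsModel(num_rows=3, alu_fn=lambda x: x * 2, depth=3)
for i, v in enumerate([1, 2, 3]):
    rs.submit(i, v)
for _ in range(6):
    rs.tick()
print([r.result for r in rs.rows])  # [2, 4, 6]
```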
I think I'd prefer to start from a clean (empty) InOrderIssuer.py, and copy things over from TestIssuer.py as needed. Another way would be to make a full copy, and try to morph it incrementally, one FSM at a time, while making sure the tests still pass. What do you think?
(In reply to Cesar Strauss from comment #6)
> I think I'd prefer to start from a clean (empty) InOrderIssuer.py, and copy
> things over from TestIssuer.py as needed. Another way would be to make a
> full copy, and try to morph it incrementally, one FSM at a time, while
> making sure the tests still pass. What do you think?

well, the FSMs were not designed for single-clock operation: in most cases they were designed for multi-clock operation. much of the "setup" though (interfaces, in/out, PowerDecoder2, core, DMI interface) will remain exactly the same (it has to, really).

honestly, though, because of using the StageAPI, it may be simpler just to write it from scratch, even starting from not actually having a core at all: just fetch, issue (ignore it), increment the PC.

really there should not be more than 450 lines of code (even including DMI, DEC and TB), so by cutting out the core entirely (pretending the instruction has been executed with a "fake" ALU, or not even looking at the MultiCompUnit signals) it should be quite straightforward.

btw i have started on compunits adding ReservationStations, only to find that the Mux-In and Mux-Out classes are hopeless :) they do not meet the ready/valid API properly, and i think i know why: the Mux-Out needs a "busy" flag per fan-out output.
I wrote a superscalar branch predictor that handles next-pc logic, it could probably be adapted for the in-order cpu by setting the issue width to 4-bytes (or 8-bytes if we want to handle any openpower v3.1 instructions). It has ready/valid inputs/outputs so should be pretty easy to integrate.

BranchPredictor class:
https://salsa.debian.org/Kazan-team/reg_rename_demo_cpu/-/blob/9c8690b116b0ad775aaeef066760a29ca6e60979/reg_rename_demo_cpu/fetch.py#L349

tests:
https://salsa.debian.org/Kazan-team/reg_rename_demo_cpu/-/blob/9c8690b116b0ad775aaeef066760a29ca6e60979/reg_rename_demo_cpu/test_fetch.py#L76

I wrote it as part of an attempt to show that a simple superscalar OoO register-renaming cpu wouldn't take a month to write.
(In reply to Jacob Lifshay from comment #8)
> I wrote a superscalar branch predictor that handles next-pc logic,

a branch predictor is a non-essential task. anything that requires any kind of cancellation (branch prediction being one such) is completely off the table: stall is the sole exclusive option until such time as code execution is successful.

there is insufficient time to do otherwise.

> it could
> probably be adapted for the in-order cpu by setting the issue width to
> 4-bytes (or 8-bytes if we want to handle any openpower v3.1 instructions).

no, those are off the table, too. there is insufficient time.

> It has ready/valid inputs/outputs so should be pretty easy to integrate.

doesn't help if there is no InOrder Issuer to integrate into. please keep this bugreport focussed on getting the required bare minimum features completed as quickly as possible.

> I wrote it as part of an attempt to show that a simple superscalar OoO
> register-renaming cpu wouldn't take a month to write.

the focus of this bugreport has nothing to do with OoO or register renaming, please keep it focussed exclusively on the set task so as not to cause distraction.
(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #8)
> > I wrote a superscalar branch predictor that handles next-pc logic,
>
> a branch predictor is a non-essential task. anything that requires any
> kind of cancellation (branch prediction being one such) is completely
> off the table: stall is the sole exclusive option until such time as
> code execution is successful.

we could have the fetch pipe fetch ahead...execution could still always stall rather than speculating -- all that would happen is it will have fetched the correct target instead of always obliviously fetching the instructions immediately following the branch so a branch doesn't always require flushing the entire fetch pipe... that will speed the whole cpu up by probably 50%.

>
> there is insufficient time to do otherwise.

it isn't that complex imho... just an idea that we can use (or not).

> > I wrote it as part of an attempt to show that a simple superscalar OoO
> > register-renaming cpu wouldn't take a month to write.
>
> the focus of this bugreport has nothing to do with OoO or
> register renaming, please keep it focussed exclusively on
> the set task so as not to cause distraction.

yup...just explaining where the code comes from.
(In reply to Jacob Lifshay from comment #10)
> we could have the fetch pipe fetch ahead...execution could still always
> stall rather than speculating -- all that would happen is it will have
> fetched the correct target instead of always obliviously fetching the
> instructions immediately following the branch so a branch doesn't always
> require flushing the entire fetch pipe... that will speed the whole cpu up
> by probably 50%.

this is an optimisation, and it is scope-creep. suggesting optimisations is the absolute worst possible thing to do on a project that is under time pressure, particularly when no code implementing the specified non-optimised design even exists yet.

i have asked you multiple times to focus and not introduce distractions. please stop recommending, raising, or discussing optimisations.

if there was no time pressure under contract it would be perfectly fine. as we are under contract and under time pressure it is not fine and never will be fine.

please listen and keep this bugreport focussed. no more ideas that could jeopardise the contract: only action and absolute focus on completion of the contract.

are we absolutely clear?
commit 45d50392c8fd4a70867ef82db0bc2f366b408cee (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun Nov 7 15:40:17 2021 +0000

    add hazard vectors to Regfiles

    the reason for adding it to Regfiles is because both In-Order and OoO
    need global hazard vectors. in the case of In-Order the hazard vector
    bits are set directly by the Issue Engine. in the case of Out-of-Order
    the vector bits are set by way of an amalgamation (Great Big Or Gate)
    of the columns from the DMs. in either case the vectors are needed,
    so might as well be added to Regfiles
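for the Out-of-Order case, the "Great Big Or Gate" over the Dependency Matrix columns can be sketched in plain Python. the DM layout here (dm[fu][reg] == 1 meaning FU has a write pending to reg) is an assumption for illustration, not the scoreboard6600 data structure.

```python
# Sketch: global write-hazard vector for the OoO case, formed by OR-ing
# each register column of a FU-Regs Dependency Matrix across all FUs.
# Hypothetical layout: dm[fu][reg] == 1 means FU `fu` will write `reg`.

def global_write_vector(dm):
    vec = 0
    for fu_row in dm:  # one row per Function Unit
        for reg, bit in enumerate(fu_row):
            if bit:
                vec |= 1 << reg  # column `reg` ORed across all FUs
    return vec


dm = [
    [0, 0, 1, 0],  # FU0 writing r2
    [0, 0, 0, 0],  # FU1 idle
    [1, 0, 1, 0],  # FU2 writing r0 and r2
]
print(f"0b{global_write_vector(dm):04b}")  # 0b0101
```

in the In-Order case the same vector is simply set bit-by-bit by the Issue Engine instead, which is why it makes sense for it to live in Regfiles.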
https://bugs.libre-soc.org/show_bug.cgi?id=742#c5

ReservationStations2 is now working, and has been tested with num_rows=1 (and is in active use already).

something to watch out for, in core.py:

    # enable the required Function Unit based on the opcode decode
    # note: this *only* works correctly for simple core when one and
    # *only* one FU is allocated per instruction
    for funame, fu in fus.items():
        fnunit = fu.fnunit.value
        enable = Signal(name="en_%s" % funame, reset_less=True)
        comb += enable.eq((self.e.do.fn_unit & fnunit).bool())
        comb += fu_bitdict[funame].eq(enable)

therefore it is not yet safe to enable num_rows > 1.

it is perfectly fine to set num_rows=1 and have InOrder overlapping instructions: that can go ahead (right now), there should be no problem. an add instruction can quite safely overlap with a mul instruction, which can quite safely overlap with a DIV instruction, etc. etc.

the problem will come if num_rows is set greater than one: this will result in more than one "en_{funame}" being set, and the instruction will be sent to *more than one* Function Unit.

to solve that, a PriorityPicker is needed (just like exists in the CDC6600 and therefore also in the scoreboard6600 experimental code) but that is for later: right now there is nothing stopping the InOrder core from being written (just with all num_rows=1).
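what the PriorityPicker has to do, reduced to a plain-Python sketch: out of all candidate enables (not-busy FUs of the right type), keep exactly one bit set. giving the lowest-numbered FU priority is an assumption here; x & -x is the classic two's-complement trick for isolating the lowest set bit.

```python
# Sketch of a priority picker: given a bitvector of candidate enables,
# return a one-hot vector keeping only the lowest-numbered set bit
# (lowest-index-wins is an assumed priority order).

def priority_pick(enables):
    return enables & -enables


# decode allows alu0, alu1, alu2 (bits 0-2), but alu0 is busy,
# so only alu1 and alu2 are candidates
fu_busy = 0b001
candidates = 0b111 & ~fu_busy
print(f"0b{priority_pick(candidates):03b}")  # 0b010
```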
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/fsm.py;h=1b22ca6f3f145f58e547451f496106e07bcc188d;hb=809cf2faa4450901779045cfaa89e69f70ed9f42#l133

cesar, here is an example of an FSM-based pipeline which sets the ready/valid signalling directly, itself. you need to:

* derive from ControlBase
* add p.i_data manually
* add n.o_data manually
* set p.o_ready only when ready to accept input (no stall)
* always accept one data item per clock when p.i_valid & p.o_ready is true
* only send on data when n.i_ready is HI
* always set n.o_valid at the exact same time as n.o_data is valid

that's about it. quite easy.

generally though it is better to use the API properly, by setting up a completely separate "stage" module which is passed in to ControlBase, or for the class to pass itself in (stage=self).

hmm, i just realised that the DIV FSM is actually completely ignoring the Stage API, i will update it.
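the ready/valid rules listed above, applied to a multi-cycle FSM stage such as DIV, as a cycle-level plain-Python sketch. FSMStageModel is hypothetical: the attribute names mirror p.o_ready / n.o_valid / n.i_ready for readability but this is not nmutil code, and computing the result at accept time (then hiding it until o_valid) is a simplification.

```python
class FSMStageModel:
    """Cycle model of a multi-cycle FSM stage obeying ready/valid rules."""

    def __init__(self, latency, fn):
        assert latency >= 1
        self.latency = latency  # clocks needed to produce a result
        self.fn = fn            # the computation, e.g. a divide
        self.empty = True       # no datum held
        self.count = 0          # clocks left until the result is ready
        self.o_data = None      # n.o_data
        self.o_valid = False    # n.o_valid

    @property
    def o_ready(self):
        # p.o_ready: only accept new input when not busy (empty)
        return self.empty

    def clock(self, i_valid, i_data, n_i_ready):
        """Advance one clock. Returns the datum passed on, or None."""
        out = None
        if self.o_valid:
            if n_i_ready:  # only send on data when n.i_ready is HI
                out = self.o_data
                self.o_valid, self.o_data, self.empty = False, None, True
        elif not self.empty:
            self.count -= 1
            if self.count == 0:
                self.o_valid = True  # n.o_valid rises with n.o_data
        # accept at most one datum per clock, when p.i_valid & p.o_ready
        if i_valid and self.o_ready:
            self.empty = False
            self.count = self.latency
            self.o_data = self.fn(i_data)  # held until o_valid rises
        return out


s = FSMStageModel(latency=3, fn=lambda x: x + 1)
for cyc in range(7):
    # one input at cycle 0; downstream not ready at cycle 4
    r = s.clock(i_valid=(cyc == 0), i_data=10, n_i_ready=(cyc != 4))
    if r is not None:
        print(cyc, r)  # 5 11
```

the result becomes valid at the end of cycle 3, is held through cycle 4 because n.i_ready is low, and is handed on in cycle 5.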
done. hilariously, there's a severe (catastrophic) bug in python 3.7 which required a workaround (addition of a setup() function) to prevent a (catastrophic) coredump by /usr/bin/python3.7 due to a null pointer de-reference. ignoring that...

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=b50d83f14e83aae30f52aedd030e385889a50867

 class FSMDivCoreStage(ControlBase):
     def __init__(self, pspec):
-        super().__init__()
-        self.pspec = pspec
-        self.p.i_data = CoreInputData(pspec)
-        self.n.o_data = CoreOutputData(pspec)
-        self.saved_input_data = CoreInputData(pspec)
+        self.pspec = pspec # store now: used in ispec and ospec
+        super().__init__(stage=self)
+        self.saved_input_data = self.ispec()
         self.empty = Signal(reset=1)
         self.saved_state = DivState(64, name="saved_state")
         self.div_state_next = DivStateNext(64)
         self.div_state_init = DivStateInit(64)
         self.divisor = Signal(unsigned(64))
+    def ispec(self):
+        return CoreInputData(self.pspec)
+
+    def ospec(self):
+        return CoreOutputData(self.pspec)
 ...

you can see how, rather than self.p.i_data and self.n.o_data being explicitly set, the Stage API is supposed to be used: setting i_data and o_data is the Stage API's job (inherited by ControlBase).
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=7036fbf292ed8a5bc8393c3c95e15a28870ee325 i now have a suite of PriorityPickers at the front of Core issue. this means that only one FU will ever get picked (even if there are multiple ReservationStations). this can actually be tested even with TestIssuer by issuing a DIV instruction followed by (a few) add instructions. the DIV will still be running (64+ cycles) whilst the ADD gets round to being loaded and issued. of course if the DIV instruction tries to use the same registers as the ADD it will get the wrong answer, but it will be a way to test whether overlapping instructions can work at all.
Cesar, i am slowly morphing core so it can have regfile hazard vectors (which you won't need to start with as long as running test instructions that avoid hazards) https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=103d9306983e8782c930590fe58af2fc960ee216 this puts the requested port names into a pair of dictionaries. i can then, for the "Hazards" regfiles, request the exact same names, which will make it much easier in core.py to create a matching "bit-setter" / "bit-clearer" system.
the write protection hazard vector is now in place but is not in use. further work is needed to cover the situation where Data.ok is not set by the ALU but wrflags was set. this is reasonably straightforward to do.

after that, the bitvector read is ready to try out, and that can be done with TestIssuer by using a DIV and an ADD instruction: DIV takes many more cycles to complete than the TestIssuer FSM takes to issue the next instruction, so it is perfect to try.
cesar i added a FetchOutput data structure, for the fetch pipeline ospec():

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=f1c63229cdf94d9fbca54086ff13d6a149245814

    class FetchOutput:
        def __init__(self): #, svp64_en):
            self.state = CoreState("core_fetched")
            self.raw_insn_i = Signal(32) # one raw instruction
            self.bigendian_i = Signal()  # bigendian - TODO, set by MSR.BE
commit 7492a3533c61a6999d36df687c8d5e6e3603b0d6 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Nov 17 18:01:27 2021 +0000

    reading of regfile bitvector added, which activates on a per-FU
    basis at the regfile read port

    this is somewhat complete overkill because strictly speaking the
    read should be done at issue time. fortunately, merging lots of
    ORs results in exactly the same thing, just distributed.

    it is horribly inefficient though: to make use of pre-existing
    for-loops and data structures, decode_regfile_read() at each
    (distributed) point can raise its flag, "is the main decoder
    requesting that this register be read".

    actually what is needed is: back in the connect_instruction()
    function, all available decode_regfile_read() functions are called
    there and then, only the once, for each register.

    looking at decode_regfile_read() there should, strictly speaking,
    be 17 separate bitvector requests/merges. at present there are *30*
    because MUL requests RA, ALU requests RA, etc. etc.
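the "call it only once per register" idea from the commit message, as a plain-Python sketch: duplicate per-FU read requests for the same regfile port (MUL asking for RA, ALU asking for RA) are ORed into one request each. merge_read_requests and the tuple layout are hypothetical, not the soc decode_regfile_read() signature.

```python
# Sketch: OR-merge duplicate per-FU read requests into one request per
# (regfile, port). Hypothetical data layout, for illustration only.
from collections import defaultdict


def merge_read_requests(requests):
    """requests: iterable of (fu_name, regfile, port, active) tuples.
    Returns {(regfile, port): active} with duplicates ORed together."""
    merged = defaultdict(bool)
    for _fu, regfile, port, active in requests:
        merged[(regfile, port)] |= active
    return dict(merged)


requests = [
    ("alu0", "INT", "ra", True),
    ("mul0", "INT", "ra", False),
    ("alu0", "INT", "rb", True),
    ("mul0", "INT", "rb", False),
    ("alu0", "CR", "cr_a", False),
]
print(merge_read_requests(requests))
# {('INT', 'ra'): True, ('INT', 'rb'): True, ('CR', 'cr_a'): False}
```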
I was thinking about the "core stopped" signal for DMI. I think it could be generated by the AND of all the "o_ready" signals of the reservation stations. When all reservation stations are ready, it means they are not processing any instructions, and the core is stopped. Another way would be to maintain a (small) counter of issued instructions, and another for retired instructions. When they match, there is no instruction in flight, and the core is stopped.
(In reply to Cesar Strauss from comment #21)
> I was thinking about the "core stopped" signal for DMI.
>
> I think it could be generated by the AND of all the "o_ready" signals of the
> reservation stations. When all reservation stations are ready, it means they
> are not processing any instructions, and the core is stopped.

ah no, it also has to include the DMI state of the user request. as in: yes, you are right, stopped can be generated that way, but ONLY when requested to do so.

>
> Another way would be to maintain a (small) counter of issued instructions,
> and another for retired instructions. When they match, there is no
> instruction in flight, and the core is stopped.

there is already an FU busy signal per FU, so a counter is not needed. an OR of all FU Busy signals is already how the FSM "busy" is generated right now.
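the resulting DMI "core stopped" condition, sketched in plain Python: stopped only when the user has requested a stop AND the OR of the per-FU busy signals is clear. the names stop_requested and fu_busy are hypothetical.

```python
# Hypothetical sketch of the DMI "core stopped" condition.

def core_stopped(stop_requested, fu_busy):
    """stop_requested: the DMI user has asked the core to stop.
    fu_busy: per-FU busy flags; their OR is the existing "busy"."""
    busy = any(fu_busy)
    # stopped only when requested AND nothing is still in flight
    return stop_requested and not busy


print(core_stopped(True, [False, True, False]))    # False: an FU is busy
print(core_stopped(True, [False, False, False]))   # True
print(core_stopped(False, [False, False, False]))  # False: not requested
```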
the integration of bitvectors went perfectly, right up to the moment when a combinatorial loop was detected. the loop is as follows:

* the bitvector regfiles were set to use a combinatorial operand-forwarding variant.
* this was so that set/clear of writes could be detected immediately by the read hazard, in the *current* cycle
* this was so that issue could be held until the instruction was ready

unfortunately the actual setting/clearing is gated by FU issue/busy, which is also combinatorial. the read of the bitvector therefore depends on that setting/clearing, and the read in turn gates issue: hence there's a loop.

the solution, which occurred to me just now, is this:

* make the bitvector regfiles sync (one-clock delay)
* request the read of the bitvector at issue time (bear in mind it will arrive 1 clock late, but that's ok because...)
* once issue is set, only on the *next* clock will the read-requests be set, at which point...
* the bitvector hazard read will also arrive, and...
* if any bit is set, the read-requests can be gated out (prevented)

the irony is that this is *exactly*, and i do mean exactly, the job of a FU-Regs Dependency Matrix.
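the sync-bitvector solution as a clock-by-clock plain-Python sketch. SyncBitvector and read_go_allowed are hypothetical names. the key property is that the read data seen on a given clock was sampled from the previous clock's registered state, so it can never depend combinatorially on the same clock's set/clear, which is what breaks the loop.

```python
class SyncBitvector:
    """Hazard bitvector with a one-clock (sync) read port."""
    def __init__(self):
        self.bits = 0     # registered hazard state, one bit per register
        self.rd_data = 0  # read-port output, valid one clock after request

    def clock(self, rd_mask, set_mask, clr_mask):
        # sync read: sampled from the state BEFORE this clock's set/clear
        self.rd_data = self.bits & rd_mask
        self.bits = (self.bits | set_mask) & ~clr_mask


def read_go_allowed(rd_data):
    # on the clock after issue: gate out the read-requests if even
    # one hazard bit came back for this port
    return rd_data == 0


bv = SyncBitvector()
src = (1 << 5) | (1 << 4)
# cycle 0: "add r5, ..." issues and sets its write-hazard bit
bv.clock(rd_mask=0, set_mask=1 << 5, clr_mask=0)
# cycle 1: "add r6, r5, r4" issues, requesting a hazard read of r5, r4
bv.clock(rd_mask=src, set_mask=0, clr_mask=0)
# hazard data has now arrived, so the read-requests stay gated
print(read_go_allowed(bv.rd_data))  # False
# the first add writes back: r5 is cleared (visible one clock later)
bv.clock(rd_mask=src, set_mask=0, clr_mask=1 << 5)
bv.clock(rd_mask=src, set_mask=0, clr_mask=0)
print(read_go_allowed(bv.rd_data))  # True
```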
(In reply to Luke Kenneth Casson Leighton from comment #23)
> * if any bit is set, the read-requests can be gated out (prevented)

success, by gating the rd.go.rel_o if there is even one read hazard for the port. this is quite draconian (excessive) but does the job.

now a process of checking for errors can begin: this will get complicated quite quickly.
Current plan for the in-order pipelined issuer:

* Create pipelined_issuer.py in src/soc/simple (alongside issuer.py).
* Copy interface and initialization code from TestIssuerInternal into PipelinedIssuer class, so that one can be substituted for the other transparently.
* Add an argument to test_issuer.py, to instantiate a PipelinedIssuer module instead of TestIssuerInternal. Likely, need to adjust HDLRunner in
  https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/test/test_runner.py;h=03d4fe96197706bd8c52b9fc28d3cd8952dc05cf;hb=HEAD#l132
(In reply to Cesar Strauss from comment #25)
> Current plan for the in-order pipelined issuer:
>
> * Create pipelined_issuer.py in src/soc/simple (alongside issuer.py).

to help with clarity on that, what i think i will do this afternoon is split out fetch_fsm into its own submodule. it is one of the few with clear in/out ready/valid signalling, whereas some of the others have dual ready/valid. had it been planned ahead back when TestIssuerInternal was being developed, they shouldn't have had that, but hey.

> * Copy interface and initialization code from TestIssuerInternal into
> PipelinedIssuer class, so that one can be substituted for the other
> transparently.

good idea

> * Add an argument to test_issuer.py, to instantiate a PipelinedIssuer module
> instead of TestIssuerInternal. Likely, need to adjust HDLRunner in
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/test/test_runner.py;h=03d4fe96197706bd8c52b9fc28d3cd8952dc05cf;hb=HEAD#l132

yes, that one bypasses TestIssuer itself due to the clock-domain crossing and the (optional) addition of a PLL and other things, none of which are relevant for actual instruction running in a unit test environment.
IRC log notes:

       lkcl  yehyeh. this makes it... awkward to turn into separate FSMs, which then in turn can be morphed into pipelines
19:49  lkcl  perhaps by cutting out SVP64 entirely first it would become much easier
19:50  lkcl  everything should be a forward-chain (only)
19:53  lkcl  with the sole exception being:
19:53  lkcl  * reading of PC (if it is detected to have been changed by TRAP or BRANCH)
19:54  lkcl  * reading of MSR (same, by TRAP or MTMSR)
19:54  lkcl  * a global stall condition
19:54  lkcl  * a global "core reset" condition
19:55  lkcl  that's pretty much it: that's the only "backwards" feedback, from later stages to earlier ones, and even PC and MSR are via the regfile (already), not by special datapaths
19:56  lkcl  oh, of course, the exception flags, from LDST.
19:58  lkcl  those are also backwards-propagated
19:58  lkcl  under... guess what: stall conditions of course :)
(In reply to Luke Kenneth Casson Leighton from comment #27)
> yehyeh. this makes it... awkward to turn into separate FSMs, which then
> in turn can be morphed into pipelines
> perhaps by cutting out SVP64 entirely first it would become much easier
> everything should be a forward-chain (only)

Sure.

I can start fixing TestIssuerInternal FSMs to be forward-chain only, if it helps.

In that case, I would leave the Fetch/Decode pipelines for later.
(In reply to Cesar Strauss from comment #28)
> Sure.
>
> I can start fixing TestIssuerInternal FSMs to be forward-chain only, if it
> helps.

i think it will: then, when copying them to make the pipelines, that actually becomes trivial.

> In that case, I would leave the Fetch/Decode pipelines for later.

i think it is an important intermediary step that makes the creation of pipelines almost trivial, especially when you can see how i split out FetchFSM: the changes needed to make that a "true" pipeline are actually negligible.
the following cases are now successfully detected:

* read-after-write: the instruction is issued but stalls; issue of other instructions is NOT stalled (making the core "superscalar")
* write-after-write: the instruction is NOT issued, but is captured and acknowledged back to the Issuer. attempts to issue continue, and will succeed when the WaW hazard clears
* read-and-write-by-same-instruction: this case is covered by DELAYING write-hazard setting by one clock cycle, such that the RaW hazard check does NOT include the instruction currently being issued

potential problems to monitor:

* if the write-hazard bitvector setting is delayed, and the write to its SR Latch is also delayed, then two instructions issued one clock cycle apart might not properly detect read-after-write hazards. right now this is undetectable because the TestIssuer FSM cannot issue instructions that fast (1 every 4-5 cycles)
* LD/ST instructions must not be acknowledged unless the opportunity for an exception has passed
* likewise branch and trap (anything that changes PC)
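the three detected cases as a plain-Python sketch of the issue-time decision, assuming single issue (at most one try_issue per clock). IssueChecker, regs and the returned strings are hypothetical. because an instruction's own write bits are only merged into the global vector on the next clock, "add r3, r3, r4" does not stall on itself, while a later reader of r3 sees the RaW and a later writer of r3 sees the WaW.

```python
class IssueChecker:
    """Sketch of the RaW / WaW / same-instruction cases (hypothetical)."""
    def __init__(self):
        self.global_wr = 0    # write hazards currently outstanding
        self.delayed_set = 0  # write bits to merge in on the NEXT clock

    def try_issue(self, rd_mask, wr_mask):
        if self.global_wr & wr_mask:
            return "waw-retry"  # not issued: the Issuer retries later
        # RaW only sees writes from *earlier* instructions, because this
        # instruction's own write bits are not set until the next clock
        raw = bool(self.global_wr & rd_mask)
        self.delayed_set |= wr_mask
        return "issued-stall-read" if raw else "issued"

    def clock(self, clr_mask=0):
        self.global_wr = (self.global_wr | self.delayed_set) & ~clr_mask
        self.delayed_set = 0


def regs(*nums):
    return sum(1 << n for n in nums)


ic = IssueChecker()
print(ic.try_issue(regs(3, 4), regs(3)))  # add r3, r3, r4: issued
ic.clock()
print(ic.try_issue(regs(3), regs(6)))     # reads r3: issued-stall-read
ic.clock()
print(ic.try_issue(regs(1), regs(3)))     # writes r3 again: waw-retry
```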
must sort MSR use get_state exactly as with PC and SVSTATE. also TODO create a TestIssuerBase class
i have this split out now, in a separate file:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/inorder.py;hb=HEAD

the goal here is to remove every FSM or, more to the point, make every FSM only one entry (one state). an intermediary phase is to make every FSM two-state only: one input ready/valid reader, one output ready/valid writer.

a Decode FSM (Decode stage) is currently missing; this should involve pdecode2 and nothing else (oh, except for state, containing pc, msr and svstate, i.e. CoreInput).

(In reply to Luke Kenneth Casson Leighton from comment #31)
> must sort MSR use get_state exactly as with PC and SVSTATE.
>
> also TODO create a TestIssuerBase class

both done
Cesar, i thought through some ideas, and if you pass the signal core.busy_o globally back to fetch, that will be sufficient.

Fetch FSM:

    if not core.busy_o:
        n.data.insn = fetched_insn
        n.o_valid = 1

Decode Phase:

    if not core.busy_o:
        (same thing)

in other words core.busy_o is the global stall condition. in this way it will be really simple and easy to do.