whilst DecodeA picks up that RA may be a zero immediate, this is not actually passed through Decode2ExecuteType, nor does the ALU CompUnit recognise it.
forgot that i had raised 334 last night. this one can therefore be about the CompUnit. the change needed is: * first that Decode2ExecuteType needs a flag ra_is_zero * second that CompAluOpSubset and CompLogicalALUSubset need likewise * third - and this is the kicker - that the FSM for src1_i needs a Mux to select RA or zero *and* it needs to *no longer* request a READ for src1 (rd.req[0]) if the ra_is_zero flag is set. * additionally that if the z flag is set, the rd_done part of the FSM does NOT wait for GO_READ on the src1 port (rd.go[0]) i will draw out a diagram. again. :)
Dear Luke, Please do not bother drawing a gate level diagram for the FSM. Your description is enough. For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do represent the whole FSM as an empty box. I find that I can better design an FSM by starting from a behavioral description, preferably in the form of a timing diagram. If you like, you can draw one in paper, of what you want to see in the GTKWave window. With arrows linking what edge causes which transition in another signal. What do you think? Regards, Cesar
(In reply to Cesar Strauss from comment #2) > Dear Luke, > > Please do not bother drawing a gate level diagram for the FSM. Your > description is enough. we need one for the documentation, which is an extremely important aspect of this project. plus i find it very difficult to think of timing related designs without a gate level drawing. as in i literally can't write these FSMs without one (!) this one is wrong (imm needs muxing) https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg the imm mux path in this one is correct https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg immed mode hooks RD_REQ2, switching it off. > For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do > represent the whole FSM as an empty box. > > I find that I can better design an FSM by starting from a behavioral > description, then comment 1 should do the trick? RD_REQ1 (rd.req[0]) needs to be disabled if oper_i.zero_a is set, and src_i[0] set to zero in all 64 bits. > preferably in the form of a timing diagram. i am used to examining those, drawing them not so much. > If you like, you can draw one in paper, of what you want to see in the > GTKWave window. With arrows linking what edge causes which transition in > another signal. > > What do you think? i can give it a shot once i am confident with the gate level. are you ok to go via the description for now?
(In reply to Luke Kenneth Casson Leighton from comment #3) > (In reply to Cesar Strauss from comment #2) > > Dear Luke, > > > > Please do not bother drawing a gate level diagram for the FSM. Your > > description is enough. > > we need one for the documentation, which is an extremely important aspect of > this project. I agree, documentation is essencial. But I argue that a diagram that shows too much detail, and goes all the way to gate level, is more like source code than documentation. > plus i find it very difficult to think of timing related designs without a > gate level drawing. > > as in i literally can't write these FSMs without one (!) Sure, I understand. It's just practice, and realizing the convenience. At the start, I used to input all my FPGA designs with a schematics editor. The FPGA library actually had blocks for many of the 74XX logic family parts. Then, as I learned about VHDL, I started translating some of my previous designs to it. Eventually, I got so used to it, that I started writing VHDL directly. It's so much more compact, so less tangled wires, so less time spent arranging the symbols around the page. Lastly, I learned about Verilog, and switched to it for its simplicity. > > For the dataflow (registers, ALU, muxes), yes, a diagram is useful, but do > > represent the whole FSM as an empty box. > > > > I find that I can better design an FSM by starting from a behavioral > > description, > > then comment 1 should do the trick? Yes, it's all I need. Regards, Cesar
Cesar, hi, just to let you know, this should *not* interfere with adding is_zero support. it however will be necessary to use this: if hasattr(self.oper_i, "zero_a"): # ok now we know that this operation requires zero_a support, we can # mux in zero_a. otherwise we must put register a straight through i am doing a minor code-morph on MultiCompUnit which makes it suitable for general use across multiple pipelines, as specified by "regspecs". an example is below. note that this example - which uses CompCROpSubset **NOT** CompALUOpSubset - is why we will need the "detection" above. please do not let this stop you from adding zero_a detection support, let us keep in touch and coordinate here. i will have to add "detect if there is an operand B immediate" in the same way, so you can see how it is done. class CompCROpSubset(Record): def __init__(self, name=None): layout = (('insn_type', InternalOp), ('fn_unit', Function), ('insn', 32), ('read_cr_whole', 1), ('write_cr_whole', 1), ) class CRInputData(IntegerData): regspec = [('INT', 'a', '0:63'), # 64 bit range ('INT', 'b', '0:63'), # 6B bit range ('CR', 'full_cr', '0:31'), # 32 bit range ('CR', 'cr_a', '0:3'), # 4 bit range ('CR', 'cr_b', '0:3'), # 4 bit range ('CR', 'cr_c', '0:3')] # 4 bit range class CROutputData(IntegerData): regspec = [('INT', 'o', '0:63'), # 64 bit range ('CR', 'full_cr', '0:31'), # 32 bit range ('CR', 'cr_o', '0:3')] # 4 bit range ... class CRPipeSpec(CommonPipeSpec): regspec = (CRInputData.regspec, CROutputData.regspec) opsubsetkls = CompCROpSubset
ok done for imm_data, here is the location, it is virtually a cookie-cut of what is done for imm_data, bear in mind it will need to be src_l.q[0] and because src_l.q[0] covers RA, where src_l.q[1] covers RB. likewise the src1_or_imm needs to go into sl[0] not sl[1] ``` # if the operand subset has "zero_a" we implicitly assume that means # src_i[0] is an INT register type where zero can be multiplexed in, instead. # see https://bugs.libre-soc.org/show_bug.cgi?id=336 #if hasattr(oper_r, "zero_a"): # select zero immediate if opcode says so. however also change the latch # to trigger *from* the opcode latch instead. # ... # ... # if the operand subset has "imm_data" we implicitly assume that means # "this is an INT ALU/Logical FU jobbie, RB is multiplexed with the immediate" if hasattr(oper_r, "imm_data"): # select immediate if opcode says so. however also change the latch # to trigger *from* the opcode latch instead. op_is_imm = oper_r.imm_data.imm_ok ```
(In reply to Luke Kenneth Casson Leighton from comment #6) > ok done for imm_data, here is the location, it is virtually a cookie-cut of > what > is done for imm_data, bear in mind it will need to be src_l.q[0] and because > src_l.q[0] covers RA, where src_l.q[1] covers RB. likewise the src1_or_imm > needs to go into sl[0] not sl[1] I have done the above. I held my commit because you seemed to be working on the file. It's OK to commit now?
(In reply to Cesar Strauss from comment #7) > (In reply to Luke Kenneth Casson Leighton from comment #6) > > ok done for imm_data, here is the location, it is virtually a cookie-cut of > > what > > is done for imm_data, bear in mind it will need to be src_l.q[0] and because > > src_l.q[0] covers RA, where src_l.q[1] covers RB. likewise the src1_or_imm > > needs to go into sl[0] not sl[1] > > I have done the above. excellent. > I held my commit because you seemed to be working on the file. yeh don't do that: just shove it in :) > It's OK to commit now? git pull to merge, first, then go for it. i happen to just be going out shopping anyway.
(In reply to Luke Kenneth Casson Leighton from comment #8) > (In reply to Cesar Strauss from comment #7) > > (In reply to Luke Kenneth Casson Leighton from comment #6) > > > ok done for imm_data, here is the location, it is virtually a cookie-cut of > > > what > > > is done for imm_data, bear in mind it will need to be src_l.q[0] and because > > > src_l.q[0] covers RA, where src_l.q[1] covers RB. likewise the src1_or_imm > > > needs to go into sl[0] not sl[1] > > > > I have done the above. > > excellent. > > > I held my commit because you seemed to be working on the file. > > yeh don't do that: just shove it in :) > > > It's OK to commit now? > > git pull to merge, first, then go for it. i happen to just be going out > shopping > anyway. looks great, cesar - unit test time. it should be as simple as setting dut.oper_i.zero_a.eq(1) - or adding an extra argument to op_sim() to allow that to be set dynamically - then setting a to a non-zero value and checking that, yes, actually, a is now irrelevant (set to zero). a more advanced version of the op_sim() test should be not to set rd.go.eq(0b11) in an ad-hoc fashion: instead to check rd_rel_o bits. this starts to be very much like the planned LD/ST unit test modifications, and you could probably use the exact same code for both. however please do everything in an incremental fashion: no more than 5-10/15 lines at a time when modifying existing code (if that is practical). so, start with adding the extra argument to op_sim() to allow zero_a to be set then add a test call in scoreboard_sim op_sim(a=12312313213, zero_a=True, ...) and so on.
cesar i created a common function for both zero a and imm b muxing. should be functionally the same. note that the Mux will automatically have its inputs extended with zero padding to the max operand width. so i replaced Const 0 with just 0
brilliant, just saw this, Cesar. confirmed that it is functional. commit 260625df9309f8f35541207cab431dd8dba90c5a Author: Cesar Strauss <cestrauss@gmail.com> Date: Sat May 23 19:52:08 2020 -0300 Add a few test cases with zero_a set, in combination with imm_ok so the next one is: we need something that's a bit more sophisticated, designed more along the lines of test_inout_mux_pipe.py: https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;h=03fddde0f7ab12ec9023fc57b913442ba42c120a;hb=HEAD#l52 not quite that sophisticated, but close. basically, we need: * one function (similar to TestInput.send) which takes care of *one* REQ_READ and GO_RD signalling bit. it should monitor: self.rd.req[N], then set self.rd.go[N] for *one* cycle (one blank yield) then set it self.rd.go[N] back to zero and confirm (assert) that self.rd.req[N] has gone to zero * another function (similar again to TestInput.rcv) which does the same thing for REQ_WRITE and GO_WRITE. * another function - a very simple one - which sets issue_i for one cycle (yield in between), then waits for "self.busy" to drop to 0 we can make it more sophisticated as we go along. it should be... only... about... 80 lines of code, at a guess.
https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_buf_pipe.py;hb=HEAD just occurred to me, see class Test3, this nay be simpler to start from. remove the for loops for now, add a parameter read_idx to the send function, and write_idx to rcv function. read_idx will refer to self.compunit.rd.rel[read_idx] you see how that would go?
(In reply to Luke Kenneth Casson Leighton from comment #11) > brilliant, just saw this, Cesar. confirmed that it is functional. > > commit 260625df9309f8f35541207cab431dd8dba90c5a > Author: Cesar Strauss <cestrauss@gmail.com> > Date: Sat May 23 19:52:08 2020 -0300 > > Add a few test cases with zero_a set, in combination with imm_ok > Still need a way to test the case of zero_a and/or imm_data being entirely absent, in which case the muxes are not generated.
(In reply to Cesar Strauss from comment #13) > Still need a way to test the case of zero_a and/or imm_data being entirely > absent, in which case the muxes are not generated. ah if you create a test which uses... mmm.... which one is dead-simple... CompCROpSubset, then that doesn't have imm_data or zero_a. see soc.experiment.alu_hier.DummyALU i've just added a DummyALU for you, it ignores operand b, ignores the actual operand type (up to you if you want to put that back, do something different), just copies a to o - nothing sophisticated. so you can copy test_compunit_regspec1(), call it test_compunit_regspec2() then cookie-cut scoreboard_sim() and op_sim() to remove trying to set imm_data, invert_a, and zero_a (because those don't exist in CompCROpSubset), and just assert that the result is equal to the 1st operand, a. if you really wanted to you could actually remove b from the inspec in test_compunit_regspec2() and also in DummyALU.
Found this: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l213 name = "data_r%d" % i data_r = Signal(self.cu._get_srcwid(i), name=name, reset_less=True) latchregister(m, self.get_out(i), data_r, req_l.q[i], name) We are giving the same name for two things, the output signal of the latchregister, and its internal DFF (given by the function parameter). nMigen will append "$1" to one of them, but which? Can we go and choose a name, given to its internal DFF, different from its output, in this and similar cases? Also: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l208 latchregister(m, self.oper_i, oper_r, self.issue_i, "oper_r") In this case, as it happens, oper_r sub-signals do not receive names that begin with "oper_r". So, what receives the "oper_r" name is actually the internal DFF, which confused me at no end on the GTKWave timing diagram. This is how I found this issue.
(In reply to Cesar Strauss from comment #15) > Found this: > > https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/ > compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l213 > > name = "data_r%d" % i > data_r = Signal(self.cu._get_srcwid(i), name=name, reset_less=True) > latchregister(m, self.get_out(i), data_r, req_l.q[i], name) > > We are giving the same name for two things, the output signal of the > latchregister, and its internal DFF (given by the function parameter). > nMigen will append "$1" to one of them, but which? wonderful, isn't it? > Can we go and choose a name, given to its internal DFF, different from its > output, in this and similar cases? sure. i'd suggest doing this: latchregister(m, self.get_out(i), data_r, req_l.q[i], name) --> latchregister(m, self.get_out(i), data_r, req_l.q[i], name+"_l") > > Also: > > https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/ > compalu_multi.py;h=75df2153517ecd733cd5270776e88d76b8df5d06;hb=HEAD#l208 > > latchregister(m, self.oper_i, oper_r, self.issue_i, "oper_r") sure - feel free to call it "oper_l" instead
(In reply to Luke Kenneth Casson Leighton from comment #16) > sure. i'd suggest doing this: > latchregister(m, self.get_out(i), data_r, req_l.q[i], name) > --> > latchregister(m, self.get_out(i), data_r, req_l.q[i], name+"_l") > sure - feel free to call it "oper_l" instead btw, commit this immediately - do not wait to "accumulate" a series of commits, and *definitely* do not put multiple separate and distinct "purposes" into the one commit. git is intelligent enough to merge commits that are not on the exact same line. therefore, to speed up collaboration, do take advantage of that: do not wait, just commit it - and commit "single-purpose". if you find you want to use the word "and" in the commit message, *stop* - it means that the commit is not single-purpose. more on that in HDL_workflow.
(In reply to Luke Kenneth Casson Leighton from comment #17) > btw, commit this immediately - do not wait to "accumulate" a series of > commits, nicely done, cesar. commit a5968ecf27c0be3b357342889aaa8e48c1a519cb Author: Cesar Strauss <cestrauss@gmail.com> Date: Sun May 24 16:44:12 2020 -0300 Rename the internal DFF of latchregisters to avoid conflict
(In reply to Luke Kenneth Casson Leighton from comment #18) > (In reply to Luke Kenneth Casson Leighton from comment #17) > > > btw, commit this immediately - do not wait to "accumulate" a series of > > commits, > > nicely done, cesar. > > commit a5968ecf27c0be3b357342889aaa8e48c1a519cb > Author: Cesar Strauss <cestrauss@gmail.com> > Date: Sun May 24 16:44:12 2020 -0300 > > Rename the internal DFF of latchregisters to avoid conflict Thanks. When I did the above, I noticed src/soc/experiment/compalu.py and src/soc/experiment/compldst.py have been suffering from bitrot. I suppose they are previous attempts, no longer updated, and kept for historical reasons?
(In reply to Cesar Strauss from comment #19) > When I did the above, I noticed src/soc/experiment/compalu.py and > src/soc/experiment/compldst.py have been suffering from bitrot. I suppose > they are previous attempts, no longer updated, and kept for historical > reasons? they're only there from the transition to multi-bit RD/WR signalling.
(In reply to Luke Kenneth Casson Leighton from comment #1) > * third - and this is the kicker - that the FSM for src1_i needs a Mux to > select RA or zero *and* it needs to *no longer* request a READ for src1 > (rd.req[0]) if the ra_is_zero flag is set. > > * additionally that if the z flag is set, the rd_done part of the FSM does > NOT wait for GO_READ on the src1 port (rd.go[0]) Just a reminder that the FSM still needs adjustment for rd.rel/rd.go. Checked in GTKWave that rd.rel is still being set unconditionally after issue_i.
(In reply to Cesar Strauss from comment #21) > Just a reminder that the FSM still needs adjustment for rd.rel/rd.go. > Checked in GTKWave that rd.rel is still being set unconditionally after > issue_i. if rd.rel[0] is being set when zero_a is on, yes that's bad. rd.go[N] is an external response to rd.req[N]. this is why it is critical to get rd.req and wr.req bits set right because the external user of the CompUnits kniws nothing about *why* they are set, they just respond.
255 # all request signals gated by busy_o. prevents picker problems 256 m.d.comb += self.busy_o.eq(opc_l.q) # busy out 257 bro = Repl(self.busy_o, self.n_src) 258 m.d.comb += self.rd.rel.eq(src_l.q & bro) # src1/src2 req rel 259 ah. right. try this, at line 258 sls = [] for l in sl: sls.append(l[2]) # get latch conditions self.rd.rel.eq(Cat(*sls) & bro) that is bringing in the immediate selection. i think. from the muxes.
m.d.comb += src_sel.eq(Mux(op_is_imm, self.opc_l.q, self.src_l.q[i])) sl[i][2] = src_sel hmm you see how that interacts, from the src_srl?
(In reply to Luke Kenneth Casson Leighton from comment #24) > m.d.comb += src_sel.eq(Mux(op_is_imm, self.opc_l.q, self.src_l.q[i])) > sl[i][2] = src_sel > > hmm you see how that interacts, from the src_srl? not quite correct, did something different, do git pull see if you can get the test to work. will take a look in morning.
interesting! # XXX - immediate and zero is not a POWER mode (and won't work anyway) # reason: no actual operands. #result = yield from op_sim(dut, 5, 2, InternalOp.OP_ADD, zero_a=1, # imm=8, imm_ok=1) #assert result == 8 this test is invalid, because it effectively is a "update register with immediate", disabling both A and B regs. i believe there are other opcodes to cover this, therefore it is not a valid test. which is good because the FSM doesn't cope with it (it relies on at least one RD.GO coming in, and if both operands are immediates, that can never happen)
argh yes it is a valid test. there is an add zero immediate instruction. so this means rd_done will never be set, and the FSM never proceeds to completion on 2-operand CompUnits
Don't lose sleep about it, I can handle the FSM.(In reply to Luke Kenneth Casson Leighton from comment #25) > not quite correct, did something different, do git pull see if you can > get the test to work. will take a look in morning. Don't lose sleep about it, I can handle the FSM. It's my bread and butter.
(In reply to Luke Kenneth Casson Leighton from comment #22) > (In reply to Cesar Strauss from comment #21) > > > Just a reminder that the FSM still needs adjustment for rd.rel/rd.go. > > Checked in GTKWave that rd.rel is still being set unconditionally after > > issue_i. > > if rd.rel[0] is being set when zero_a is on, yes that's bad. > > rd.go[N] is an external response to rd.req[N]. this is why it is critical > to get rd.req and wr.req bits set right because the external user of the > CompUnits kniws nothing about *why* they are set, they just respond. In this case, "external user" are the dependency matrices, right? I've read https://libre-soc.org/3d_gpu/architecture/6600scoreboard/. As far as I understand it, the Dependency Matrices also need to know themselves that the operand is immediate, and not set the corresponding FU-Reg Latch, correct? Is this already implemented?
(In reply to Cesar Strauss from comment #28) > Don't lose sleep about it, I can handle the FSM.(In reply to Luke Kenneth > Casson Leighton from comment #25) > > not quite correct, did something different, do git pull see if you can > > get the test to work. will take a look in morning. > > Don't lose sleep about it, I can handle the FSM. It's my bread and butter. :) i realised that actually there's probably nothing wrong with the modification i made: it's the unit test that must not check a flag (rd.req) that is never going to get set. (In reply to Cesar Strauss from comment #29) > (In reply to Luke Kenneth Casson Leighton from comment #22) > > if rd.rel[0] is being set when zero_a is on, yes that's bad. > > > > rd.go[N] is an external response to rd.req[N]. this is why it is critical > > to get rd.req and wr.req bits set right because the external user of the > > CompUnits kniws nothing about *why* they are set, they just respond. > > In this case, "external user" are the dependency matrices, right? yes. > I've read https://libre-soc.org/3d_gpu/architecture/6600scoreboard/. > > As far as I understand it, the Dependency Matrices also need to know > themselves that the operand is immediate, not quite. immediates are immediate, therefore there is no concept of a "dependency". therefore, technically you are correct, but it is by way of "total omission of involvement". > and not set the corresponding FU-Reg Latch, correct? by way of the DMs not in the slightest remote bit knowing *anything* about immediates.... yes. > Is this already implemented? yes. the DMs manage "dependencies". an immediate is not a Dependency Hazard (not a read or write register).
Regarding the parallel test: Pushed a proof of concept. Defective as it was, I pushed it anyway, hoping for some feedback on the bug. Found it myself literally seconds after pushing.
(In reply to Cesar Strauss from comment #31) > Regarding the parallel test: > > Pushed a proof of concept. > > Defective as it was, I pushed it anyway, hoping for some feedback on the > bug. Found it myself literally seconds after pushing. :) ok i added some stubs here.
(In reply to Luke Kenneth Casson Leighton from comment #32) > (In reply to Cesar Strauss from comment #31) > > Regarding the parallel test: > > > > Pushed a proof of concept. > > > > Defective as it was, I pushed it anyway, hoping for some feedback on the > > bug. Found it myself literally seconds after pushing. > > :) > > ok i added some stubs here. right. so i've pushed another commit which shows where we would expect to put the inputs in (constructor). rather than have them directly as arguments to op_sim(), all those arguments - a, b, op, inv_a, imm_ok, zero_a - they become the arguments to the constructor. also expected_o should be in the constructor as well.
(In reply to Luke Kenneth Casson Leighton from comment #33) > right. so i've pushed another commit which shows where we would expect to > put the inputs in (constructor). rather than have them directly as arguments > to op_sim(), all those arguments - a, b, op, inv_a, imm_ok, zero_a - they > become the arguments to the constructor. > > also expected_o should be in the constructor as well. Implemented the issue_i/busy_o protocol check, for the moment. Will add the above as I need them, incrementally.
(In reply to Cesar Strauss from comment #34) > Implemented the issue_i/busy_o protocol check, for the moment. i saw. looks good. > Will add the above as I need them, incrementally. ok. it will be fascinating to compare against the formal proof
(In reply to Luke Kenneth Casson Leighton from comment #35) > it will be fascinating to compare against the formal proof I saw src/soc/fu/compunits/formal/proof_fu.py. Very interesting. How about a formal proof of the ALU interface? Also, maybe DummyALU could be truly dummy (doesn't drive its signals), and let Assume drive them, in the CompUnit formal proof.
(In reply to Cesar Strauss from comment #36) > (In reply to Luke Kenneth Casson Leighton from comment #35) > > it will be fascinating to compare against the formal proof > > I saw src/soc/fu/compunits/formal/proof_fu.py. Very interesting. it is also very "clean". > How about a formal proof of the ALU interface? the interaction with it will be needed, yes, because of the signalling ready/valid. in particular the ALU may *not* necessarily be ready to proceed at the time operands are available. > Also, maybe DummyALU could be truly dummy (doesn't drive its signals), and > let Assume drive them, in the CompUnit formal proof. i have no idea. we do need some way to check that data actually gets passed through.
(In reply to Luke Kenneth Casson Leighton from comment #37) > (In reply to Cesar Strauss from comment #36) > > How about a formal proof of the ALU interface? > > the interaction with it will be needed, yes, because of the signalling > ready/valid. > > in particular the ALU may *not* necessarily be ready to proceed at the time > operands are available. Feel free to put this on my queue, if you wish (formal proof of the CompUnit-ALU interface). While I have no practical experience on formal proofs, I've been reading about it for some time, with the goal of applying it to my own designs. I like to think I've reached a good understanding.
(In reply to Cesar Strauss from comment #38) > > in particular the ALU may *not* necessarily be ready to proceed at the time > > operands are available. > > Feel free to put this on my queue, if you wish (formal proof of the > CompUnit-ALU interface). sure, go for it. > While I have no practical experience on formal proofs, I've been reading > about it for some time, with the goal of applying it to my own designs. I > like to think I've reached a good understanding. good place to learn. i would suggest simply adding them to proof_fu.py - focus on ready_(i/o), valid_(i/o) - these are accessible signals. if you assume that the DummyALU takes exactly one cycle between data_i and data_o, this may make life easier. the sequence will go: * p.valid_i is set HI by CU * p.ready_o is asserted HI by ALU in response (takes 1 cycle and asserts for ONLY one cycle) * n.valid_o is asserted HI by ALU * n.ready_i is set HI by CU * n.valid_o is asserted LO by ALU in response (takes 1 cycle) regarding data: * data input should be placed by CU on the input at the same time as it raises p.valid_i. it should REMAIN valid until p.ready_o is seen asserted. * data output should be ready and valid when n_ready_i is HI. all other times, it *may* still be left on the bus, but should *NOT* be taken to be valid. this is basically a standard ready/valid signalling system used by wishbone, AXI4, etc. etc. etc. etc.
cesar i've moved the unit tests for MultiCompUnit into their own file, they were getting very big and justified their own module.
I updated the unit test to deal with rdmaskn. For how long rdmaskn must be held valid? If I assert it for only one cycle, together with issue_i, it doesn't work. rel still rises on the next cycle. Shouldn't it be latched, as with zero_a and imm_ok?
Cesar: this is kiiinda true... indirectly :) i'm mulling over the words... rdmask disables (entirely disables) bits of rd.rel, and consequently rd.go *should* never be raised. rd.rel being the output (which is masked out), rd.go being the input, it's a "protocol violation" if raised. commit f930c1814f19c41f082c819494cf246edd09082d Author: Cesar Strauss <cestrauss@gmail.com> Date: Mon Jun 1 07:56:12 2020 -0300 Add rdmaskn parameter and assert it along issue_i Like zero_a and imm_ok, rdmaskn disallows the activation of go.
(In reply to Luke Kenneth Casson Leighton from comment #42) > rdmask disables (entirely disables) bits of rd.rel, and consequently > rd.go *should* never be raised. rd.rel being the output (which is > masked out), rd.go being the input, it's a "protocol violation" if > raised. So rdmask, as an input, must be driven and held high, for all the time that busy_o is active?
(In reply to Cesar Strauss from comment #41) > I updated the unit test to deal with rdmaskn. > > For how long rdmaskn must be held valid? the entire time (the entire busy cycle) > If I assert it for only one cycle, > together with issue_i, it doesn't work. rel still rises on the next cycle. > > Shouldn't it be latched, as with zero_a and imm_ok? hmmm hmmm that's a really good point. actually... it should be generated from the Decoder2 information (Decode2Execute1 in power_decoder2.py), and brought in via the CompXXXOpSubsets. if it came in via those, it would be part of oper_i and would automatically be latched. it's *different* information depending on which Function Unit is used, based on the regspecs which are collected together in soc/fu/compunits/compunits.py (from soc/fu/*/pipe_data.py) i'll need to think about it - can you set it manually for now and hold it set for the duration of busy_o == True? btw it should not be necessary to set the entire mask to all 1s, because there does not exist an FU which takes zero operands.
(In reply to Cesar Strauss from comment #43) > So rdmask, as an input, must be driven and held high, for all the time that > busy_o is active? yes - for now. i will look into whether (and how, and if) it should be moved into CompXXXOpSubsets (into oper_i).
(In reply to Luke Kenneth Casson Leighton from comment #45) > (In reply to Cesar Strauss from comment #43) > > > So rdmask, as an input, must be driven and held high, for all the time that > > busy_o is active? > > yes - for now. i will look into whether (and how, and if) it should be > moved into CompXXXOpSubsets (into oper_i). just fyi, Cesar: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/alu/pipe_data.py;h=33f54b026b95e06860a9f8b6a9a1b2797fdd59ee;hb=93fb5e930101a3bbb317e6180fc598f56b43cb9c#l67 line 67 (right at the end). there's a function, "rdflags", which is there for all soc.fu.*.pipe_data.py and it will be called by the Decoder and passed the data structure (Decode2Execute1Type) which has all of the information, from the actual instruction, about which register file ports (rd.req[0-N]) should be activated. it has to be done from the Decoder because only the Decoder (PowerDecode2) has the relevant information. but... what PowerDecode2 *doesn't* know is the order of the regfile ports at the Function Unit... so... the function "rdflags" defines it.
I've updated the Documentation, here: https://libre-soc.org/3d_gpu/architecture/decoder/
hi cesar i added a hack-test for when wrmak is zero. this basically says that when the ALU says that is done, if there isno data to write then the MCU write-request phase is also done. the logic becomes pretty similar to what is in LDSTCompUnit, which for ST does not write a reg, either.
I'm advancing on the ALU CompUnit parallel unit test. Having gone past the rd rel/go interface, I've moved on to the ALU input interface. The ALU input interface uses a valid/ready protocol. I assume that this, like rel/go, is similar to the wishbone interface. As such: 1) The port is idle when valid_o and ready_i are both zero. 2) If valid_o is high but ready_i is low, we have a wait state. 3) When both are high, we have a data transfer, and the transaction ends. 4) It is illegal to keep ready_i high when valid_o is low. As a diagram: clk 1 2 3 4 5 6 data_i . D D D D . valid_i . 1 1 1 1 . ready_o . . . . 1 . latched . . . . D . Back to back transactions are also permitted: clk 1 2 3 4 5 6 data_i . A A B C . valid_i . 1 1 1 1 . ready_o . . 1 1 1 . latched . . A B C . Well, while tracing the ALU input handshake, in the parallel unit test, in src/soc/experiment/test/test_compalu_multi.py, I found this: clk 1 2 3 4 5 6 CompUnit: valid_i . . 1 1 . . ALU: ready_o . . . 1 1 . ^ trouble On cycle 5, ready_o is still high while valid_i is low. This can generate trouble if ready_o is wired-or with other modules. Worse yet, on the src/soc/experiment/alu_hier.py unit test: valid_i . . 1 1 1 1 1 1 . . ready_o . . . 1 1 1 1 1 1 . This is supposed to cover a single transaction, but in my undestanding I see 5 acknolowdged back-to-back transactions followed by a illegal state. Can I go ahead and fix all these cases, to be more like wishbone? Or, is my assumption of wishbone likeness wrong?
(In reply to Cesar Strauss from comment #49) > I'm advancing on the ALU CompUnit parallel unit test. Having gone past the > rd rel/go interface, I've moved on to the ALU input interface. > > The ALU input interface uses a valid/ready protocol. I assume that this, > like rel/go, is similar to the wishbone interface. standard wishbone / AXI4 yes. the difference being that rel/go will NEVER be back to back (multiple transactions one after the other) however ready/valid can be. however in the MultiCompUnits it is never used for that, because MCU is single transaction only. > As such: > > 1) The port is idle when valid_o and ready_i are both zero. > 2) If valid_o is high but ready_i is low, we have a wait state. > 3) When both are high, we have a data transfer, and the transaction ends. or, takes place on that clock cycle, abd ends "as far as the next clock cycle is concerned". > 4) It is illegal to keep ready_i high when valid_o is low. > This is supposed to cover a single transaction, but in my undestanding I see > 5 acknolowdged back-to-back transactions followed by a illegal state. this does not surprise me. the alu_hier.py ALUs were written "in a hurry" and are there purely for test purposes only. > Can I go ahead and fix all these cases, to be more like wishbone? Or, is my > assumption of wishbone likeness wrong? yes please do... or... last week i considered throwing them out and using nmutil.singlepipe ALUs instead. if you feel it does not take a long time to "fix" alu_hier.py ALU classes please do go ahead. otherwise if it will take a long time it would be better to drop them and replace them with ALUs based on e.g. TestBufPipe (see nmutil unit tests) again: they are only there for test purposes, however as tests it is indeed good for them to properly conform to the required API
(In reply to Luke Kenneth Casson Leighton from comment #50) > if you feel it does not take a long time to "fix" alu_hier.py ALU classes > please do go ahead. One more thing. These test ALUs are non-pipelined, right? I mean, they take a finite number of cycles to sequence through their stages, but are not able to accept a new operation, each cycle, meanwhile, right? In this case, I think setting ready_o combinatorially, based on valid_i being high, and the sequence counter being zero, is enough. This will even keep ready_o from activating if valid_i is high and the ALU is still busy. Will check if it works.
(In reply to Cesar Strauss from comment #51) > (In reply to Luke Kenneth Casson Leighton from comment #50) > > if you feel it does not take a long time to "fix" alu_hier.py ALU classes > > please do go ahead. > > One more thing. These test ALUs are non-pipelined, right? I mean, they take > a finite number of cycles to sequence through their stages, but are not able > to accept a new operation, each cycle, meanwhile, right? correct. this is to simulate FSMs (such as a DIV operation) and also to provide a way to check that the MultiCompUnits work properly when there is a delay. it is a SEVERE violation of the entire MultiCompUnit API for a MCU to expect to process (generate) more than one result at a time. this because that implies that it is "abandoning" a result, in direct violation of its role and responsibility to monitor the result through to completion. the way to use pipelines: for an N Stage Pipeline, have a minimum of N MultiCompUnits with a fan-in and fan-out. this is what the class ReservationStations is for. the whole purpose of writing this parallel test is to be able to work towards the scenario of doing exactly that: have Nx MultiCompUnits throw 100-200 results at a pipeline (a "proper" N-stage pipeline) and check that the results are consistent. however that is for later. > In this case, I think setting ready_o combinatorially, based on valid_i > being high, and the sequence counter being zero, is enough. This will even > keep ready_o from activating if valid_i is high and the ALU is still busy. > Will check if it works. go for it.
commit 71a791cf5f05bd5858fdc60f4ccdca936edc1d75 Author: Cesar Strauss <cestrauss@gmail.com> Date: Sun Jun 7 16:32:11 2020 -0300 Make the test ALU conform to the valid/ready protocol Adjust the test case accordingly. It was definitively not a one-line change. But it was fun. Select any other operation than MUL, ADD, or SHR, and it will do a combinatorial subtraction, with zero-delay and no registering. The interesting thing about zero-delay is that, due to the absence of an internal register, it will not release the input data until the result is transferred out. Basically, the handshake signals of the input and output ports are directly connected between them, only the data is modified between the ports. Nice change, working with RTL again. Will go back to the parallel unit tests, now.
(In reply to Cesar Strauss from comment #53) > commit 71a791cf5f05bd5858fdc60f4ccdca936edc1d75 > Author: Cesar Strauss <cestrauss@gmail.com> > Date: Sun Jun 7 16:32:11 2020 -0300 > > Make the test ALU conform to the valid/ready protocol > > Adjust the test case accordingly. > > It was definitively not a one-line change. But it was fun. oh good! :) > Select any other operation than MUL, ADD, or SHR, and it will do a > combinatorial subtraction, with zero-delay and no registering. > > The interesting thing about zero-delay is that, due to the absence of an > internal register, it will not release the input data until the result is > transferred out. oh wait... that's a *combinatorial* ALU? iinteresting. i wonder what that could be used for. hmmm i tried putting OP_NOP in there, and both compunit1 and parallel locked up. this tends to indicate a combinatorial loop. any clues? > Basically, the handshake signals of the input and output ports are directly > connected between them, only the data is modified between the ports. > > Nice change, working with RTL again. Will go back to the parallel unit > tests, now. :)
(In reply to Luke Kenneth Casson Leighton from comment #54) > oh wait... that's a *combinatorial* ALU? iinteresting. i wonder > what that could be used for. Well, the ALU output is already latched in MultiCompUnit. Maybe having another output register, inside the ALU itself, is redundant? > hmmm i tried putting OP_NOP in there, and both compunit1 and parallel > locked up. this tends to indicate a combinatorial loop. any clues? I see. I'm looking into it.
(In reply to Cesar Strauss from comment #55) > (In reply to Luke Kenneth Casson Leighton from comment #54) > > oh wait... that's a *combinatorial* ALU? iinteresting. i wonder > > what that could be used for. > > Well, the ALU output is already latched in MultiCompUnit. Maybe having > another output register, inside the ALU itself, is redundant? ah right yes i understand, now. yes, absolutely. the only reason for the latch, inside the test ALU, was to simulate the concept of a pipeline stage. however you are absolutely right: with both the incoming src regs being latched, and the outgoing result being latched also by MultiCompUnit, it is not necessary ro latch inside alu_hier.ALU as well. > > > hmmm i tried putting OP_NOP in there, and both compunit1 and parallel > > locked up. this tends to indicate a combinatorial loop. any clues? > > I see. I'm looking into it. it may be because "now" is set at the default action. i have yet to work out how to detect combinatorial loops. there may be something that can be done using yosys synth, michael showed me something a couple weeks ago. ltp command?
(In reply to Luke Kenneth Casson Leighton from comment #56) > (In reply to Cesar Strauss from comment #55) > > (In reply to Luke Kenneth Casson Leighton from comment #54) > > > hmmm i tried putting OP_NOP in there, and both compunit1 and parallel > > > locked up. this tends to indicate a combinatorial loop. any clues? > > > > I see. I'm looking into it. > > it may be because "now" is set at the default action. i have yet to work > out how to detect combinatorial loops. there may be something that can be > done using yosys synth, michael showed me something a couple weeks ago. ltp > command? Found it manually, staring from valid_o, and changing comb to sync along the way. Interesting debug experience. I don't really recall being much afflicted by this sort of issue. But then, I normally try to avoid using latches as much as possible, this may be the reason. commit c8c13adf3da9a6ca8b7765edc597ec7fdf19e8d8 Author: Cesar Strauss <cestrauss@gmail.com> Date: Tue Jun 9 08:57:31 2020 -0300 Avoid a combinatorial loop on valid_o The path was: all_rd (1) -> all_rd_pulse (1) -> alui_l.s (1) -> -> alu.p.valid_i (1) -> ALU (zero-delay) -> alu.n.valid_o (1) -> -> rok_l.r (1) -> all_rd (0) Decided to break the loop on the reset of the read-done, write proceed latch (rok_l.r), with no ill effects on performance. Added a test case for this, using the zero-delay ALU (OP_NOP).
(In reply to Cesar Strauss from comment #57) > (In reply to Luke Kenneth Casson Leighton from comment #56) > > it may be because "now" is set at the default action. i have yet to work > > out how to detect combinatorial loops. there may be something that can be > > done using yosys synth, michael showed me something a couple weeks ago. ltp > > command? > > Found it manually, staring from valid_o, and changing comb to sync along the > way. Interesting debug experience. i've done that before :) > I don't really recall being much afflicted by this sort of issue. But then, > I normally try to avoid using latches as much as possible, this may be the > reason. yyeah they're not a "normal" way to do FSMs. however with some aspects being combinatorial and some being sync, we can't use "standard" techniques. nicely done. test_core.py still works, which is important.
Created attachment 64 [details] Suggested pipeline design for the ALU Comp Unit (In reply to Luke Kenneth Casson Leighton from comment #58) > (In reply to Cesar Strauss from comment #57) > > I don't really recall being much afflicted by this sort of issue. But then, > > I normally try to avoid using latches as much as possible, this may be the > > reason. > > yyeah they're not a "normal" way to do FSMs. however with some aspects being > combinatorial and some being sync, we can't use "standard" techniques. Did you consider implementing the ALU CompUnit as a pipeline? I have put some thought into it. See the attached diagram. First, consider that rel / go protocol seems to be compatible with the valid / ready protocol, by connecting n.ready_o->rd.rel_i and rd.go_o->n.valid_i. Likewise on the dest port. A pass-thru stage helps, I think. In the idle state, the switches will be set as shown, disconnecting the pipeline from the input and output ports, connecting them instead to a full sink and empty source respectively. So, there can be no activity on these ports. At issue time, you transition the dest switch from straight to crossed. Also, in each parallel arm, choose either one of the src or imm switches to cross. The barrier / combiner joins the input data, and outputs n.valid_o only when both p.valid_o are active. And then, it copies n.ready_i to both p.ready_i. This combiner takes care to inform the ALU that all its operands input are valid. The ALU outputs directly to the dest port. holding the data itself, according to the protocol, until the dest port can accept it. Notice the little need for FSMs and latches in the design. The total complexity is divided in manageable pieces, tucked inside the pipeline stages. If all stages work in isolation, we can be confident that the design will work when put together. Does it make sense? I think it is worth investing the time to implement this. If it turns up that it works, we can be more confident on the design. Even apply the same principle to LDSTCompUnit. I can give it a try, in parallel with other tasks. It would be a nice way to learn more about the pipeline API.
(In reply to Cesar Strauss from comment #59) > Created attachment 64 [details] > Suggested pipeline design for the ALU Comp Unit > > (In reply to Luke Kenneth Casson Leighton from comment #58) > > (In reply to Cesar Strauss from comment #57) > > > I don't really recall being much afflicted by this sort of issue. But then, > > > I normally try to avoid using latches as much as possible, this may be the > > > reason. > > > > yyeah they're not a "normal" way to do FSMs. however with some aspects being > > combinatorial and some being sync, we can't use "standard" techniques. > > Did you consider implementing the ALU CompUnit as a pipeline? you mean, MultiCompUnit itself? if so: this would violate the mission-critical protocol, the purpose for which MultiCompUnit exists. MultiCompUnit *must not* abandon its data. its absolutely must track the operands from start to finish. we call this a "Phase-aware Function Unit". "phase-aware" because it absolutely has to keep track of the start time and completion time. failure to do so would be absolutely disastrous. a pipeline *abandons* its data. it says "here, you deal with it, i'm abdicating all responsibility for data-tracking". i.e. once the data is in the pipeline, there is *no hardware* in the actual pipeline itself which tracks or handles responsibility for tracking the completion of that result. this was the subject of a rather excruciating discussion on comp.arch: it became clear that there simply does not exist an Industry-standard or Computer Science term that covers "Function Units which track data from operand to result". one had to be invented, and the best that could be thought of was "Phase-aware Function Unit" for an out-of-order system, it is absolutely imperative that we have Phase-aware Function Units, and that "management" task is what MultiCompUnit is for. a pipeline *cannot* fulfil that role. by definition and by design. however.... MultiCompUnit can *MANAGE* pipelines... by using the ready/valid protocol. and if we have multiple MultiCompUnit front-ends onto a single pipeline, *now* we have Reservation Stations. and that's what the ReservationStations class is for: https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/concurrentunit.py;hb=HEAD#l40 > I can give it a try, in parallel with other tasks. It would be a nice way to > learn more about the pipeline API. actually we do need a ReservationStation version of a MultiCompUnit unit test. a 2 or 3 stage "real" but dummy pipeline (simply passing its input through to its output, maybe doing +1 on each stage or something, or, heck, doing ADD would be perfectly on) then there would be 4 to 5 MultiCompUnits wired up to a ReservationStation version of that pipeline (see nmutil multipipe.py) the unit test would then be an adaptation of the test_inout_mux_pipe.py test, throwing multiple inputs in and seeing if they come out successfully at the end. https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;hb=HEAD there's some other code i can refer you to, if you're interested (and we should raise a new bugreport for it) bug #385
(In reply to Luke Kenneth Casson Leighton from comment #60) > (In reply to Cesar Strauss from comment #59) > > Did you consider implementing the ALU CompUnit as a pipeline? > > you mean, MultiCompUnit itself? if so: this would violate the > mission-critical > protocol, the purpose for which MultiCompUnit exists. I understand that MultiCompUnit is sequential by contract, and that allowing multiple operations in flight is a violation of that protocol. That was not my idea when I conceived using the pipeline API in the design of MultiCompUnit. The idea was to use the pipeline API to enforce sequential operation (eliminating all complex FSMs), and at the same time enforcing the ready/valid and rel/go protocols for free. Is started when I suspected that the rel/go protocol of the src and dest ports maybe were compatible with the valid/ready protocol of the ALU. I was then that I conceived of the src and dest ports being joined to the ALU using the pipeline API. The only thing, is making sure that only one operation is in flight in the pipeline at a any time. A simple FSM would take care of that, I think. But then, I would be crippling the power of the pipeline concept, which does seems like a waste. Oh well, at least I conceived an example of a parallel, dynamically reconfigurable pipeline, with new elements like cross-switches and barriers/combinators, that are maybe useful in other contexts.
(In reply to Cesar Strauss from comment #61) > (In reply to Luke Kenneth Casson Leighton from comment #60) > > (In reply to Cesar Strauss from comment #59) > > > Did you consider implementing the ALU CompUnit as a pipeline? > > > > you mean, MultiCompUnit itself? if so: this would violate the > > mission-critical > > protocol, the purpose for which MultiCompUnit exists. > > I understand that MultiCompUnit is sequential by contract, and that allowing > multiple operations in flight is a violation of that protocol. > > That was not my idea when I conceived using the pipeline API in the design > of MultiCompUnit. ah? > The idea was to use the pipeline API to enforce sequential operation > (eliminating all complex FSMs), and at the same time enforcing the > ready/valid and rel/go protocols for free. ahh okaaay now i get it. sorry: comparison-and-difference was what it took for me to understand. > Is started when I suspected that the rel/go protocol of the src and dest > ports maybe were compatible with the valid/ready protocol of the ALU. yes this is extremely likely. > I was then that I conceived of the src and dest ports being joined to the > ALU using the pipeline API. > > The only thing, is making sure that only one operation is in flight in the > pipeline at a any time. A simple FSM would take care of that, I think. > > But then, I would be crippling the power of the pipeline concept, which does > seems like a waste. not quite. it's that "tracking" which is paramount, everything else is secondary. > Oh well, at least I conceived an example of a parallel, dynamically > reconfigurable pipeline, with new elements like cross-switches and > barriers/combinators, that are maybe useful in other contexts. :) what you've come up with is basically a pipeline-compatible "information collector and distributor", where: * certain information has to come in from multiple different sources * processing has to take place which cannot begin until all incoming information is received * processed information has to be distributed out to multiple destinations this may indeed prove extremely useful, although under some very special circumstances. given that the ready/valid protocol is basically Wishbone compatible it *might* turn out to be the case that it's useful for special Bus Transactions. now that i think about it - it was over a year ago - i did sort-of begin something like this (at least on the receiving side), it was a multi-bit ready/valid with a funnel. i didn't exactly abandon it... it just turned out to be complex. i really like it... however given the time invested into MultiCompUnits, and the timescales, we really should prioritise, and come back to this idea later. can you raise a bugreport with a summary and a link to the attachment?
(In reply to Luke Kenneth Casson Leighton from comment #62) > i really like it... however given the time invested into MultiCompUnits, > and the timescales, we really should prioritise, and come back to this > idea later. > > can you raise a bugreport with a summary and a link to the attachment? Done, with deferred status. https://bugs.libre-soc.org/show_bug.cgi?id=391 Placed there a cleaned up version of the diagram.
(In reply to Cesar Strauss from comment #49) > The ALU input interface uses a valid/ready protocol. > > 1) The port is idle when valid_o and ready_i are both zero. > 2) If valid_o is high but ready_i is low, we have a wait state. > 3) When both are high, we have a data transfer, and the transaction ends. > 4) It is illegal to keep ready_i high when valid_o is low. After going through the Pipeline API, I realized case 4 is not illegal at all. It should read: 4) If ready_i is high but valid_o is low, we have a wait state, from the sending side. I changed the test ALU input port to reflect this, by not forcing p.ready_o low when p.valid_i is low. commit 2b4b3a653806c543a17b5f6e6db2c00d24996210 Author: Cesar Strauss <cestrauss@gmail.com> Date: Sun Jun 28 18:38:03 2020 -0300 Let p.ready_o be active while the test ALU is idle The valid/ready protocol doesn't actually forbid p.ready_o being active while p.valid_i is inactive. It just mean that the ALU is idle, and is ready to accept new data. This should help avoiding potential combinatorial loops from p.ready_o to p.valid_i.
(In reply to Luke Kenneth Casson Leighton from comment #11) > basically, we need: > > * one function (similar to TestInput.send) which takes care of *one* > REQ_READ and GO_RD signalling bit. it should monitor: > self.rd.req[N], then set self.rd.go[N] for *one* cycle (one blank yield) > then set it self.rd.go[N] back to zero and confirm (assert) that > self.rd.req[N] has gone to zero commit ff916d3932fc4fe90c9e8d1cc3bfa8c3818dceb9 Author: Cesar Strauss <cestrauss@gmail.com> Date: Wed Oct 28 19:57:55 2020 -0300 Implement an operand producer that talks the rel_o/go_i handshake It can be instantiated once for each operand port, working in parallel with the main test-bench process. https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=ff916d3932fc4fe90c9e8d1cc3bfa8c3818dceb9 > * another function (similar again to TestInput.rcv) which does the same > thing for REQ_WRITE and GO_WRITE. > > * another function - a very simple one - which sets issue_i > for one cycle (yield in between), then waits for "self.busy" to drop to 0 > > we can make it more sophisticated as we go along. > > it should be... only... about... 80 lines of code, at a guess. This will be next.
I've finished the CompUnit parallel test for the Shifter. Next, I'll port the ALU tests, followed by the Load / Store CompUnit. It features: 1) Operands can arrive in any order, or simultaneously Simulated delays are specified per port. 2) Results (if multiple) can also be accepted in any order 3) Results are checked as they are produced 4) Detects duplicated or missing operand requests or results A running transaction counter is maintained at each port. As busy is negated, all counters must match. To see: 1) Update nmutil, to get the latest write_gtkw functionality 2) Run: $ python ~/src/soc/src/soc/experiment/test/test_compalu_multi.py $ gtkwave ~/test/soc/test_compunit_fsm1.gtkw There are three transactions, and each time the order of arrival of operands varies.
(In reply to Cesar Strauss from comment #66) > I've finished the CompUnit parallel test for the Shifter. Next, I'll port > the ALU tests, followed by the Load / Store CompUnit. fantastic, Cesar. out of curiosity, given the subject of this bugreport: is testing of operations that are zero for A included? :) actually this is more for the PowerDecoder2 (or sub-decoders), now.
(In reply to Luke Kenneth Casson Leighton from comment #67) > out of curiosity, given the subject of this bugreport: is testing of > operations that are zero for A included? :) actually this is more for the > PowerDecoder2 (or sub-decoders), now. Not yet, but it will be. For simplicity, I began with the Shifter, that doesn't have either "zero A" or immediates. But I really will implement these next, for the ALU test case.
(In reply to Cesar Strauss from comment #68) > (In reply to Luke Kenneth Casson Leighton from comment #67) > > out of curiosity, given the subject of this bugreport: is testing of > > operations that are zero for A included? :) actually this is more for the > > PowerDecoder2 (or sub-decoders), now. > > Not yet, but it will be. fantastic. > For simplicity, I began with the Shifter, that doesn't have either "zero A" > or immediates. But I really will implement these next, for the ALU test case. great to hear, this will confirm the low level i wonder... although the CompUnit itself needs zero-immediate tests, to make sure it can cope at the low level, really, we also need actual unit tests (test_*_compunit.py) at the level up. for the low-level that you are doing (which is equally as important), it does not interact with PowerDecoder2. the Op Subset Record is given to CompUnit *by* PowerDecoder2, and it is PowerDecoder2 which analyses the instruction and identifies whether the register is RA or RA-or-Zero. that interaction is what also needs testing: a nonzero value placed into the regfile, use RA-or-Zero, have PowerDecoder2 decode that, and confirm that zero was passed as the operand.
(In reply to Luke Kenneth Casson Leighton from comment #69) > that interaction is what also needs testing: a nonzero value placed into the > regfile, use RA-or-Zero, have PowerDecoder2 decode that, and confirm that > zero was passed as the operand. Understood. Meanwhile, the ALU parallel test case is done, including handling of the zero_a and imm_ok cases. These operand producers are kind of "dumb": they will happily respond to a request, even if zero_a or imm_ok is asserted. But, they will still increment their transaction counter. The counters are checked at the end, when busy is negated, and the mismatch will be detected. There is still a DummyALU CompUnit, with three operands, that remains to be ported. Also, I will remove my earlier unfinished attempt.
I finished porting the DummyALU test case. I found a bug in the test case itself. DummyALU is built with CompCROpSubset, but the test case instanced a MultiCompUnit with CompALUOpSubset. The next step is to add test cases for read and write masks, and shadow/go_die. To see the current work: $ python ~/src/soc/src/soc/experiment/test/test_compalu_multi.py # MultiCompUnit + FSM Shifter: $ gtkwave test_compunit_fsm1.gtkw # MultiCompUnit + ALU: $ gtkwave test_compunit_regspec1.gtkw # MultiCompUnit + DummyALU: $ gtkwave test_compunit_regspec3.gtkw
ah i'm so sorry, that was me trying to reduce the signals and codesize of examples because CR subset is much smaller, sorry! not properly documented.
(In reply to Cesar Strauss from comment #71) > The next step is to add test cases for read and write masks, and > shadow/go_die. The ALU CompUnit unit test now checks that masked-out input and output ports do not make requests. Individual test cases can specify whether any port is masked-out. The next step is to actually invent some new operations for the test ALU, that do not make use of some of the read ports, and maybe add some write ports to it as well.
(In reply to Cesar Strauss from comment #73) > (In reply to Cesar Strauss from comment #71) > > The next step is to add test cases for read and write masks, and > > shadow/go_die. > > The ALU CompUnit unit test now checks that masked-out input and output ports > do not make requests. Individual test cases can specify whether any port is > masked-out. fantastic. > The next step is to actually invent some new operations for the test ALU, > that do not make use of some of the read ports, the simplest one there is just bit-inversion (o = ~a). or, perhaps, add 1? > and maybe add some write ports to it as well. hmm i can't think of anything that would be needed to be very difficult, it only needs to be different and detectable. how about add 1 for port 1 and subtract 1 for port 2?
(In reply to Luke Kenneth Casson Leighton from comment #74) > (In reply to Cesar Strauss from comment #73) > > (In reply to Cesar Strauss from comment #71) > > The next step is to actually invent some new operations for the test ALU, > > that do not make use of some of the read ports, > > the simplest one there is just bit-inversion (o = ~a). or, perhaps, add 1? Thanks, but I ended up going with sign extension, there was already a MicrOp Enum allocated to it. I also added a NOP, which does not use any read or write ports. > > and maybe add some write ports to it as well. > > hmm i can't think of anything that would be needed to be very difficult, it > only needs to be different and detectable. how about add 1 for port 1 and > subtract 1 for port 2? Thanks, but I ended up implementing a simple condition register output port, for all existing instructions, conditioned on the rc bit which was already available on CompALUOpSubset. So, there are now test cases that include all combination of masked read and write ports. Some observations: 1) It seems that MultiCompUnit depends on the ALU *.ok output fields being held valid long enough into the write phase. In the test ALU, I tried holding *.ok valid just for the duration of the ALU valid_o, and it didn't work. I ended up instead directly decoding the input operation, combinatorially, into the *.ok, so they do stay valid during the entire instruction cycle. 2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into a Signal, truncating it. This works, but relies on the "ok" field being the last field. I'm now moving on to the final task, of testing shadown/go_die.
(In reply to Cesar Strauss from comment #75) > > subtract 1 for port 2? > > Thanks, but I ended up implementing a simple condition register output port, > for all existing instructions, conditioned on the rc bit which was already > available on CompALUOpSubset. both great, it sounds like this was easier. fantastic. > So, there are now test cases that include all combination of masked read and > write ports. > > Some observations: > > 1) It seems that MultiCompUnit depends on the ALU *.ok output fields being > held valid long enough into the write phase. yes, very important. they are the "data is valid and needs to be written" signals. they are needed because some ALUs will decide *themselves* if the data is to be written (XER.so for example is done this way. you don't know if it will be modified) i.e. if the "ok" flag is False then the write.req must definitely not be raised > In the test ALU, I tried > holding *.ok valid just for the duration of the ALU valid_o, and it didn't > work. I ended up instead directly decoding the input operation, > combinatorially, into the *.ok, so they do stay valid during the entire > instruction cycle. > > 2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the > *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into > a Signal, truncating it. ah, right: if you find any of those... yyyees, strictly speaking this should not happen: there is supposed to be detection of the data/ok and to create a suitable Record, matching it. i may have missed some of those and used Signal instead. at least however nmigen will cope fine by flattening and correctly transferring the data. > This works, but relies on the "ok" field being the > last field. that definitely needs documenting. > > I'm now moving on to the final task, of testing shadown/go_die. star. this one is important for the predication, and when we get to SIMD.
(In reply to Luke Kenneth Casson Leighton from comment #76) > (In reply to Cesar Strauss from comment #75) > > 2) I noticed that, for the CompUnit outputs, MultiCompUnit strips out the > > *.ok ALU output fields by flattening the Data Record (*.data and *.ok) into > > a Signal, truncating it. > > > ah, right: if you find any of those... yyyees, strictly speaking this should > not happen: there is supposed to be detection of the data/ok and to create a > suitable Record, matching it. i may have missed some of those and used > Signal instead. Sure. Here, you see data_r is a Record: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l268 A little below (line 281) it is appended to "drl". Then, drl[i] is placed in dest[i]. which is a Signal: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l356 The issue is actually mentioned in a comment, when dest[i] is created: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=d7e32f28c556e76aff9be146ce280eba9745bb09;hb=HEAD#l83
(In reply to Cesar Strauss from comment #77) > The issue is actually mentioned in a comment, when dest[i] is created: > > https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/ yep, coming back to me. been a while. yes, back where lro=get_out() is used, strictly speaking the same thing should be done. however with those being internal Signals that get copied into the Regspec outputs, restoring the .ok at that point, we get away with it :)