See the "LDST Comp Unit" section at https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ for walk-through videos on how it is supposed to work. the code under test is https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst_multi.py;hb=HEAD
Created attachment 53: gtkwave save file for l0 cache/buffer
Created attachment 54: gtkwave save file for ldst comp unit
the "ultimate" success test of this is for experiment/score6600_multi.py to not go permanently into "busy" mode. the videos show how LDSTCompUnit is *supposed* to work... but doesn't quite. the issue on the current commit is that the address latching is not occurring - not being transferred through correctly to mem_w_data - for ST operations. i *think* it is working fine on LD operations. you can see in the gtkwave timing diagram that: * cycle @ 50ns mem_w_en (and so mem_w_data) change * however cycle @ 40ns *should* have set mem_w_addr to 0x6 this *actually* occurs 4 cycles later for the *second* ST operation! if you then look at alu_o (further up), you can see it is: * 0x0002 @ 20ns * 0x0006 @ 30ns (the required, correct value) * however ad__rel - the signal that says "address is valid" was *one cycle early* (ad__rel in gtkwave is the name for the object ad.rel, and there is a "convenience" variable in the python code ad_rel_o which points to self.ad.rel) it's nearly there. notes: * every req signal (rd.req[0], rd.req[1], rd.req[2], wr.req[0], wr.req[1], ad.req and st.req) has a corresponding "go" signal (rd.go[0] etc.) these are from the data structure created on-the-fly by "go_record" * outgoing request ("req") signals may be held ASSERTed indefinitely until such time as their associated "go" signal is ASSERTed for one cycle * on assertion of the corresponding "go" signal, the associated "req" signal **MUST** be de-asserted this is basically wishbone etc. etc. etc. standard ready_o / valid_i signalling conventions in TestLDSTCompUnit, you can see the following line: m.d.comb += self.ad.go.eq(self.ad.rel) # link addr-go direct to rel this is done deliberately because otherwise it becomes necessary to write a split pair of co-routines in the unit test: * one that watches ad.rel and sets ad.go for one cycle * one that watches wr.rel[1] and sets wr.go[1] for one cycle strictly speaking *all* {sig}.go signals could be made equal to all {sig}.rel signals, combinatorially, and the code should function perfectly correctly.
see also http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006782.html (with thanks to Cesar)
one important set of API conventions:

* L0CacheBuffer's operating convention is that the address must be *EXTERNALLY* sustained and held, as must its associated addr_ok line.
* this is because, otherwise, L0CacheBuffer itself would have to have an address latch.
* strictly speaking, LDSTCompUnit does not need an actual latch for the address *at all*.
* the reason is that, as you can see from the diagram https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg, all of the information necessary to "fire" the adder and make the address "valid" is *already latched*, and will *remain* latched until the busy signal is released:
  - op2 is latched
  - the immediate is latched
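in other words, the EA can be a purely combinatorial function of already-latched state - something like this sketch (hypothetical signal names, not the actual compldst_multi.py code):

    # op1, op2 and the immediate sit in latches that stay stable for the
    # whole time "busy" is asserted, so the sum is sustained for as long
    # as L0CacheBuffer needs it: no separate address latch required.
    comb += addr_o.eq(op1_r + Mux(imm_ok, imm_r, op2_r))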
    +++ b/src/soc/experiment/compldst_multi.py
    @@ -477,7 +477,7 @@ class LDSTCompUnit(Elaboratable):
             comb += pi.op.eq(self.oper_i)  # op details (not all needed)
             # address
             comb += pi.addr.data.eq(addr_r)  # EA from adder
    -        comb += pi.addr.ok.eq(self.ad.go)  # "go do address stuff"
    +        comb += pi.addr.ok.eq(alu_ok)  # "go do address stuff"
             comb += self.addr_exc_o.eq(pi.addr_exc_o)  # exception occurred
             comb += addr_ok.eq(self.pi.addr_ok_o)  # no exc, address fine
             # ld - ld gets latched in via lod_l

ok, this "fixes" the address issue for ST. alu_ok is the signal that indicates that the ALU address is ok to proceed. however, given that alu_valid and alu_ok are *sustained*, this causes problems on the alu SR latch, which, if using *only* alu_ok to reset it, would result in alu_l being permanently held OFF.

to fix that, i am doing this:

    # alu latch. use sync-delay between alu_ok and valid to generate pulse
    comb += alu_l.s.eq(reset_i)
    comb += alu_l.r.eq(alu_ok & ~alu_valid & ~rda_any)

alu_ok is one clock cycle behind alu_valid: therefore, alu_ok & ~alu_valid will generate the required one-clock pulse to indicate that the alu latch can safely capture the address (which, as noted previously, we don't strictly need to do, but hey).

moving on from there: LD is currently bouncing around like mad. the ld_ok condition is being raised, which should end the FSM, however it does not.
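returning to the pulse idiom above: its general form (a sketch, not the actual compldst_multi.py code) is to keep a sync-delayed copy of the sustained level, then AND it with the inverted original to pick out a single edge:

    level_r = Signal()                        # one-clock-delayed copy
    m.d.sync += level_r.eq(level)
    m.d.comb += pulse.eq(level_r & ~level)    # HI for exactly one cycle
    # (level & ~level_r would pick out the opposite edge instead)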
On Sun, May 17, 2020 at 1:59 PM Cesar Strauss <cestrauss@gmail.com> wrote:
> I think I have enough information available, to proceed. I'm beginning
> already to have some understanding of the problem at hand.

there are *eight* possible input-combinations that the FSM has to handle, based on:

* op: LD or ST
* mode: immediate or indexed address-calculation (termed "Effective Address")
* update: EA is to be output as a *SECOND* register, yes/no

these 8 permutations make the FSM riiiiidiculously complicated, and there's really nothing that can be done about it: we need to be compliant with POWER9 requirements.

* op == LD will cause the FSM to ignore operand3. this has to include *NOT* requesting operand3 (not setting rd.req[2]). however, because it is a LD, the LD data will be latched from the L0CacheBuffer port: thus, when that happens, wr.req[0] MUST send out a request for the REGFILE write port. when this is available, wr.go[0] will be activated for ONE cycle. after that one cycle, wr.req[0] MUST be dropped.
* if op == ST, operand3 contains the value to be stored, so must be rd.req[2]-ested. however, because it is a ST, there is no LD data to output, and therefore wr.req[0] must not be activated.
* immediate-mode will cause the FSM to use the immediate *instead* of operand2. therefore, in immediate-mode, operand2 must *NOT* be requested (rd.req[1] must never be set).
* update=TRUE will cause the FSM to activate wr.req[1] and wait for wr.go[1]. when wr.go[1] is activated, this indicates to the FSM that the latched address (EA) in addr_o is being broadcast onto a REGFILE WRITE port. it will get *one* chance to do that (one cycle), and MUST drop wr.req[1] on the next cycle: its job is then "done". this is *independent* of the logic that manages / monitors the LD/ST Port (!) why? because the address computation (EA) is an independent, parallel path from the actual LD/ST.

(the request rules are sketched in code at the end of this comment.)

it's... just... ridiculously complicated. at the very least, if you can review it, and greatly improve the unit tests, it would be fantastic. as i said, i *might* have got it working, 10 minutes ago, however the complexity is so high i am not in the slightest bit confident of that, nor of the correctness of the unit tests.
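the request rules above boil down to something like this sketch (hypothetical signal names, and deliberately ignoring the latch/reset behaviour that must drop each req after its one-cycle go):

    comb += rd.req[0].eq(busy_o)                  # operand1: always needed
    comb += rd.req[1].eq(busy_o & ~op_imm)        # operand2: not in immediate-mode
    comb += rd.req[2].eq(busy_o & ~op_is_ld)      # operand3: ST data, ST only
    comb += wr.req[0].eq(ld_latched & op_is_ld)   # LD result -> regfile, LD only
    comb += wr.req[1].eq(adr_latched & op_update) # EA write-back if update=TRUE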
(In reply to Luke Kenneth Casson Leighton from comment #7) > at the very least if you can review it, and greatly improve the unit > tests, it would be fantastic. > > as i said i *might* have got it working, 10 minutes ago, however the > complexity is so high i am not in the slightest bit confident of that, > nor of the correctness of the unit tests. Sure, I'll have a look into it. I'm glad you were able to advance. I was worried this issue might have got you blocked, for two days already, and turned itself into a time sink. Regards, Cesar
(In reply to Cesar Strauss from comment #8)
> Sure, I'll have a look into it.

appreciated.

> I'm glad you were able to advance. I was worried this issue might have got
> you blocked, for two days already,

about 20 days so far. in addition to around 20 days of analysis prior to that. and about a month, last year, on an earlier version.
I got my development environment running, by consulting https://libre-soc.org/HDL_workflow/.

Then, I ran the testbench at experiment/compldst_multi.py, generated the timing diagram, and confirmed in GTKWave that the problem reported in comment #3 does not occur any longer in the latest git master.

I started reading the reference material, in order to get a basic knowledge of the functionalities and interfaces.

Then, I followed the nMigen tutorial at https://github.com/RobertBaruch/nmigen-tutorial, while reading the code of the LD/ST Comp Unit itself, and as a result I got a reasonable understanding of both.

I'll now go back to the reference material a little more. Once I get a better understanding, I'll return to the code and proceed with reviewing it and improving the unit tests. I have formed some thoughts about this already.

I'll ask if I need further clarifications, and report my progress.

Regards,

Cesar
(In reply to Cesar Strauss from comment #10)
> I got my development environment running, by consulting
> https://libre-soc.org/HDL_workflow/.
>
> Then, I ran the testbench at experiment/compldst_multi.py, generated the
> timing diagram, and confirmed in GTKWave that the problem reported in
> comment #3 does not occur any longer in the latest git master.
>
> I started reading the reference material, in order to get a basic knowledge
> of the functionalities and interfaces.
>
> Then, I followed the nMigen tutorial at
> https://github.com/RobertBaruch/nmigen-tutorial, while reading the code of
> the LD/ST Comp Unit itself, and as a result I got a reasonable understanding
> of both.

excellent. similar to https://bugs.libre-soc.org/show_bug.cgi?id=316 - if you'd like, we can find a practical and useful example that's an awful lot less complicated than this one, by way of getting started.

> I'll now go back to the reference material a little more. Once I get a
> better understanding, I'll return to the code and proceed with reviewing it
> and improving the unit tests. I have formed some thoughts about this already.

me too. please, when you have a moment, do describe them. here's a couple:

* we do need a way to test dual ports (two CompLDSTs with two ports, and 4-with-4 as well) which write to the same memory. this will test whether the (prototype) L0CacheBuffer is doing its job (which i am not certain that it does: it handles _one_ LD or _one_ ST correctly).
* we need some tests to be delayed-response (to each of the rd and wr *and* go.ad *and* go.st) and some to be immediate-response.

these can be done using multiple "processes". you notice in the simulation that a function is passed in: it is possible to pass in *multiple* such functions. hmm hmm, where's an example of that...

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;hb=HEAD

what that does is: set up a pipeline with a 4-input, 4-output "muxer" at the input and output. a context (index) is passed down the pipeline, and used to associate the "result" back with its corresponding output.

so you can see in InputTest that there are *eight* separate and distinct, simultaneously-operating "processes": 4 do sending, 4 do receiving. they all operate on the same (global) python data structure, which gives a way for the python unit test to actually check that they're doing the right thing:

a) that no data is lost
b) that data sent from input N is *received* at output N
c) that the order of any data sent from input N to output N is preserved, regardless of the interleaving order.

so, in a similar fashion, a class could be designed where one function takes care of rd.req[N] and rd.go[N] acknowledgement (for N=0..2), and likewise another function responds to ad.req, another to wr.req[0..1], and so on. because those "processes" are all members of the same class, they have access to the full "state" of all other "processes", and consequently can verify and report that the request-signals are coming in correctly (or at all), and in the right order. (a sketch of this structure is at the end of this comment.)

in addition, under the control of the caller (the unit test), those individual processes could be told "respond immediately to the signal" or "respond with an arbitrary delay" extremely easily: a simple for-loop with a "yield" in it - as you can see at line 101 in test_inout_mux_pipe.py - will create that type of behaviour.
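a minimal sketch of that structure, assuming the nmigen.back.pysim API in use at this time (class and method names hypothetical):

    from nmigen.back.pysim import Simulator

    class LDSTTestState:
        def __init__(self, dut):
            self.dut = dut
            self.reads_seen = []   # shared state, visible to all processes

        def rd_responder(self, port, delay=0):
            def process():
                while True:
                    # wait for this port's read-request
                    while not (yield self.dut.rd.rel[port]):
                        yield
                    for _ in range(delay):   # parameterised response delay
                        yield
                    self.reads_seen.append(port)
                    # one-cycle go pulse
                    yield self.dut.rd.go[port].eq(1)
                    yield
                    yield self.dut.rd.go[port].eq(0)
            return process

    def run_test(dut):
        sim = Simulator(dut)
        sim.add_clock(1e-6)
        st = LDSTTestState(dut)
        for i in range(3):                   # one process per rd port
            sim.add_sync_process(st.rd_responder(i, delay=i))
        # the main stimulus process (issue, operands, checks) goes here
        sim.run_until(100e-6, run_passive=True)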
beyond that: in a similar way to InputTest, setting up multiple CompLDSTUnits, and getting them to hammer the same L0CacheBuffer via multiple ports, is also a very important test, as it is the way to ensure that multiple LD/ST operations can operate in parallel and not corrupt memory (or be lost).

beyond *that*, at some point we need to also arbitrarily test raising an "exception" flag, which will result in the "shadow" condition being pulled, and that should result in the complete cancellation of the operation.

> I'll ask if I need further clarifications, and report my progress.

appreciated.
(In reply to Luke Kenneth Casson Leighton from comment #11) > excellent. similar to https://bugs.libre-soc.org/show_bug.cgi?id=316 > if you'd like we can find a practical and useful example that's an awful > lot less complicated than this one, by way of getting started. No need, it's fine. I think I got the hang of nMigen now. I got over the initial shock some time ago, while following this project. Verilog is my HDL of choice, but the fundamental concepts are the same.
(In reply to Cesar Strauss from comment #12) > No need, it's fine. I think I got the hang of nMigen now. I got over the > initial shock some time ago, while following this project. very funny :) > Verilog is my HDL > of choice, but the fundamental concepts are the same. yes. everything is there.
Dear Luke, I see uses of "latchregister" and "SRLatch" in code. In the nMigen library, they are internally implemented with synchronous logic, although partly resembling the function of their asynchronous equivalents. Do you, by any chance, intend to replace them, in the ASIC, with real asynchronous latches? As depicted in several diagrams at https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ ? Regards, Cesar
(In reply to Cesar Strauss from comment #14)
> Dear Luke,
>
> I see uses of "latchregister" and "SRLatch" in code. In the nMigen library,
> they are internally implemented with synchronous logic, although partly
> resembling the function of their asynchronous equivalents.

yes. this was the best i could think of that is reasonably close, and at the same time is verifiable if we went with standard commercial tools. SR NAND most definitely is not verifiable, except with proprietary tools that are rented at USD 250,000 a week.

> Do you, by any chance, intend to replace them, in the ASIC, with real
> asynchronous latches?

for the CompUnits, probably not. the gate count (number of SRLatch instances) is not high enough to justify the risk.

> As depicted in several diagrams at
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ ?

yes. i kept these to make it possible to understand the relationship with what Mitch taught me. however, you have to appreciate that the 6600 diagrams worked by alternating low and high clock cycles. in particular, the Dependency Matrix cells very specifically activate "issue" on the HI cycle and clear GOREAD on (i believe) the LO cycle.

i decided that sticking closely to this as a very first chip might not be a good idea, although it is extremely cool. consequently, instead, i created SRLatch and then used a sync on resetting it as the "primary method" of responding to REQUEST signals. you will therefore see a GO signal be the reset driver on, for example, the read latch, as a *sync* not a comb. this leaves the SRLatch open for precisely one clock cycle, and the corresponding REQ, generated from the Q of the SRLatch instance, will then drop, which is exactly what is needed to capture the incoming data. (a sketch of this idiom is at the end of this comment.)

remember: GOREAD is only set (and set for only one cycle) when the Regfile Bus is available to transport data to this Unit. a standard design (wishbone) would use ready/valid signalling to do exactly the same thing; it is just different names.
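a sketch of that idiom (using nmutil.latch.SRLatch, as the experiment code does; the signal names are hypothetical):

    from nmutil.latch import SRLatch

    rd_l = SRLatch(sync=False, name="rd")
    m.submodules.rd_l = rd_l
    m.d.comb += rd_l.s.eq(issue_i)      # request raised at issue time
    m.d.sync += rd_l.r.eq(go_rd_i)      # *sync* reset: Q - and therefore
                                        # req - drops one cycle after GO
    m.d.comb += rd_req_o.eq(rd_l.q & busy_o)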
I'll now proceed to improve the unit test for LDSTCompUnit. I'll take into account your several ideas in comment #11.

Some of my own ideas are:

1) The test driver should only present data during a single cycle, coincident with the activation of "go". This detects data being latched at the wrong time.

2) Add some behavioral assertions. For instance, busy_o must rise if, and only if, issue_i was active on the previous clock. Likewise, "rel" signals must go low if, and only if, the corresponding "go" was active on the previous clock. And so on.

There could be a simulation "process" monitoring these kinds of conditions. Or, using nMigen assertions, and/or formal verification. I must find out more about this.

I still need to study the interface and functionality of the L0CacheBuffer.

Regards,

Cesar
(In reply to Cesar Strauss from comment #16)
> I'll now proceed to improve the unit test for LDSTCompUnit.
>
> I'll take into account your several ideas in comment #11.
>
> Some of my own ideas are:
>
> 1) The test driver should only present data during a single cycle,
> coincident with the activation of "go". This detects data being latched at
> the wrong time.

oh yeh, really good point.

> 2) Add some behavioral assertions.
> For instance, busy_o must rise if, and only if, issue_i was active on the
> previous clock.
> Likewise "rel" signals must go low if, and only if, the corresponding "go"
> was active on the previous clock.
> And so on.

i believe these kinds of things are what Formal Proofs are for - in particular, when you use Past() to assert things from previous cycles. (a sketch is at the end of this comment.)

> There could be a simulation "process" monitoring these kinds of conditions.
> Or, using nMigen assertions, and/or formal verification. I must find out
> more about this.

bug 316: look up the formal_bperm.py code. it's really well documented.

> I still need to study the interface and functionality of the L0CacheBuffer.

i tried two LDs in the score6600 code - they failed. the first place to check is the [prototype] L0CacheBuffer, with a twin-process.
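a sketch of the kind of Past()-based property being suggested, using the nmigen.asserts API in the style of formal_bperm.py (ad_rel_o is the convenience name from comment #3; ad_go_i is a hypothetical name for the corresponding go input):

    from nmigen.asserts import Assert, Past, Rose, Fell

    # busy_o may only rise in the cycle after issue_i was asserted
    # (the "only if" half; the converse needs a second Assert)
    with m.If(Rose(busy_o)):
        m.d.comb += Assert(Past(issue_i))

    # ad.rel may only drop in the cycle after its "go" was pulsed
    with m.If(Fell(ad_rel_o)):
        m.d.comb += Assert(Past(ad_go_i))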
Cesar, fyi: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96 very straightforward, no disruption at all, due to a hack of making MultiCompUnit look exactly like it was.
(In reply to Luke Kenneth Casson Leighton from comment #18) > Cesar, fyi: > > https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96 > > very straightforward, no disruption at all, due to a hack > of making MultiCompUnit look exactly like it was. Dear Luke, I see you just committed: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=1dbb3d17d051f73a20a148740be7b65c1a1778c1 I had the impression, maybe wrongly, that you had assigned this task to me. I had my own version mostly working. I will compare mine to yours, and make a review of yours. I guess I failed to explicitly accept the (supposed) assignment, and report my progress. Sorry about that. Lesson learned. Too bad about the duplicate effort. No worries. At least I had a good training on nMigen, RecordObject and the CompUnit API. On the way, I did manage to catch an obvious typo in your previous refactoring of compalu_multi.py. I took the liberty to commit it: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=d9724d6fed90959a78799c6e48667d31abd3a091 I'm going back now to improve the unit test. Again, no worries. Regards, Cesar
(In reply to Cesar Strauss from comment #19) > (In reply to Luke Kenneth Casson Leighton from comment #18) > > Cesar, fyi: > > > > https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96 > > > > very straightforward, no disruption at all, due to a hack > > of making MultiCompUnit look exactly like it was. > > Dear Luke, > > I see you just committed: > https://git.libre-soc.org/?p=soc.git;a=commitdiff; > h=1dbb3d17d051f73a20a148740be7b65c1a1778c1 yes. i meant to write here this morning, but got side-tracked on pipeline code-morphing. > I had the impression, maybe wrongly, that you had assigned this task to me. ah. sorry! if you had written explicitly "i will do this conversion tomorrow evening", i would have held off. > I had my own version mostly working. I will compare mine to yours, and make > a review of yours. appreciated. > I guess I failed to explicitly accept the (supposed) assignment, and report > my progress. Sorry about that. Lesson learned. by both of us. > Too bad about the duplicate effort. No worries. At least I had a good > training on nMigen, RecordObject and the CompUnit API. about the CompUnit API: if you look at CompUnitBase you can see where i'm going with it. https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/score6600_multi.py;h=4630368e474c9d7532dad3b00a3983435ba01c7b;hb=7b0b09aef28e4329c6b85d933b022112f0c15c50#l44 CompUnitBase is a mess. assigning individual signals from a batch of CompUnits to create a common data structure. i am hoping - planning - to find a way to merge the units into a common data structure. or, if not, at the very least a common *python* level API which allows convenient signal-connectivity without having a hundred lines of tediously repetitive signal-for-signal assignments. about RecordObject: RecordObject exists because whitequark wants Record to be bit-oriented, particularly when it comes to enumerating (using __iter__) through it. whereas what we need - to avoid insanely complex graphviz diagrams that are flat-out impossible to read - is *field*-orientated iteration and assignment. *field*-orientated iteration allows nmigen to generate code that, when you assign one RecordObject to another, does so by creating eq()s on the *members* of the RecordObject, not the *individual bits of every single field*. also, RecordObject looks pretty much exactly like a standard python class, and may be sub-classed (as you can see from LDSTCompUnitRecord) and extended in a standard python fashion. Record (because it takes an explicit layout argument) may *NOT* be extended except by direct manipulation of Record.fields, which can have consequences when people assume that the use of any given Record instance will have exactly the same structure. basically, RecordObject fits python developer expectations: Record does not. > On the way, I did manage to catch an obvious typo in your previous > refactoring of compalu_multi.py. I took the liberty to commit it: > > https://git.libre-soc.org/?p=soc.git;a=commitdiff; > h=d9724d6fed90959a78799c6e48667d31abd3a091 yes i saw - that would have been hard to track down: the unit test didn't catch it. really appreciated. > I'm going back now to improve the unit test. ok :)
cesar, just fyi (learning from past mistake...): to match what is in soc.fu.alu.alu_input_record, i've moved LDSTOpSubset to soc.fu.ldst.ldst_input_record. this should merge cleanly into what you are doing.
cesar, hi,

i've managed to get L0CacheBuffer semi-operational. it won't cope with mis-aligned LD/ST: the boundary (memory granularity) is 64-bit, and if you try to read 8 bytes starting 32 bits in, that'll fail. but... that's not the issue that needs looking at (because it's to be sorted for when we have the DualPortSplitter).

the issue that needs looking at is to confirm that compldst_multi.py's unit tests are properly functional: i'm just not confident that they are. can you take a look when you have a moment? it may just be the case that TstL0CacheBuffer needs to be initialised with the right parameters, to give it a 64-bit width, rather than the current (old) value of only a 16-bit underlying memory.
(In reply to Luke Kenneth Casson Leighton from comment #22) > can you take a look when you have a moment? it may just be the case > that TstL0CacheBuffer needs to be initialised with the right parameters, > to give it a 64-bit width rather than the current (old) values of only > a 16-bit underlying memory. Sure, I'll start later today. As it happens, tomorrow is a holiday. I'll have the full day free to work on the project.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> the issue that needs looking at is to confirm that compldst_multi.py's
> unit tests are properly functional: i'm just not confident that they
> are.

Have yet to see the traces, but found this by inspection, in load():

    591 yield dut.rd.go.eq(rd)
    592 yield from wait_for(dut.rd.rel)
    593 yield dut.rd.go.eq(0)

This is backwards, I think. We wait for the CompUnit to assert rel, then we send a one-clock pulse on the corresponding go.

I guess a solution is to simply wait on one rel signal at a time. Eventually I'll come back to write the parallel test for this.

Meanwhile I'll take a look at the traces, see if I can catch anything.
(In reply to Cesar Strauss from comment #24)
> Have yet to see the traces, but found this by inspection, in load():
>
>     591 yield dut.rd.go.eq(rd)
>     592 yield from wait_for(dut.rd.rel)
>     593 yield dut.rd.go.eq(0)
>
> This is backwards, I think.

it's simply broken, violating the protocol by potentially leaving rd.go HI for more than one cycle. it needs to be:

    yield from wait_for(dut.rd.rel)
    yield dut.rd.go.eq(rd)
    yield
    yield dut.rd.go.eq(0)

now, it *may* be the case that, by coincidence, there is one clock in wait_for. that would give the correct protocol behaviour by accident.

> We wait for the CompUnit to assert rel, then we
> send a one-clock pulse on the corresponding go.

yes.

> I guess a solution is to simply wait on one rel signal at a time.

except this would be a different type of test.

> Eventually I'll come back to write the parallel test for this.

that would be fantastic.

> Meanwhile I'll take a look at the traces, see if I can catch anything.

you can see i caught an error by wiring go signals directly to rel signals. this trick forces an immediate response, in parallel, and it exposed a number of corner cases in the LDST and ALU CompUnits. these turned out to be that the latch covering that particular go/req was receiving a set *and* a reset on the *same cycle*, and, with set taking priority, the latch was left in the "set" position where it should have been dropped to low. even L0CacheBuffer had this exact same bug. have a look at the last few git diffs.
In store():

    538 if imm_ok:
    539     yield dut.rd.go.eq(0b101)
    540 else:
    541     yield dut.rd.go.eq(0b111)
    542 yield from wait_for(dut.rd.rel)
    543 yield dut.rd.go.eq(0)

Another case of multi-cycle assertion of go.

This time, since there are multiple independent rel signals to wait for, a parallel test is the sensible way to go. I'll go ahead and implement it.

I intend to do it in a general way, that will apply whenever you need to test a rel/go port.

Basically, you tell the producer process what it needs to send, and it sends it whenever it receives a rel request, along with a go pulse.

In the same way, you tell the consumer process what it should expect, and assert that it is presented by the port, along with the rel request, until the process emits the go pulse.

In both cases, the rel->go delay will be parameterized, including a no-delay case. (A sketch of the consumer side follows.)
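A sketch of the consumer side of that idea (the helper name and arguments are hypothetical): it checks that the expected data is held stable for as long as rel is asserted, then emits the one-cycle go pulse after a parameterized delay (delay=0 approximating the immediate-response case):

    def go_consumer(dut, rel, go, data, expected, delay=0):
        # wait for the unit to raise its "rel" request
        while not (yield rel):
            yield
        # the data must be presented, and held, until go is pulsed
        for _ in range(delay):
            assert (yield data) == expected
            yield
        assert (yield data) == expected
        # one-cycle "go" pulse: the unit must then drop "rel"
        yield go.eq(1)
        yield
        yield go.eq(0)
        yield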
(In reply to Cesar Strauss from comment #26)
> Another case of multi-cycle assertion of go.

it works... it is, however, just a coincidence.

> This time, since there are multiple independent rel signals to wait for, a
> parallel test is the sensible way to go.
>
> I'll go ahead and implement it.

yes please. keep it as a separate test.

> I intend to do it in a general way, that will apply whenever you need to
> test a rel/go port.

oh good. it applies to ADR as well as ST, so yes: a base class that can be instantiated would be very handy.

> In both cases, the rel->go delay will be parameterized, including a no-delay
> case.

yes. like it.
(In reply to Cesar Strauss from comment #26) > I intend to do it in a general way, that will apply whenever you need to > test a rel/go port. > > Basically, you tell the producer process what it needs to send, and it sends > it whenever it receives a rel request, along with a go pulse. > > In the same way, you tell the consumer process what it should expect, and > assert that it is presented by the port, along with the rel request, until > the process emits the go pulse. > > In both cases, the rel->go delay will be parameterized, including a no-delay > case. Basically, what you already described in comment #11.
Cesar, apologies: yet another lovely combination for the unit tests :)

i've moved little-endian / big-endian handling from L0CacheBuffer into LDSTCompUnit. this is because it's not really the L0CacheBuffer's responsibility. we'll therefore need a series of LD/ST tests that check that byte-reversal operates correctly. note that it's *inverted*!
i just noticed also that CompLDSTOpSubset has all these fields:

    ('is_32bit', 1),
    ('is_signed', 1),
    ('data_len', 4),
    ('byte_reverse', 1),
    ('sign_extend', 1),
    ('update', 1)

update is done, data_len is respected, and byte_reverse is done: that leaves any changes needed for is_32bit, is_signed and sign_extend. i'd expect sign_extend in particular to also have to take is_32bit into consideration (a hypothetical sketch is below). i'm tempted to suggest moving this data-manipulation into a separate module, in order to keep things looking "clean" in the section covering "PortInterface Connections".
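for illustration, a hypothetical sketch (not committed code) of what that data-manipulation might look like: switch on data_len to find the sign bit, sign-extend via a signed slice, and let is_32bit cap the result:

    ld_ext = Signal(64)
    with m.Switch(data_len):
        for sz in (1, 2, 4):                    # bytes loaded
            with m.Case(sz):
                with m.If(sign_extend):
                    # signed slice: eq() then sign-extends to 64 bits
                    comb += ld_ext.eq(ld_data[:sz*8].as_signed())
                with m.Else():
                    comb += ld_ext.eq(ld_data[:sz*8])
        with m.Case():                          # 8 bytes: nothing to extend
            comb += ld_ext.eq(ld_data)
    # one plausible reading of is_32bit: cap the result at 32 bits
    comb += o_data.eq(Mux(is_32bit, ld_ext[:32], ld_ext))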