318 – fix LDSTCompUnit

Bug 318 - fix LDSTCompUnit

Summary: fix LDSTCompUnit

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Cesar Strauss

URL:

Depends on:
Blocks:	341 383
	Show dependency tree / graph

Reported:	2020-05-17 11:08 BST by Luke Kenneth Casson Leighton
Modified:	2023-03-19 11:35 GMT (History)
CC List:	2 users (show)

See Also:	216 350 393 1036
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
gtkwave save file for l0 cache/buffer (3.80 KB, application/vnd.gtkwave-gtkw) 2020-05-17 11:09 BST, Luke Kenneth Casson Leighton	Details
gtkwave save file for ldst comp unit (3.87 KB, application/vnd.gtkwave-gtkw) 2020-05-17 11:10 BST, Luke Kenneth Casson Leighton	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2020-05-17 11:08:54 BST

See section LDST Comp Unit
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

walk-through videos on how it is supposed to work

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst_multi.py;hb=HEAD

Comment 1 Luke Kenneth Casson Leighton 2020-05-17 11:09:49 BST

Created attachment 53 [details]
gtkwave save file for l0 cache/buffer

Comment 2 Luke Kenneth Casson Leighton 2020-05-17 11:10:21 BST

Created attachment 54 [details]
gtkwave save file for ldst comp unit

Comment 3 Luke Kenneth Casson Leighton 2020-05-17 11:31:24 BST

the "ultimate" success test of this is for experiment/score6600_multi.py
to not go permanently into "busy" mode.

the videos show how LDSTCompUnit is *supposed* to work... but doesn't
quite.

the issue on the current commit is that the address latching is not
occurring - not being transferred through correctly to mem_w_data -
for ST operations.

i *think* it is working fine on LD operations.

you can see in the gtkwave timing diagram that:

* cycle @ 50ns mem_w_en (and so mem_w_data) change
* however cycle @ 40ns *should* have set mem_w_addr to 0x6

this *actually* occurs 4 cycles later for the *second* ST operation!

if you then look at alu_o (further up), you can see it is:

* 0x0002 @ 20ns
* 0x0006 @ 30ns (the required, correct value)
* however ad__rel - the signal that says "address is valid"
  was *one cycle early*

(ad__rel in gtkwave is the name for the object ad.rel, and there is
a "convenience" variable in the python code ad_rel_o which points
to self.ad.rel)

it's nearly there.

notes:

* every req signal (rd.req[0], rd.req[1], rd.req[2], wr.req[0],
  wr.req[1], ad.req and st.req) has a corresponding "go" signal
  (rd.go[0] etc.)

  these are from the data structure created on-the-fly by
  "go_record"

* outgoing request ("req") signals may be held ASSERTed indefinitely until
  such time as their associated "go" signal is ASSERTed for one cycle

* on assertion of the corresponding "go" signal, the associated "req"
  signal **MUST** be de-asserted

this is basically wishbone etc. etc. etc. standard ready_o / valid_i
signalling conventions

in TestLDSTCompUnit, you can see the following line:

        m.d.comb += self.ad.go.eq(self.ad.rel) # link addr-go direct to rel

this is done deliberately because otherwise it becomes necessary to write
a split pair of co-routines in the unit test:

* one that watches ad.rel and sets ad.go for one cycle
* one that watches wr.rel[1] and sets wr.go[1] for one cycle

strictly speaking *all* {sig}.go signals could be made equal to all {sig}.rel
signals, combinatorially, and the code should function perfectly correctly.

Comment 4 Luke Kenneth Casson Leighton 2020-05-17 11:35:22 BST

see also http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006782.html

(with thanks to Cesar)

Comment 5 Luke Kenneth Casson Leighton 2020-05-17 11:51:00 BST

one important set of API conventions:

* L0CacheBuffer's operating conventions are that the address must be
  *EXTERNALLY* sustained and held, as must its associated addr_ok
  line.

* this is because otherwise L0CacheBuffer itself has to have an
  address latch

* strictly speaking, LDSTCompUnit does not need an actual latch
  for the address *at all*.

* the reason is because, as you can see from the diagram
  https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

  all of the information necessary to "fire" the adder and make
  the address "valid" is *already latched* and will *remain*
  latched until the busy signal is released.

  - op2 is latched
  - the immediate is latched

Comment 6 Luke Kenneth Casson Leighton 2020-05-17 12:02:06 BST

+++ b/src/soc/experiment/compldst_multi.py
@@ -477,7 +477,7 @@ class LDSTCompUnit(Elaboratable):
         comb += pi.op.eq(self.oper_i)    # op details (not all needed)
         # address
         comb += pi.addr.data.eq(addr_r)           # EA from adder
-        comb += pi.addr.ok.eq(self.ad.go) # "go do address stuff"
+        comb += pi.addr.ok.eq(alu_ok) # "go do address stuff"
         comb += self.addr_exc_o.eq(pi.addr_exc_o) # exception occurred
         comb += addr_ok.eq(self.pi.addr_ok_o)  # no exc, address fine
         # ld - ld gets latched in via lod_l

ok this "fixes" the address issue for ST.  alu_ok is the signal that
indicates the condition that the ALU address is ok to proceed.

however given that alu_valid and alu_ok are *sustained*, this causes
problems on the alu SR Latch which, if using *only* alu_ok to reset it,
would result in the alu_l being permanently held OFF

to fix that i am doing this:

        # alu latch.  use sync-delay between alu_ok and valid to generate pulse
        comb += alu_l.s.eq(reset_i)
        comb += alu_l.r.eq(alu_ok & ~alu_valid & ~rda_any)

alu_ok is one clock cycle behind alu_valid: therefore alu_ok & ~alu_valid
will generate the required one-clock pulse to indicate that the alu latch
can safely capture the address.

(which we noted previously we don't strictly need to do, but hey)

moving on from there, LD is currently bouncing around like mad.  the ld_ok
condition is being raised, which should end the FSM, however it does not.

Comment 7 Luke Kenneth Casson Leighton 2020-05-17 14:20:21 BST

On Sun, May 17, 2020 at 1:59 PM Cesar Strauss <cestrauss@gmail.com> wrote:

> I think I have enough information available, to proceed. I'm beginning
> already to have some understanding of the problem at hand.

there are *eight* possible input-combinations that the FSM has to
handle, based on:

* op: LD or ST
* mode: immediate or indexed address-calculation (termed "Effective Address")
* update: EA is to be outputted as a *SECOND* register yes/no

these 8 permutations make the FSM riiiiidiculouslly complicated and
there's really nothing that can be done about it: we need to be
compliant with POWER9 requirements.

* op LD/ST will cause the FSM to ignore operand3 if op == LD.

  this has to include *NOT* requesting operand3 (not setting rd.req[2])

  however because it is a LD, the LD data will be latched from the
  L0CacheBuffer port: thus when that happens, wr.req[0] MUST send out
  a request for the REGFILE write port.  when this is available,
  wr.go[0] will be activated for ONE cycle.

  after that one cycle, wr.req[0] MUST be dropped.

* if op == ST, operand 3 contains the value to be stored, so must be
  rd.req[2]-ested.

  however because it is a ST, there is no LD data to output, and therefore
  wr.req[0] must not be activated.

* immediate-mode will cause the FSM to use the immediate *instead* of
  operand2.

  therefore, in immediate-mode, operand2 must *NOT* be requested
  (rd.req[1] must never be set).

* update=TRUE will cause the FSM to activate wr.req[1] and wait for wr.go[1]
  when wr.go[1] is activated, this indicates to the FSM that the latched
  address (EA) in addr_o is being broadcast onto a REGFILE WRITE port.

  it will get *one* chance to do that (one cycle), and MUST drop wr.req[1]
  on the next cycle: its job is then "done"

  this is *independent* of the logic that manages / monitors the LD/ST
  Port (!)

  why? because the address computation (EA) is an independent parallel
  path from actual LD/ST

it's... just... ridiculously complicated.

at the very least if you can review it, and greatly improve the unit
tests, it would be fantastic.

as i said i *might* have got it working, 10 minutes ago, however the
complexity is so high i am not in the slightest bit confident of that,
nor of the correctness of the unit tests.

Comment 8 Cesar Strauss 2020-05-17 17:43:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #7)
> at the very least if you can review it, and greatly improve the unit
> tests, it would be fantastic.
> 
> as i said i *might* have got it working, 10 minutes ago, however the
> complexity is so high i am not in the slightest bit confident of that,
> nor of the correctness of the unit tests.

Sure, I'll have a look into it.

I'm glad you were able to advance. I was worried this issue might have got you blocked, for two days already, and turned itself into a time sink.

Regards,

Cesar

Comment 9 Luke Kenneth Casson Leighton 2020-05-17 18:06:15 BST

(In reply to Cesar Strauss from comment #8)

> Sure, I'll have a look into it.

appreciated.

> I'm glad you were able to advance. I was worried this issue might have got
> you blocked, for two days already,

about 20 so far.  in addition to around 20 days analysis prior to that.
and about a month last year on an earlier version.

Comment 10 Cesar Strauss 2020-05-18 00:01:04 BST

I got my development environment running, by consulting https://libre-soc.org/HDL_workflow/.

Then, I ran the testbench at experiment/compldst_multi.py, generated the timing diagram, and confirmed in GTKWave that the problem related in comment #3 does not occur any longer in the latest git master.

I started reading the reference material, in order to get a basic knowledge of the functionalities and interfaces.

Then, I followed the nMigen tutorial at https://github.com/RobertBaruch/nmigen-tutorial, while reading the code of the LD/ST Comp Unit itself, and as a result I got a reasonable understanding of both.

I'll now go back to the reference material a little more. Once I get a better understanding, I'll return to the code and proceed with reviewing it and improving the unit tests. I have formed some thoughts about this already.

I'll ask if I need further clarifications, and report my progress.

Regards,

Cesar

Comment 11 Luke Kenneth Casson Leighton 2020-05-18 00:31:24 BST

(In reply to Cesar Strauss from comment #10)
> I got my development environment running, by consulting
> https://libre-soc.org/HDL_workflow/.
> 
> Then, I ran the testbench at experiment/compldst_multi.py, generated the
> timing diagram, and confirmed in GTKWave that the problem related in comment
> #3 does not occur any longer in the latest git master.
> 
> I started reading the reference material, in order to get a basic knowledge
> of the functionalities and interfaces.
> 
> Then, I followed the nMigen tutorial at
> https://github.com/RobertBaruch/nmigen-tutorial, while reading the code of
> the LD/ST Comp Unit itself, and as a result I got a reasonable understanding
> of both.

excellent.  similar to https://bugs.libre-soc.org/show_bug.cgi?id=316
if you'd like we can find a practical and useful example that's an awful
lot less complicated than this one, by way of getting started.


> I'll now go back to the reference material a little more. Once I get a
> better understanding, I'll return to the code and proceed with reviewing it
> and improving the unit tests. I have formed some thoughts about this already.

me too.  please, when you have a moment, do describe them.  here's a couple:


* we do need a way to test dual ports (two CompLDSTs with
  two ports, 4-4 as well) which write to the same memory

  this will test whether the (prototype) L0CacheBuffer is doing
  its job (which i am not certain that it does.  it handles _one_
  LD or _one_ ST correctly)

* we need some tests to be delayed-response (to each of the rd and wr
  *and* go.ad *and* go.st) and some to be immediate-response

these can be done using multiple "processes".  you notice in the
simulation that a function is passed in: it is possible to pass in
*multiple* such functions.  hmm hmm where's an example of that...

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_inout_mux_pipe.py;hb=HEAD

what that does is: set up a pipeline with a 4-input, 4-output "muxer"
at the input and output.  a context (index) is passed down the pipeline,
and used to associate the "result" back with its corresponding output.

so you can see in InputTest, there are *eight* separate and distinct
simultaneously-operating "processes": 4 do sending, 4 do receiving.

they all operate on the same (global) python data structure, which gives
a way for the python unit test to actually check that they're doing the
right thing.

a) that no data is lost
b) that data sent from input N is *received* at output N
c) that the order of any data sent from input N to output N is preserved
   regardless of the interleaving order.

so in a similar fashion, a class could be designed where one function
takes care of rd.req[N] and rd.go[N] acknowledgement (for N=0..2),
and likewise another function responds to ad.req, another to wr.req[0..1]
and so on, where because those "processes" are all members of the same
class, they have access to the full "state" of all other "processes" and
consequently can verify and report that the request-signals are coming
in correctly (or at all), and in the right order.

in addition, under the control of the caller (the unit test), those
individual processes could be told "respond immediately to the signal"
or "response with an arbitrary delay" extremely easily: a simple for-loop
with a "yield" in it - as you can see at line 101 in test_inout_mux_pipe.py
will create that type of behaviour.


beyond that: in a similar way to InputTest, setting up multiple CompLDSTUnits,
and getting them to hammer the same L0CacheBuffer via multiple ports, is also a
very important test, as it is the way to ensure that multiple LD/ST operations
can operate in parallel and not corrupt memory (or be lost).

beyond *that*, at some point we need to also arbitrarily test raising an
"exception" flag, which will result in the "shadow" condition being pulled,
and that should result in the complete cancellation of the operation.


> I'll ask if I need further clarifications, and report my progress.

appreciated.

Comment 12 Cesar Strauss 2020-05-18 17:26:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #11)
> excellent.  similar to https://bugs.libre-soc.org/show_bug.cgi?id=316
> if you'd like we can find a practical and useful example that's an awful
> lot less complicated than this one, by way of getting started.

No need, it's fine. I think I got the hang of nMigen now. I got over the initial shock some time ago, while following this project. Verilog is my HDL of choice, but the fundamental concepts are the same.

Comment 13 Luke Kenneth Casson Leighton 2020-05-18 17:38:00 BST

(In reply to Cesar Strauss from comment #12)

> No need, it's fine. I think I got the hang of nMigen now. I got over the
> initial shock some time ago, while following this project.

very funny :)

> Verilog is my HDL
> of choice, but the fundamental concepts are the same.

yes. everything is there.

Comment 14 Cesar Strauss 2020-05-18 17:56:22 BST

Dear Luke,

I see uses of "latchregister" and "SRLatch" in code. In the nMigen library, they are internally implemented with synchronous logic, although partly resembling the function of their asynchronous equivalents.

Do you, by any chance, intend to replace them, in the ASIC, with real asynchronous latches? As depicted in several diagrams at https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ ?

Regards,
Cesar

Comment 15 Luke Kenneth Casson Leighton 2020-05-18 18:19:45 BST

(In reply to Cesar Strauss from comment #14)
> Dear Luke,
> 
> I see uses of "latchregister" and "SRLatch" in code. In the nMigen library,
> they are internally implemented with synchronous logic, although partly
> resembling the function of their asynchronous equivalents.

yes.  this was the best i could think of that is reasonably close and at the same time is verifiable if we went with standard commercial tools.

SR NAND most definitely is not verifiable except with proprietary tools that are rented at USD 250,000 a week.

> 
> Do you, by any chance, intend to replace them, in the ASIC, with real
> asynchronous latches?

for the CompUnits, probably not.  the gate count (number of SRLatch instances) is not high enough to justify the risk.

> As depicted in several diagrams at
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ ?

yes.  i kept these to make it possible to understand the relationship with what Mitch taught me.

however you have to appreciate that the 6600 diagrams worked by alternating low and high clock cycles.

in particular, the Dependency Matrix cells very specifically activate "issue" on the HI cycle and clear GOREAD on i believe the LO cycle.

i decided that sticking closely to this as a very first chip might not be a good idea, although it is extremely cool.

consequently instead i created SRLatch and then used a sync on resetting it as the "primary method" of responding to REQUEST signals.

you will therefore see a GO signal be the reset driver on for example the read latch, as a *sync* not a comb.

this leaves the SRLatch latch open for precisely one clock cycle, and the corresponding REQ, generated from the Q of the SRlatch instance, will then drop.  which is exactly what is needed to capture the incoming data.

remember, GOREAD is only set (and set for only one cycle) when the Regfile Bus is svailable to transport data to this Unit.

a standard design (wishbone) would use ready/valid signalling to do exactly the same thing, it is just different names.

Comment 16 Cesar Strauss 2020-05-20 04:09:42 BST

I'll now proceed to improve the unit test for LDSTCompUnit.

I'll take into account your several ideas in comment #11.

Some of my own ideas are:

1) The test driver should only present data during a single cycle, coincident with the activation of "go". This detects data being latched at the wrong time.

2) Add some behavioral assertions.
For instance, busy_o must rise if, and only if, issue_i was active on the previous clock.
Likewise "rel" signals must go low if, and only if, the corresponding "go" was active on the previous clock.
And so on.
There could be a simulation "process" monitoring these kind of conditions. Or, using nMigen assertions, and/or formal verification. I must find out more about this.

I still need to study the interface and functionality of the L0CacheBuffer.

Regards,
Cesar

Comment 17 Luke Kenneth Casson Leighton 2020-05-20 04:17:03 BST

(In reply to Cesar Strauss from comment #16)
> I'll now proceed to improve the unit test for LDSTCompUnit.
> 
> I'll take into account your several ideas in comment #11.
> 
> Some of my own ideas are:
> 
> 1) The test driver should only present data during a single cycle,
> coincident with the activation of "go". This detects data being latched at
> the wrong time.

oh yeh, really good point.

> 
> 2) Add some behavioral assertions.
> For instance, busy_o must rise if, and only if, issue_i was active on the
> previous clock.
> Likewise "rel" signals must go low if, and only if, the corresponding "go"
> was active on the previous clock.
> And so on.

i believe these kinds of things are what Formal Proofs are for.  in particular,
when you use Past() to assert things from previous cycles.

> There could be a simulation "process" monitoring these kind of conditions.
> Or, using nMigen assertions, and/or formal verification. I must find out
> more about this.

bug 316, look up the formal_bperm.py code.  it's really well documented.

> I still need to study the interface and functionality of the L0CacheBuffer.

i tried two LDs in the score6600 code - they failed.  the first place to check
is the [prototype] L0CacheBuffer, with a twin-process.

Comment 18 Luke Kenneth Casson Leighton 2020-05-20 15:40:05 BST

Cesar, fyi:

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96

very straightforward, no disruption at all, due to a hack
of making MultiCompUnit look exactly like it was.

Comment 19 Cesar Strauss 2020-05-21 13:05:32 BST

(In reply to Luke Kenneth Casson Leighton from comment #18)
> Cesar, fyi:
> 
> https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96
> 
> very straightforward, no disruption at all, due to a hack
> of making MultiCompUnit look exactly like it was.

Dear Luke,

I see you just committed:
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=1dbb3d17d051f73a20a148740be7b65c1a1778c1

I had the impression, maybe wrongly, that you had assigned this task to me.

I had my own version mostly working. I will compare mine to yours, and make a review of yours.

I guess I failed to explicitly accept the (supposed) assignment, and report my progress. Sorry about that. Lesson learned.

Too bad about the duplicate effort. No worries. At least I had a good training on nMigen, RecordObject and the CompUnit API.

On the way, I did manage to catch an obvious typo in your previous refactoring of compalu_multi.py. I took the liberty to commit it:

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=d9724d6fed90959a78799c6e48667d31abd3a091 

I'm going back now to improve the unit test.

Again, no worries.

Regards,

Cesar

Comment 20 Luke Kenneth Casson Leighton 2020-05-21 13:58:12 BST

(In reply to Cesar Strauss from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > Cesar, fyi:
> > 
> > https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=c501606b20c74050fb0d422ff886a7d9fb2a6a96
> > 
> > very straightforward, no disruption at all, due to a hack
> > of making MultiCompUnit look exactly like it was.
> 
> Dear Luke,
> 
> I see you just committed:
> https://git.libre-soc.org/?p=soc.git;a=commitdiff;
> h=1dbb3d17d051f73a20a148740be7b65c1a1778c1

yes.  i meant to write here this morning, but got side-tracked on pipeline
code-morphing.

> I had the impression, maybe wrongly, that you had assigned this task to me.

ah.  sorry!  if you had written explicitly "i will do this conversion tomorrow 
evening", i would have held off.

> I had my own version mostly working. I will compare mine to yours, and make
> a review of yours.

appreciated.

> I guess I failed to explicitly accept the (supposed) assignment, and report
> my progress. Sorry about that. Lesson learned.

by both of us.

> Too bad about the duplicate effort. No worries. At least I had a good
> training on nMigen, RecordObject and the CompUnit API.

about the CompUnit API: if you look at CompUnitBase you can see where i'm
going with it.
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/score6600_multi.py;h=4630368e474c9d7532dad3b00a3983435ba01c7b;hb=7b0b09aef28e4329c6b85d933b022112f0c15c50#l44

CompUnitBase is a mess.  assigning individual signals from a batch of CompUnits
to create a common data structure.   i am hoping - planning - to find a way
to merge the units into a common data structure.  or, if not, at the very
least a common *python* level API which allows convenient signal-connectivity
without having a hundred lines of tediously repetitive signal-for-signal 
assignments.



about RecordObject: RecordObject exists because whitequark wants Record to be
bit-oriented, particularly when it comes to enumerating (using __iter__) through it.

whereas what we need - to avoid insanely complex graphviz diagrams that are
flat-out impossible to read - is *field*-orientated iteration and assignment.

*field*-orientated iteration allows nmigen to generate code that, when you
assign one RecordObject to another, does so by creating eq()s on the *members*
of the RecordObject, not the *individual bits of every single field*.

also, RecordObject looks pretty much exactly like a standard python class,
and may be sub-classed (as you can see from LDSTCompUnitRecord) and extended
in a standard python fashion.

Record (because it takes an explicit layout argument) may *NOT* be extended
except by direct manipulation of Record.fields, which can have consequences
when people assume that the use of any given Record instance will have exactly
the same structure.

basically, RecordObject fits python developer expectations: Record does not.

 
> On the way, I did manage to catch an obvious typo in your previous
> refactoring of compalu_multi.py. I took the liberty to commit it:
> 
> https://git.libre-soc.org/?p=soc.git;a=commitdiff;
> h=d9724d6fed90959a78799c6e48667d31abd3a091 

yes i saw - that would have been hard to track down: the unit test didn't
catch it.  really appreciated.
 
> I'm going back now to improve the unit test.

ok :)

Comment 21 Luke Kenneth Casson Leighton 2020-05-21 18:55:55 BST

cesar just fyi (learning from past mistake...) to match what is in soc.fu.alu.alu_input_record i've moved LDSTOpSubset to soc.fu.ldst.ldst_input_record
this should merge cleanly into what you are doing.

Comment 22 Luke Kenneth Casson Leighton 2020-06-10 17:44:00 BST

cesar, hi,

i've managed to get L0CacheBuffer semi-operational: it won't cope
with mis-aligned LD/ST.  the boundary (memory granularity) is 64-bit,
and if you try to read 8 bytes that are starting 32-bits in, that'll
fail.

but... that's not the issue that needs looking at (because it's to
be sorted for when we have the DualPortSplitter).

the issue that needs looking at is to confirm that compldst_multi.py's
unit tests are properly functional: i'm just not confident that they
are.

can you take a look when you have a moment?  it may just be the case
that TstL0CacheBuffer needs to be initialised with the right parameters,
to give it a 64-bit width rather than the current (old) values of only
a 16-bit underlying memory.

Comment 23 Cesar Strauss 2020-06-10 18:20:21 BST

(In reply to Luke Kenneth Casson Leighton from comment #22)
> can you take a look when you have a moment?  it may just be the case
> that TstL0CacheBuffer needs to be initialised with the right parameters,
> to give it a 64-bit width rather than the current (old) values of only
> a 16-bit underlying memory.

Sure, I'll start later today.
As it happens, tomorrow is a holiday.
I'll have the full day free to work on the project.

Comment 24 Cesar Strauss 2020-06-11 23:58:45 BST

(In reply to Luke Kenneth Casson Leighton from comment #22)
> the issue that needs looking at is to confirm that compldst_multi.py's
> unit tests are properly functional: i'm just not confident that they
> are.

Have yet to see the traces, but found this by inspection:

in load():

591        yield dut.rd.go.eq(rd)
592        yield from wait_for(dut.rd.rel)
593        yield dut.rd.go.eq(0)

This is backwards, I think. We wait for the CompUnit to assert rel, then we send a one-clock pulse on the corresponding go.

I guess a solution is to simply wait on one rel signal at a time.

Eventually I'll come back to write the parallel test for this.

Meanwhile I'll take a look at the traces, see if I can catch anything.

Comment 25 Luke Kenneth Casson Leighton 2020-06-12 00:22:31 BST

(In reply to Cesar Strauss from comment #24)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > the issue that needs looking at is to confirm that compldst_multi.py's
> > unit tests are properly functional: i'm just not confident that they
> > are.
> 
> Have yet to see the traces, but found this by inspection:
> 
> in load():
> 
> 591        yield dut.rd.go.eq(rd)
> 592        yield from wait_for(dut.rd.rel)
> 593        yield dut.rd.go.eq(0)
> 
> This is backwards, I think.

it's simply broken, violating the protocol by potentially leaving the rd.go HI for more than one cycle.  it needs to be

yield from waitfor(dut.rd.rel)
yield rd.go.eq(rd)
yield
yield rd.go.eq(0)

now it *may* be the case that by coincidence there is one clock in waitfor.  that would give the correct ptotocol behaviour by accident.

> We wait for the CompUnit to assert rel, then we
> send a one-clock pulse on the corresponding go.

yes.

> 
> I guess a solution is to simply wait on one rel signal at a time.


except this would be a different type of test.


> 
> Eventually I'll come back to write the parallel test for this.

that would be fantastic.

> Meanwhile I'll take a look at the traces, see if I can catch anything.

you can see i caught an error where go signals are wired directly to rel signals.

this trick will force an immediate response, in parallel, and there were a number of corner cases in LDST and ALU CompUnits.

these turned out to be that the latch covering that particular go/req was receiving a set *and* reset on the *same cycle*, and, with set taking priority, the latch was left in the "set" position where it should have been dropped to low.

even L0CacheBuffer had this exact same bug.  have a look at the last few git diffs.

Comment 26 Cesar Strauss 2020-06-12 11:58:30 BST

In store():

538    if imm_ok:
539        yield dut.rd.go.eq(0b101)
540    else:
541        yield dut.rd.go.eq(0b111)
542    yield from wait_for(dut.rd.rel)
543    yield dut.rd.go.eq(0)

Another case of multi-cycle assertion of go.

This time, since there are multiple independent rel signals to wait for, a parallel test is the sensible way to go.

I'll go ahead and implement it.

I intend to do it in a general way, that will apply whenever you need to test a rel/go port.

Basically, you tell the producer process what it needs to send, and it sends it whenever it receives a rel request, along with a go pulse.

In the same way, you tell the consumer process what it should expect, and assert that it is presented by the port, along with the rel request, until the process emits the go pulse.

In both cases, the rel->go delay will be parameterized, including a no-delay case.

Comment 27 Luke Kenneth Casson Leighton 2020-06-12 12:13:13 BST

(In reply to Cesar Strauss from comment #26)
> In store():
> 
> 538    if imm_ok:
> 539        yield dut.rd.go.eq(0b101)
> 540    else:
> 541        yield dut.rd.go.eq(0b111)
> 542    yield from wait_for(dut.rd.rel)
> 543    yield dut.rd.go.eq(0)
> 
> Another case of multi-cycle assertion of go.

it works... it is however just a coincidence.

> This time, since there are multiple independent rel signals to wait for, a
> parallel test is the sensible way to go.
> 
> I'll go ahead and implement it.

yes please.  keep it as a separate test.


> 
> I intend to do it in a general way, that will apply whenever you need to
> test a rel/go port.

oh good.  it applies to ADR as well as ST so yes.

a base class that can be instantiated, this would be very handy.


> In both cases, the rel->go delay will be parameterized, including a no-delay
> case.

yes.  like it.

Comment 28 Cesar Strauss 2020-06-12 12:15:25 BST

(In reply to Cesar Strauss from comment #26)
> I intend to do it in a general way, that will apply whenever you need to
> test a rel/go port.
> 
> Basically, you tell the producer process what it needs to send, and it sends
> it whenever it receives a rel request, along with a go pulse.
> 
> In the same way, you tell the consumer process what it should expect, and
> assert that it is presented by the port, along with the rel request, until
> the process emits the go pulse.
> 
> In both cases, the rel->go delay will be parameterized, including a no-delay
> case.

Basically, what you already described in comment #11.

Comment 29 Luke Kenneth Casson Leighton 2020-06-22 11:57:20 BST

Cesar, apologies, yet another lovely combination for the unit tests :)
i've moved little-endian / big-endian from L0CacheBuffer into LDSTCompUnit.
this because it's not really the L0CacheBuffer's responsibility.

we'll therefore need a series of LD/ST tests that check that byte-reversal
operates correctly.  note that it's *inverted*!

Comment 30 Luke Kenneth Casson Leighton 2020-06-22 12:35:32 BST

i just noticed also that CompLDSTOpSubset has all these fields:

                  ('is_32bit', 1),
                  ('is_signed', 1),
                  ('data_len', 4),
                  ('byte_reverse', 1),
                  ('sign_extend', 1),
                  ('update', 1))

update is done, data_len is respected, byte_reverse is done: that leaves
any changes needed from is_32bit, is_signed and sign_extend.  i'd expect
sign_extend in particular to also have to take is_32bit into consideration.

i'm tempted to suggest moving this data-manipulation into a separate module
in order to keep things looking "clean" in the section covering
"PortInterface Connections"