Bug 154 - Cell for Dependency Matrices is needed
Summary: Cell for Dependency Matrices is needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Cell Libraries (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-01-11 17:23 GMT by Luke Kenneth Casson Leighton
Modified: 2020-04-03 11:02 BST (History)
4 users (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-01-11 17:23:58 GMT
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-January/003324.html

the dependency matrices of a 6600 were designed using an SR NOR latch which is only 2 gates, and is extremely efficient and effective.

careful design does not require resets because they are inherent and per row).

SR NOR is unfortunately "unstable" and is no longer part of proprietary toolchains.

the huge numbers and regularity of the Dependency Matrices justifies creating a special cell.

unfortunately SR NOR is not used any more so we have to look at SR NAND.
Comment 1 Luke Kenneth Casson Leighton 2020-01-12 20:42:03 GMT
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-January/003328.html

from staf:

> Do you have reference to code for these dependency matrices ?
> Would be good if we could play around
> ourselves and come up with proposals.

yes, the code is here:
https://git.libre-riscv.org/?p=soc.git;a=tree;f=src/scoreboard

to provide some explanation:

* there are currently three separate dependency matrices: register-functionunit (FU), FU-FU, and FU-Memory.
* SRLatch is the basic unit.
* unfortunately nmigen is (was?) so slow (simulation wise) i had to make SRLatch a multi input multi output design
* therefore SRLatch takes a width parameter indicating *how many* SRLatches are to be created
* therefore this same parameter is *also* passed to DependencyRow.
* FURegDepMatrix is where you get an actual 2D matrix.

early designs had SRLatch as a single entity, DependencyRow created N such entities, FURegDepMatrix created M DependencyRows.

this was so painfully slow it was critical to optimise, hence the strange design of SRLatch.

changes to an SRNand rather than an SRNor may require investigation of inversion of the inputs (outputs?). i will be able to focus on this from the 23rd.
Comment 2 Luke Kenneth Casson Leighton 2020-01-12 23:34:23 GMT
https://www.youtube.com/watch?v=kt8d3CYWGH4&app=desktop

here is a comparison between the truth tables of SR NOR and SR NAND

SR NOR

S R Q
0 0 Latch
0 1 0
1 0 1
1 1 0

SR NAND

S R Q
0 0 *INVALID STATE*
0 1 1
1 0 0
1 1 Latch

as i suspected there is an inversion of both the output and of the inputs.  in many cases both the Q and ~Q are needed, so the negation of output is not costing gates.

however the inversion of the inputs, and in particular the instability, will both need investigating.

of course normally a full DFF would be deployed to prevent that instability however that is a cost of 10 gates vs 2 which is where the estimated 5 fold increase from 50,000 to 250,000 gates just for one of the DMs comes from.

if we have to do that, we do that. it is not a huge "dealbreaker"

ah.  i just went 13 mins into that video and it explains that if SR NOR changes in the same clock from SR=1,1 to SR=0,0 the output is INDETERMINATE.

this we know from the fact that 6600 DMs work fine was catered for in its design.  i am still investigating.

Bartek: this is why the Formal Proofs are so important.
Comment 3 Luke Kenneth Casson Leighton 2020-01-13 01:27:55 GMT
staf i apologise if this is wellkniwb to you already, i went through the two truth tables and it seems that SRNand is just the logical inverted equivalent of SRNor, *including* the so-called "instability".

when SRNAND S,R=0,0 i see that the output is *not* unstable, it is 1,1

however if there is a simultaneous transition to 1,1 *then* the output is indeterminate.

this is exactly the same as for SRNor except all input and output inverted from the above paragraphs.

so i see no reason at all why SRNAND should not be used, and the simplest and quickest way to check that would be to flip Q and ~Q as outputs and put inverters on SRLatch's inputs:

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/latch.py;h=84235ffada66bb4b69249e585970e5097e939277;hb=a60f0c986284e28b204c1057ba9a501353683449#l33

_s and _r can be created which will be set to ~s and ~r.

oh, ah... you can probably see straight away from that code, it is *not* an SRNOR (it should be).  it is a priority-set SR latch which you can see at line 47:

m.d.sync += q_int.eq((q_int & ~self.r) | self.s)

also... ah :)  one thing that turned out to be very important: hum.  a register mechanism is built-in to that code.

this you can see from the fact that line 47 is a *sync* not a comb.

it turns out that these built-in registers help avoid having to sprinkle them consistently across the entire set of DM Cells.
Comment 4 Staf Verhaegen 2020-01-13 11:15:59 GMT
Some of my comments:
- A SR latch could be easily added to the standard cell library and I will do that. I'm not sure though current nmigen + yosys will actually infer SR latches.
- If one wants to use SR latches nmigen is not the right language. nmigen is based on normal latches with just a reset signal which is only an init value. I do think it is better to rephrase the code as logic with one sync statement of one wants to keep on using nmigen for the scoreboards. Certainly on nmigen level one should not try to take into account the undefined behaviour of a latch when both inputs are 1 or 0. That should be taken care of behind the scene by nmigen and yosys.
- Efficiency of simulation should not be the guidance on how to implement things. Better to switch to cocotb + verilator or other solution for simulation then.
Comment 5 Staf Verhaegen 2020-01-13 11:32:48 GMT
Where do I find nmutil and the SRLatch definition ?
Comment 6 Staf Verhaegen 2020-01-13 13:46:31 GMT
(In reply to Staf Verhaegen from comment #5)
> Where do I find nmutil and the SRLatch definition ?

Did find ieee754fpu.
Comment 7 Staf Verhaegen 2020-01-13 15:35:58 GMT
(comment out of date: see comment #8)

(edit: no problem.  i see you found the edit capability that bugzilla doesn' normally have :) )
Comment 8 Staf Verhaegen 2020-01-13 16:15:12 GMT
(Sorry accidently commit previous comment. Redo comment)

After reading a little further I think that the architecture was developed at a moment where not the current synthesis, place-and-route and timing check tools were available. With current tools I think it is better to implement it all on the DependencyMatrix level without sublevels. Incomplete and non-checked code but I hope to get the idea across:

    class DependencyMatrix(Elaboratable):
        def __init__(self, n_fus, n_regs):
             self.fu_start = Signal(n_fus)
             self.fu_read_in = Array(Signal(n_regs) for _ in range(n_fus))
             self.fu_write_in = Array(Signal(n_regs) for _ in range(n_fus))
             self.fu_read_ok = Signal(n_fus)
             self.fu_read_done = Signal(n_fus)
             self.reg_read_reserve = Signal(n_refs)
             self.fu_write_ok = Signal(n_fus)
             self.fu_write_done = Signal(n_fus)
             self.reg_write_reserve = Signal(n_refs)

             self._n_fus = n_fus
             self._n_regs = n_regs

        def elaborate(self, platform):
            n_fus = self._n_fus
            n_regs = self._n_regs

            m = Module()

            matrix_read_regs = Array(Signal(n_regs) for _ in range(n_fus))
            matrix_write_regs = Array(Signal(n_regs) for _ in range(n_fus))
            matrix_read_fus = Array(
                Cat(matrix_read_regs[i][j] for i in range(n_fus))
                for j in range(n_regs)
            )
            matrix_write_fus = Array(
                Cat(matrix_write_regs[i][j] for i in range(n_fus))
                for j in range(n_regs)
            )

            m.d.comb += [
                 self.reg_read_reserve.eq(Cat(matrix_read_fus[i] != 0) for i in range(n_regs)),
                 self.reg_write_reserve.eq(Cat(matrix_write_fus[i] != 0) for i in range(n_regs)),
            ]

            for i in range(n_fus):
                 needed = matrix_read_regs[i]
                 not_reserved_read = self.reg_read_reserve & ~matrix_read_refs[i]
                 m.d.comb += self.fu_read_ok.eq(~(needed | not_reserved_read) == 0)

                 needed = matrix_write_regs[i]
                 not_reserved_write = self.reg_write_reserve & ~matrix_write_refs[i]
                 m.d.comb += self.fu_write_ok.eq(~(needed | (not_reserved_read & not_reserved_write)) == 0)

                 with m.If(self.fu_start[i]):
                      m.d.sync += [
                          matrix_read_regs[i].eq(self.fu_read_in[i]),
                          matrix_write_regs[i].eq(self.fu_write_in[i]),
                      ]
                 with m.Else():
                      with m.If(self.fu_read_done[i]):
                          m.d.sync += matrix_read_regs[i].eq(0)
                      with m.If(self.fu_write_done[i]):
                          m.d.sync += matrix_write_regs[i].eq(0)

            ...

            return m
Comment 9 Luke Kenneth Casson Leighton 2020-01-13 17:57:48 GMT
(In reply to Staf Verhaegen from comment #4)
> Some of my comments:
> - A SR latch could be easily added to the standard cell library and I will
> do that. I'm not sure though current nmigen + yosys will actually infer SR
> latches.

yosys is supposed to.  very quick search

https://www.reddit.com/r/yosys/comments/8014ji/sr_latches_flipflops/?utm_source=amp&utm_medium=&utm_content=post_num_comments

yosys has $sr and others.  (that author was not able to get advice on how to use them, i will find others)

we can ask whitequark to add support for them or even add them ourselves 

> - If one wants to use SR latches nmigen is not the right language. nmigen is
> based on normal latches with just a reset signal which is only an init
> value. I do think it is better to rephrase the code as logic with one sync
> statement of one wants to keep on using nmigen for the scoreboards.

there are design reasons why the twin output (combination of the sync and comb) are needed.

if the code is turned into sync only, it will require a near total redesign of the Dependency Matrix logic, which took several months of 100% focus to understand and implement, with almost daily help and input from Mitch Alsup.

also it will adversely affect the performance, introducing unnecessary gate delays.  these additional syncs would cause pipelined data and control to lose sync, hence the implications of a near-total redesign.

even if done, this will, i believe, result in a DFF which will be 10 gates instead of 2, and the 128 x 128 Register Dependency Matrix will have an additional 200,000 gates added.

clearly this is something we would like to avoid :)

however if impractical, we accept it.

> Certainly on nmigen level one should not try to take into account the
> undefined behaviour of a latch when both inputs are 1 or 0. That should be
> taken care of behind the scene by nmigen and yosys.

the analysis i did after finding that initial "state" diagram online shows thar the "UNSTABLE" assessment is wrong (more accurately: a huge misleading simplification).

both SRNOR *and* SRNAND are entirely stable: they both however if you try to transition simultaneously from 1s to 0s on the inputs (or 0s to 1s for NAND) *then* the output is "indeterminate".

in simulations, a simple "exception" or even setting the output to verilog "X" to represent "uninitialised" would seem the logical thing to do here.

and we know that the 6600 avoided this situation with some extremely good design.


> - Efficiency of simulation should not be the guidance on how to implement
> things. 

good advice.

> Better to switch to cocotb + verilator or other solution for
> simulation then.

i needed to get up and running in a fast iterative loop.  i like cocotb.  immanuel is working on nmigen verilator integration.  we have options coming up.
Comment 10 Luke Kenneth Casson Leighton 2020-01-13 18:00:04 GMT
(In reply to Staf Verhaegen from comment #5)
> Where do I find nmutil and the SRLatch definition ?

i think you found it?

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/latch.py;h=84235ffada66bb4b69249e585970e5097e939277;hb=a60f0c986284e28b204c1057ba9a501353683449#l33

nmutil really does have to move to its own repo.  hasn't been done yet.
Comment 11 Luke Kenneth Casson Leighton 2020-01-13 18:24:01 GMT
(In reply to Staf Verhaegen from comment #8)
> (Sorry accidently commit previous comment. Redo comment)
> 
> After reading a little further I think that the architecture was developed
> at a moment where not the current synthesis, place-and-route and timing
> check tools were available. 

yes. i need to get back to Taiwan and update everything. so much has happened even in 8 months.

> With current tools I think it is better to
> implement it all on the DependencyMatrix level without sublevels. Incomplete
> and non-checked code but I hope to get the idea across:

if i can summarise: DependencyRow is gone, and merged into DependencyMatrix?

assuming yes: my concern with that was for P&R - we will not be doing full automatic P&R in coriolis2 - was to do isolated *small* blocks (cells) that *may* be done automated (or manual), and are small enough to visually inspect without being completely overwhelmed by the sight of the resulting GDS image (or intermediate output before that)

a 2 dimensional matrix is not only extremely unlikely for the autorouter to get right, it is much more complex to design the manual layout.

(as you will no doubt know, i mention this for the benefit of other readers, staf, coriolis2 layouts are *python* programs!)

so the justification for doing 1D of 1D objects was so that the person doing the DependencyRow can write a python P&R coriolis2 program that just puts SRLatch objects down in a simple 1D for-loop...

... and the person doing the DependencyMatrix can likewise write a python coriolis2 program that does the same for DependencyRow.

that was the plan.  does it make sense? i described it to JP and, although he hasn't reviewed the actual code as you did, he liked the idea.

hm i should ask him if he could take a quick look.

could i ask, could you perhaps explain the reasoning behind the recommendation to flatten the structure? what would be the advantages of doing so?
Comment 12 Staf Verhaegen 2020-01-13 18:34:52 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Staf Verhaegen from comment #8)
> > (Sorry accidently commit previous comment. Redo comment)
> > 
> > After reading a little further I think that the architecture was developed
> > at a moment where not the current synthesis, place-and-route and timing
> > check tools were available. 
> 
> yes. i need to get back to Taiwan and update everything. so much has
> happened even in 8 months.

I was talking about the CDC6600 architecture...
Comment 13 Luke Kenneth Casson Leighton 2020-01-13 19:02:00 GMT
(In reply to Staf Verhaegen from comment #12)
> (In reply to Luke Kenneth Casson Leighton from comment #11)
> > (In reply to Staf Verhaegen from comment #8)
> > > (Sorry accidently commit previous comment. Redo comment)
> > > 
> > > After reading a little further I think that the architecture was developed
> > > at a moment where not the current synthesis, place-and-route and timing
> > > check tools were available. 
> > 
> > yes. i need to get back to Taiwan and update everything. so much has
> > happened even in 8 months.
> 
> I was talking about the CDC6600 architecture...

oh! yes, haha, yes absolutely.  writing for benefit of list archives here: google "Design of a Computer, James Thornton".

they used ECL logic, they used *actual transistors* (3 pins, thru-hole), and did the entire design over several years, on paper, *waiting for that transistor to be invented*

how stunning is that.

Mitch's book then "converts" that ECL logic into standard ISO gates, and, with a *lot* of help i was able to replicate it.
  
Mitch is one of the last people left on the entire planet with the crossover knowledge between the very early supercomputer days and modern EDA design.  even he still only does gate-level design, not verilog.
Comment 14 Luke Kenneth Casson Leighton 2020-01-13 19:11:00 GMT
i apologise i hit "back button" and lost a reply.

battery low so must be brief.

greatly appreciate the review and insights on this exceptionally complex deceptive code.

let us however focus, please do comment on this logical reasoning for any flaws or misassertions

* early analysis shows SRNOR and SRNAND to be functionally equivalent even down to the "unknown" transitions
* the difference is the inversion of inputs and outputs, which, given boolean algebra manipulation, makes sense.
* a careful code-morph can put SRNAND in place of SRNOR
* yosys appears to have $sr and _SR_* however they are undocumented and need investigation
* nmigen limitations can be worked around by using cocotb and "specials"

basically what i am saying is that if you were to focus on adding SRNAND to the cell library, for the March deadline, we can use it.

again, to emphasise: someone reading and asking questions on this exceptionally difficult design is genuinely appreciated.
Comment 15 Luke Kenneth Casson Leighton 2020-01-13 19:17:31 GMT
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-January/003345.html

jacob kindly points out the typo in the truth tables i found randomly on the internet.  SRNOR and SRNAND are duals of each other.

thus, to emphasise, staf, if an SRNAND cell were available, we could use it.  and will take care of the tools to do so.
Comment 16 Staf Verhaegen 2020-01-13 19:21:09 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Staf Verhaegen from comment #8)
> 
> if i can summarise: DependencyRow is gone, and merged into DependencyMatrix?
> 
> assuming yes: my concern with that was for P&R - we will not be doing full
> automatic P&R in coriolis2 - was to do isolated *small* blocks (cells) that
> *may* be done automated (or manual), and are small enough to visually
> inspect without being completely overwhelmed by the sight of the resulting
> GDS image (or intermediate output before that)
> 
> a 2 dimensional matrix is not only extremely unlikely for the autorouter to
> get right, it is much more complex to design the manual layout.

If your autorouter is not able to that, I don't think you will ever be able to make a design working in a more recent technology node at a decent speed. For these nodes meeting timing will need buffer insertion for optimal timing performance, if you will need to do that manually in your custom matrix layout this will need to be optimized for each node, maybe even every design. You should be able to rely on a tool to do this properly.

> 
> (as you will no doubt know, i mention this for the benefit of other readers,
> staf, coriolis2 layouts are *python* programs!)
> 
> so the justification for doing 1D of 1D objects was so that the person doing
> the DependencyRow can write a python P&R coriolis2 program that just puts
> SRLatch objects down in a simple 1D for-loop...
> 
> ... and the person doing the DependencyMatrix can likewise write a python
> coriolis2 program that does the same for DependencyRow.
> 
> that was the plan.  does it make sense? i described it to JP and, although
> he hasn't reviewed the actual code as you did, he liked the idea.
> 
> hm i should ask him if he could take a quick look.
> 
> could i ask, could you perhaps explain the reasoning behind the
> recommendation to flatten the structure? what would be the advantages of
> doing so?

It's the same discussion of old time of assembly coding versus higher level language. What you are actually doing is the equivalent of inline assembly for the dependency matrix because you think you can do the work better than the compiler. Personally I would rather work on optimizing the compiler than overriding the compiler with own inline assembly.

To be more concrete, the placer currently tries to minimize the length of the interconnects. By doing the SRLatch array manually you say that you know better than the placer what the good placement of the cells is. When scaling to smaller nodes, the placer has to be changed to do placement for optimizing timing and do proper buffer insertion. At that point I think it would be difficult to do that better manually. As said above when scaling to smaller nodes you will need to insert buffers on longer signals etc. meaning that you may need to do that also manually.

But I leave it up to you and JP to decide how to do it, I will include a SR latch. But it will mean that you do the Dependency Matrix not in nmigen + yosys because the tools will try to optimize your design and fiddle with the design you wrote in nmigen.
I think for getting this to work you will need to use Instance for all cells in the design you want to place manually in coriolis script otherwise nmigen and yosys will fiddle with your design.
Comment 17 Luke Kenneth Casson Leighton 2020-01-13 22:19:02 GMT
hi staf, i realised after some thought that it is not so much DependencyMatrix or DependencyRow that your recommendation critically  applies to, as it is the current SRLatch being a multi-width object.

if *that* was returned to its original implementation (single bit inputs, single bit outputs) then the valid concerns you raise would disappear, allow me to explain.

to answer about the manual vs automatic placement: you need to see (and in under one second you will immediately visually understand the following) the gate-level diagrams from either "Design of a Computer" or from Mitch Alsup's book chapters.

the wires between DM Cells (where the SRlatches are) *only* go dead straight, horizontal, *or* they go vertical, *right* through the entire Matrix.  they never do both, and they never curve.  ever.

in this regard they are extremely similar to how OpenRAM works, and consequently i believe a similar programmatic layout approach may be deployed.

thus even if we did have to redesign the layout per geometry it is such a trivial task (move some fully and exclusively straight and horizontal wires up or down a bit) that even without having done this task yet i would be honestly surprised if it was problematic or costly.

now, running the autorouter on DependencyCell (which i had to remove and agree it should be reinstated sigh), *that* is a good idea, i feel, after working out and setting suitable inputs and output locations best suited to the "straight line" connections in the rows and columns.

btw you should be able to deduce what DependencyCell is, from its name :)

the only reason i went with the multi-bit  design of SRLatch was because i was seeing literally seconds per simulated cycle and from an iterative design logic perspective that was severely hampering my ability to make small fast "debugging and understanding" changes.

now that i am happy with the overall logic as being reasonably correct in maintaining instruction order, yes that optimisation, suited to nmigen's limitations, can be undone.

i feel i will have to think how to make it possible to run both (and for there to be a formal equivalence proof) because with the single bit SRLatch, even with verilator and cocotb we may well be looking at seconds per simulated cycle. will find out soon :)
Comment 18 Luke Kenneth Casson Leighton 2020-01-14 03:05:03 GMT
(In reply to Staf Verhaegen from comment #16)


> placement of the cells is. When scaling
> to smaller nodes, the placer has to be changed to do placement for
> optimizing timing and do proper buffer insertion.

i missed this earlier.  i remember now a discussion on comp.arch where people warned about driving a 128 bit unary array from a single source.

that is effectively what each row (and column) inputs are driven by.

each input from the left side drives every entry in the DM Row

each input from the top drives every entry in a DM column.

thus you are absolutely right, there will be buffers needed.

i believe that timing *should* be ok because despite the size it is extremely regular.  no interactions horizontally, no interactions vertically.  *only* at the actual DependencyCell itself does interaction take place.

very much like an SRAM Matrix.

 
this will be, i assume and hope, that Jean Paul will be teaching the interns at LIP6.fr about buffers and timing.
Comment 19 Staf Verhaegen 2020-01-14 09:21:34 GMT
> if *that* was returned to its original implementation (single bit inputs, single bit outputs) then the valid concerns you raise would disappear, allow me to explain.
> ...
> in this regard they are extremely similar to how OpenRAM works, and consequently i believe a similar programmatic layout approach may be deployed.
>
> thus even if we did have to redesign the layout per geometry it is such a
> trivial task (move some fully and exclusively straight and horizontal wires up
> or down a bit) that even without having done this task yet i would be honestly
> surprised if it was problematic or costly.

Having been a SRAM compiler developer in my previous live, I tend to have a different opinion on trivialness of maintaining a DepenceMatrix compiler and porting it to new nodes. For porting an existing SRAM compiler to a new node we accounted I think 9 manmonths of work. I agree, that for the DependencyMatrix the work on the SRLatch will be much less than on the SRAM bit cell but, still ...

That said I like having different opinions, would be nice if both options could be included and compared to show the actual benefit of doing the DependencyMatrix in a custom way. Having the DependencyMatrix as nmigen code like I propose definitively seems the trivial option to me. It will also allow to easily test design on FPGA etc.
Comment 20 Luke Kenneth Casson Leighton 2020-01-14 16:22:01 GMT
hi staf ok raised bug #156 will fill more details in it later.
Comment 21 Staf Verhaegen 2020-03-03 15:05:08 GMT
I made first version of the rslatches for nsxlib. I need help in getting them through lvs check. I did a merge request on LIP6 gitlab (https://gitlab.lip6.fr/vlsi-eda/alliance-check-toolkit/merge_requests/4). I propose to discuss it further in that merge request.