Bug 276 - SR NAND Latch needed in nmigen
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 201
Reported: 2020-04-03 11:00 BST by Luke Kenneth Casson Leighton
Modified: 2020-12-02 14:19 GMT
NLnet milestone: NLNet.2019.Coriolis2
total budget (EUR) for completion of task and all subtasks: 250
budget (EUR) for this task, excluding subtasks' budget: 250
parent task for budget allocation: 201
child tasks for budget allocation:
Attachments
SR latch demonstration (10.00 KB, application/x-tar)
2020-04-10 17:31 BST, whitequark
Description Luke Kenneth Casson Leighton 2020-04-03 11:00:37 BST
The Dependency Matrices are huge and need SR NAND latches to get their size down.  This means we need to be able to express them in nmigen, perform some form of equivalent simulation, and also have them passed through to yosys.
Comment 1 whitequark 2020-04-03 12:00:13 BST
(The following comment is copied from the email I sent earlier, so that it is accessible publicly.)

Of the things you mentioned, this includes only the SR latch issue.

This issue is unfortunately quite involved. (n)Migen is designed to
connect large islands of purely synchronous logic with a few async
bridges. It does not, very deliberately, directly support arbitrary
asynchronous primitives like the SR latch. So you can't just drop one
into your design and have it work the same way normal logic works.
However, you're in luck because better support for asynchronous
signals was something I took into account when designing nMigen and
both of the new simulators, pysim and cxxsim.

To solve this issue, we'll need to work together to determine your
specific use case. Do you want to use the exact same netlist for
nMigen simulation and synthesis? If not, it means you can model
whatever it is that contains the SR latches (I'm not sure what
matrices you're referring to, here) using synchronous code, and
replace them with true SR latches for synthesis. Of course, you will
need to convince yourself these two designs are equivalent.

If you *do* want to use the exact same netlist, there are a few
options. Both the new nMigen simulator and cxxrtl use an architecture
that, in principle, supports logic feedback loops. So you can model
the SR latch using two ordinary NAND gates. If you have an instance
of a SRNAND cell, you can provide a simulation model for this instance
that uses two NAND gates, and it'll work. However, I must warn you
that both for nMigen pysim and cxxrtl, this will come with a severe
performance penalty of at least ~10x, possibly worse depending on your
exact design. If you are planning to use fine SRNAND cells (one per
bit rather than one per word), expect a further slowdown of the word
size factor. Another issue is that neither pysim nor cxxsim provide
any way to control the possible race conditions. That is, the SRNAND
latch that is simulated in pysim or cxxsim will initialize to
an indeterminate value, and chaining them together will lead to
unpredictable results.
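
A minimal pure-Python sketch of the "two ordinary NAND gates" model described above, for a coarse (word-wide) latch with active-low inputs. This is not nMigen or cxxrtl code; the names `nand` and `settle` are made up for illustration. Both gate outputs are recomputed from the previous values until nothing changes, and the sketch gives up if the loop never settles, which is what happens here when set and clear are released at the same moment.

```python
def nand(a, b, mask):
    """Bitwise NAND of two words, truncated to the word width."""
    return ~(a & b) & mask


def settle(set_n, clr_n, q, qn, width, max_iters=8):
    """Iterate the cross-coupled NAND pair until its outputs stop changing.

    set_n and clr_n are active-low; q and qn are the previous outputs.
    Raises RuntimeError if the feedback loop keeps oscillating.
    """
    mask = (1 << width) - 1
    for _ in range(max_iters):
        next_q = nand(set_n, qn, mask)   # upper gate sees SET# and Q#
        next_qn = nand(clr_n, q, mask)   # lower gate sees CLR# and Q
        if (next_q, next_qn) == (q, qn):
            return q, qn
        q, qn = next_q, next_qn
    raise RuntimeError("latch did not settle: racy inputs")


# Set bits 0 and 2 of a 4-bit latch, then hold, then clear bit 0.
q, qn = settle(0b1010, 0b1111, 0b0000, 0b1111, width=4)
q, qn = settle(0b1111, 0b1111, q, qn, width=4)
q, qn = settle(0b1111, 0b1110, q, qn, width=4)
print(f"q={q:04b} qn={qn:04b}")
```

A fine-grained variant would call the same function once per bit with `width=1`, which is where the extra word-size slowdown mentioned above would come from.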
Comment 2 Luke Kenneth Casson Leighton 2020-04-03 13:05:29 BST
(In reply to whitequark from comment #1)
> (The following comment is copied from the email I sent earlier, so that it
> is accessible publicly.)

appreciated, whitequark.

summary (the rest is archive-suitable / context) is: we'd like to be able
to do "exact" netlist (we already have semi-equivalent, and this is working).


> Of the things you mentioned, this includes only the SR latch issue.
> 
> This issue is unfortunately quite involved. (n)Migen is designed to
> connect large islands of purely synchronous logic with a few async
> bridges. It does not, very deliberately, directly support arbitrary
> asynchronous primitives like the SR latch. So you can't just drop one
> into your design and have it work the same way normal logic works.

yes.  very much aware that standard proprietary commercial tools completely
dropped support for SR latches.

(reiterating this for the archives:
however we really cannot use DFFs, here.  that is 10 gates rather than 2,
and the number of Cells needed is massive: one of the Matrices may need
to be 128 x 30, with at least four maybe five latches per Cell.  that's
192,000 gates *just for that Matrix* if we use DFFs.
if we use SR Latches, it's 38400 gates, which is tolerable)
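
As a quick check of the arithmetic above, here is a small sketch. It assumes five latches per Cell, which is what the quoted totals correspond to; the constant and function names are illustrative only.

```python
ROWS, COLS = 128, 30        # largest Matrix size mentioned above
LATCHES_PER_CELL = 5        # "at least four maybe five"; the totals use five
GATES_PER_DFF = 10
GATES_PER_SR_LATCH = 2


def matrix_gates(gates_per_latch, latches_per_cell=LATCHES_PER_CELL):
    """Total gate count for one Matrix of ROWS x COLS Cells."""
    return ROWS * COLS * latches_per_cell * gates_per_latch


print(matrix_gates(GATES_PER_DFF))       # 192000
print(matrix_gates(GATES_PER_SR_LATCH))  # 38400
```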


> However, you're in luck because better support for asynchronous
> signals was something I took into account when designing nMigen and
> both of the new simulators, pysim and cxxsim.

whew :)

> To solve this issue, we'll need to work together to determine your
> specific use case. Do you want to use the exact same netlist for
> nMigen simulation and synthesis?

i honestly don't know: we're open to suggestions.  (having read ahead)
if we can do both (and use one for rapid prototyping and the other as
a cross-check before moving to synthesis) that would be great.

right now we have something that "works" however
it uses DFFs, not SR latches (see latch.py link, below).

> If not, it means you can model
> whatever it is that contains the SR latches (I'm not sure what
> matrices you're referring to, here)

Out-of-Order Read/Write Hazard detection and avoidance matrices.
see "Modifications to Dependency Cell"
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

* the DMs encode all registers in Unary (single bit activation).
  thus one "row" represents all registers used for a given FunctionUnit
  [and access to its ALU].
* each Cell thus records, in bit-level (unary) form, the fact that read/write
  for a given function (add) is needed, by raising the "Reg 5 needs READ"
  and "Reg 7 needs WRITE" SR Latches
* these now-active signals indicate to subsequent instructions
  "hello i have a read hazard, hello i have a write hazard" respectively.
  any attempt to use those registers thus STOPs that FunctionUnit from executing.
* hazards get cleared out once results are written
* when there are zero hazards (in any given row), that instruction
  becomes free-and-clear to proceed.

it's actually incredibly simple.
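
A toy Python model of the bullet points above, to make the unary encoding concrete. It is a loose illustration only, not the actual Libre-SOC matrices: the class name, method names and the exact hazard rules are assumptions made for this sketch, and each integer bit stands in for one SR Latch.

```python
class DependencyMatrix:
    """One row per Function Unit, one unary bit per register."""

    def __init__(self, n_fus, n_regs):
        self.n_regs = n_regs
        self.rd_pending = [0] * n_fus   # bit r set: this FU still needs to READ reg r
        self.wr_pending = [0] * n_fus   # bit r set: this FU still needs to WRITE reg r

    def unary(self, regs):
        """Encode a list of register numbers as a one-bit-per-register mask."""
        mask = 0
        for r in regs:
            assert 0 <= r < self.n_regs
            mask |= 1 << r
        return mask

    def issue(self, fu, reads, writes):
        """Raise (set) the latches for the registers this FU will use."""
        self.rd_pending[fu] |= self.unary(reads)
        self.wr_pending[fu] |= self.unary(writes)

    def hazards(self, fu, reads, writes):
        """Mask of registers that some other FU still has a claim on."""
        rd, wr = self.unary(reads), self.unary(writes)
        blocked = 0
        for other in range(len(self.rd_pending)):
            if other == fu:
                continue
            blocked |= rd & self.wr_pending[other]    # read of a pending write
            blocked |= wr & (self.rd_pending[other] | self.wr_pending[other])
        return blocked

    def retire(self, fu):
        """Results written: reset every latch in this FU's row."""
        self.rd_pending[fu] = 0
        self.wr_pending[fu] = 0


dm = DependencyMatrix(n_fus=2, n_regs=32)
dm.issue(0, reads=[5], writes=[7])             # "Reg 5 needs READ", "Reg 7 needs WRITE"
print(bin(dm.hazards(1, reads=[7], writes=[3])))  # non-zero: FU 1 must wait
dm.retire(0)
print(dm.hazards(1, reads=[7], writes=[3]))       # zero: free to proceed
```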

here's one for Register-to-FunctionUnit (FU being the "arbitrator" for access
to an ALU pipeline), you have to drill down a bit (through DependencyRow)
to get to the SR Latches themselves

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/fu_reg_matrix.py;h=06380434c8d7d20828d80c3d0e020161bcb2c2e4;hb=b01f6bc0ad28cda131beae33e5fe338daaf5e9ea

> using synchronous code, 

funnily enough that's where we are right now:
https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/latch.py;h=7d6a1efe22c881585a626e397590337186f6ef1b;hb=HEAD#l41

> and
> replace them with true SR latches for synthesis. Of course, you will
> need to convince yourself these two designs are equivalent.

yes.  fortunately, coriolis2 has gate-level simulation (which needs
investigation) so we can triple-check.


> If you *do* want to use the exact same netlist, there are a few
> options. Both the new nMigen simulator and cxxrtl use an architecture
> that, in principle, supports logic feedback loops. So you can model
> the SR latch using two ordinary NAND gates. If you have an instance
> of a SRNAND cell, you can provide a simulation model for this instance
> that uses two NAND gates, and it'll work. 

with the caveat that the current SRLatch class is actually a Reset-priority
SR latch, i *believe* we already have the former, so this latter is really
what we'd like to have.

we can always flip in/out the current SR Latch for a properly asynchronous
one (exact same netlist), run some (unbelievably slow) tests, then generate
the actual netlist to be handed to yosys (and from there to coriolis2).


> However, I must warn you
> that both for nMigen pysim and cxxrtl, this will come with a severe
> performance penalty of at least ~10x, possibly worse depending on your
> exact design. If you are planning to use fine SRNAND cells (one per
> bit rather than one per word), expect a further slowdown of the word
> size factor.

interesting, because a multi-word design turned out to be necessary
to get a speedup factor: some of the Matrices are so large (128 x 30 will
be the largest) that we get seconds per clock in pysim if done as individual
one-bit Cells.  with a 4 to 5 level hierarchy of Elaboratable classes it
was intolerable.

by removing one level of hierarchy, pysim ran in reasonable time.

i found this generally to be true: the more levels of hierarchy, the
slower pysim got.  and due to the MASSIVE size of our designs, we need
hierarchically-laid-out classes (well over 250 so far).


> Another issue is that neither pysim nor cxxsim provide
> any way to control the possible race conditions. That is, the SRNAND
> latch that is simulated in pysim or cxxsim will initialize to
> an indeterminate value, and chaining them together will lead to
> unpredictable results.

you'll like this: it's much "worse" than it looks :)  we actually have
a triple-connected ring of SRNAND latches, acting as a pseudo pipeline.

however, it turns out that the set-reset logic which *enables* the three
SRNANDs is very, very specifically organised such that only two of them
are ever possible to have active at any one time, and thus we create
a "revolving door" effect.

those "protections" are all synchronous.  so whilst the SRNAND cells
are asynchronous and could hypothetically end up in "unknown", in
practice this can *never* occur because the conditions which create
"unknown" are specifically avoided (and avoided using synchronous
logic).

in case you were wondering: this is a proven design.  it was used in the
original CDC 6600, and also in the AMD Opteron Series.  however, in the
CDC 6600 they used transistors (as in: *actual* three-pronged transistors,
the largest ever single order made for transistors in the world), and
in the AMD Opteron Series they of course had the financial budget and
resources of a multi-billion-dollar company and so could happily pay for
custom silicon.

so.  conclusion, after all that (apologies), is back at the top.
Comment 3 whitequark 2020-04-03 13:11:31 BST
OK, having read that I suggest we follow the standard procedure. The new pysim design should already handle this kind of logic loop just fine. I suggest you make a (coarse) SRNAND latch out of normal (coarse) NAND cells, test that it works correctly in your designs, and if it does not, file an MCVE over at the nmigen repository. Then I investigate and fix any issues. The same with cxxsim once it's ready.

Regarding hierarchical design slowdown: this is inevitable in the current pysim architecture, but cxxsim can trivially use the Yosys flattening functionality and does not suffer from any hierarchy-related issues.
Comment 4 whitequark 2020-04-03 13:52:17 BST
Ah, there's another option that you might find interesting. Yosys has a `$sr` coarse cell. If you're willing to forgo pysim completely and get cxxsim working for you, then I could implement support for that cell in cxxsim. It would probably be the least amount of work out of every discussed option, provided that you can get cxxsim working for you.

Of course, you might want to look into cxxsim anyway given the massive performance improvements it promises (on fully synchronous designs, it's actually competitive with single-threaded Verilator generated code; on designs with feedback arcs performance will degrade by the above-mentioned factor of about 5-10x, but still quite fast.)
Comment 5 Luke Kenneth Casson Leighton 2020-04-03 14:24:01 BST
(In reply to whitequark from comment #4)
> Ah, there's another option that you might find interesting. Yosys has a
> `$sr` coarse cell. 

yes.  this i _believe_ is what Staf has implemented for us, in the nsxlib
Cell Library:
http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

> If you're willing to forgo pysim completely and get
> cxxsim working for you, then I could implement support for that cell in
> cxxsim.

yes please.

> It would probably be the least amount of work out of every discussed
> option, provided that you can get cxxsim working for you.

ah then that's worthwhile investigating straight away.  are there examples
anywhere?

> Of course, you might want to look into cxxsim anyway

yes.  we've a half million gate design.  using e.g. cocotb (a python
co-simulator which compiles verilog using verilator and then annotates
and interacts with it from python) was on the cards.

out of interest would that be feasible using cxxsim (either as an addition
to cocotb or as a separate project)?
Comment 6 whitequark 2020-04-03 14:42:15 BST
> yes.  this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library: http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

If your designs are already using $sr cells then the cxxsim approach would be ideal.

> ah then that's worthwhile investigating straight away.  are there examples anywhere?

Yes, in the pull request introducing the backend (which is not yet merged, I'm adding some polish to it right now): https://github.com/YosysHQ/yosys/pull/1562. The examples are a bit sparse but I think the developers working on this project will have no issues figuring things out (or just ask me).

The general idea behind cxxsim is to be as straightforward a translation of RTLIL to C++ as possible. Of course, RTLIL is not imperative, so some extra logic is necessary.

This extra logic isn't something I invented myself, but rather I took the well-known VHDL two-phase simulator that provides deterministic* results regardless of evaluation order, even for combinatorial loops, and adapted it to RTLIL. Some VHDL developers call it "the crown jewel of VHDL" and I concur, it is an excellent algorithm.

* The "deterministic" in reference to this simulator means that the results are the same regardless of the order in which the (parallel) RTLIL is translated to (sequential) C++, taking that one step out of consideration when debugging your code. If your design is inherently racy, you will get nondeterministic results anyway.
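
To illustrate the evaluate-then-commit idea in isolation, here is a minimal Python sketch. It is not cxxrtl's implementation; `delta_step`, `naive_step` and the two toy processes are hypothetical names, and each signal is assumed to have a single driver. Every process reads the same snapshot of the signals and the results are committed together, so the order in which the processes run does not matter. The naive variant, where writes become visible immediately, is included only for contrast.

```python
def delta_step(processes, state):
    """One two-phase step: evaluate against a snapshot, then commit."""
    pending = {}
    for proc in processes:
        pending.update(proc(state))   # phase 1: compute, nothing is written yet
    new_state = dict(state)
    new_state.update(pending)         # phase 2: commit all updates at once
    return new_state


def naive_step(processes, state):
    """Order-dependent variant: each write is visible to the next process."""
    state = dict(state)
    for proc in processes:
        state.update(proc(state))
    return state


def copy_b_to_a(s):
    return {"a": s["b"]}


def copy_a_to_b(s):
    return {"b": s["a"]}


start = {"a": 0, "b": 1}
for order in ([copy_b_to_a, copy_a_to_b], [copy_a_to_b, copy_b_to_a]):
    print("two-phase:", delta_step(order, start), "naive:", naive_step(order, start))
```

With the two-phase step both orderings swap `a` and `b`; with the naive step the result depends on which process happens to run first.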

> using e.g. cocotb (a python co-simulator which compiles verilog using verilator and then annotates and interacts with it from python) was on the cards.
> out of interest would that be feasible using cxxsim (either as an addition to cocotb or as a separate project)?

So, I'm working on two different patches, which are somewhat confusingly named "cxxrtl" and "cxxsim".

Cxxrtl is an addition to Yosys that allows it to emit C++ simulating virtually any valid RTLIL, similar to Verilator but more flexible. (Verilator doesn't support asynchronous logic at all, cxxrtl does, at a fairly significant performance penalty if you have any in your design.) Cxxrtl is largely finished; it has been used for simulating medium-sized SoCs and I expect it would not present any issues for your codebase as long as you stick to synchronous logic only.

Cxxsim is an addition to nMigen that would do basically the same thing as cocotb, but present an interface identical to the existing pure-Python simulator. Cxxsim does not yet exist. It is however fairly high on my list of things to do.

For now I suggest that you simulate your code using a C++ testbench to see what sort of performance you get, and run simple acceptance tests to see if there are any glaring issues in cxxrtl that break your design. I think there is no point writing code to integrate cxxrtl with cocotb; cxxsim is about as much effort, and will let you use your existing testcases unmodified.
Comment 7 Luke Kenneth Casson Leighton 2020-04-03 14:58:51 BST
(In reply to whitequark from comment #6)
> > yes.  this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library: http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21
> 
> If your designs are already using $sr cells then the cxxsim approach would
> be ideal.

brilliant.  at a suitable time in the future we'd need a (quick) indicator
on how to express $sr in nmigen


> > ah then that's worthwhile investigating straight away.  are there examples anywhere?
> 
> Yes, in the pull request introducing the backend (which is not yet merged,
> I'm adding some polish to it right now):
> https://github.com/YosysHQ/yosys/pull/1562. The examples are a bit sparse
> but I think the developers working on this project will have no issues
> figuring things out

correct :)

> (or just ask me).
> 
> The general idea behind cxxsim is to be as straightforward a translation of
> RTLIL to C++ as possible. Of course, RTLIL is not imperative, so some extra
> logic is necessary.
> 
> This extra logic isn't something I invented myself, but rather I took the
> well-known VHDL two-phase simulator that provides deterministic* results
> regardless of evaluation order, even for combinatorial loops, and adapted it
> to RTLIL. Some VHDL developers call it "the crown jewel of VHDL" and I
> concur, it is an excellent algorithm.

i heard rumour about this from elsewhere.

> > out of interest would that be feasible using cxxsim (either as an addition
> > to cocotb or as a separate project)?
> 
> So, I'm working on two different patches, which are somewhat confusingly
> named "cxxrtl" and "cxxsim".

:)

> For now I suggest that you simulate your code using a C++ testbench to see
> what sort of performance you get, and run simple acceptance tests to
> see if there are any glaring issues in cxxrtl that break your design. 

willdo.

> I
> think there is no point writing code to integrate cxxrtl with cocotb; cxxsim
> is about as much effort, and will let you use your existing testcases
> unmodified.

fantastic.
Comment 8 whitequark 2020-04-03 15:02:21 BST
> brilliant.  at a suitable time in the future we'd need a (quick) indicator on how to express $sr in nmigen

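from nmigen import Instance

# set, clr and q are width-bit Signals; note that `set` shadows the Python builtin here.
assert width == len(set) and width == len(clr) and width == len(q)
# Emit a Yosys `$sr` cell directly; polarity 1 means the input is active-high.
m.submodules += Instance("$sr",
    p_WIDTH=width,
    p_SET_POLARITY=1,
    p_CLR_POLARITY=1,
    i_SET=set,
    i_CLR=clr,
    o_Q=q)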
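
For reference, a rough behavioural model of what such a cell computes, written in plain Python rather than nMigen. It assumes that clear wins when set and clear are both asserted on the same bit, which is worth checking against the Yosys cell library documentation for the version in use. The function name `sr_cell` is made up for this sketch; the keyword parameters mirror the `SET_POLARITY` and `CLR_POLARITY` parameters above.

```python
def sr_cell(q, set_, clr, width, set_polarity=1, clr_polarity=1):
    """Next output of a word-wide set/clear latch, one independent latch per bit."""
    mask = (1 << width) - 1
    # Normalise both inputs to active-high according to the polarity parameters.
    pos_set = (set_ if set_polarity else ~set_) & mask
    pos_clr = (clr if clr_polarity else ~clr) & mask
    # Set bits are raised, then clear bits are forced low (clear has priority).
    return (q | pos_set) & ~pos_clr & mask


q = sr_cell(0b0000, set_=0b0101, clr=0b0000, width=4)   # raise bits 0 and 2
q = sr_cell(q, set_=0b0000, clr=0b0000, width=4)        # hold
q = sr_cell(q, set_=0b0001, clr=0b0001, width=4)        # both asserted on bit 0
print(f"{q:04b}")
```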
Comment 9 Staf Verhaegen 2020-04-03 19:51:45 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to whitequark from comment #4)
> > Ah, there's another option that you might find interesting. Yosys has a
> > `$sr` coarse cell. 
> 
> yes.  this i _believe_ is what Staf has implemented for us, in the nsxlib
> Cell Library:
> http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

Not really, I implemented the standard cells implementing the SR latches. A variant with NOR gates and one with NAND gates and two drive strengths of each.

Yosys and synthesis, where the logic netlist coming from nMigen needs to be mapped to the cells in the library, are on a higher level than what I am doing.
Comment 10 Luke Kenneth Casson Leighton 2020-04-03 20:21:57 BST
(In reply to Staf Verhaegen from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > (In reply to whitequark from comment #4)
> > > Ah, there's another option that you might find interesting. Yosys has a
> > > `$sr` coarse cell. 
> > 
> > yes.  this i _believe_ is what Staf has implemented for us, in the nsxlib
> > Cell Library:
> > http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21
> 
> Not really, I implemented the standard cells implementing the SR latches. A
> variant with NOR gates and one with NAND gates and two drive strengths of
> each.

i think that's what i meant.
https://gitlab.lip6.fr/vlsi-eda/alliance-check-toolkit/-/merge_requests/4/diffs

ok so these are called nsnrlatch and srlatch, and there are times-1 and times-4
variants.

> Yosys and synthesis, where the logic netlist coming from nMigen needs to be
> mapped to the cells in the library, are on a higher level than what I am doing.

my main reason to cross-ref what you kindly wrote for our use was to highlight
that it had happened.

so let me try to understand:

* yosys $sr (and variants) will need to map to these cells (nsnrlatch_xN
  and srlatch_xN).

  will this need patches (or a plugin) into yosys?

* nmigen will need to map to $sr

  i presume that given that yosys already has $sr, only nmigen will need
  augmenting to produce $sr?
Comment 11 whitequark 2020-04-04 09:15:12 BST
> i presume that given that yosys already has $sr, only nmigen will need  augmenting to produce $sr?

There's no augmenting necessary. I gave you the code above (the `Instance("$sr"...)`) that causes nMigen to emit a Yosys `$sr` cell.
Comment 12 Luke Kenneth Casson Leighton 2020-04-04 12:05:34 BST
(In reply to whitequark from comment #11)
> > i presume that given that yosys already has $sr, only nmigen will need  augmenting to produce $sr?
> 
> There's no augmenting necessary. I gave you the code above (the
> `Instance("$sr"...)`) that causes nMigen to emit a Yosys `$sr` cell.

fantastic! oh wait - i missed comment 8 (doh - how??)
http://bugs.libre-riscv.org/show_bug.cgi?id=276#c8

ok so the simulation(s) would be what need to recognise that.

could you let me know what amount you'd be happy to receive as a donation
from NLNet, for including $sr in cxxrtl?

also, i appreciate you suggested that we do the exploration, however that
would mean that someone else (other than you) receives money for doing so :)

as we'd very much like to support your work - and because we are a small
team there is such a lot else to do - would you be willing to add a
SRNAND latch example into cxxrtl for example?
Comment 13 whitequark 2020-04-04 12:10:08 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> could you let me know what amount you'd be happy to receive as a donation
> from NLNet, for including $sr in cxxrtl?

As it turns out, there's actually already support for `$dffsr` latch in cxxrtl, which is what you need but with an extra clock. So `$sr` support would be about ten lines of code I have to copy and paste from the `$dffsr` case.
Comment 14 Luke Kenneth Casson Leighton 2020-04-04 12:15:47 BST
(In reply to whitequark from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > could you let me know what amount you'd be happy to receive as a donation
> > from NLNet, for including $sr in cxxrtl?
> 
> As it turns out, there's actually already support for `$dffsr` latch in
> cxxrtl, which is what you need but with an extra clock. So `$sr` support
> would be about ten lines of code I have to copy and paste from the `$dffsr`
> case.

*snort*.  funny.  well, you're spending time here, which is important as it
defines the scope, solves the problem, and, due to the critical importance
for the project, easily justifies giving you something.
Comment 15 whitequark 2020-04-10 16:04:38 BST
Cxxrtl is upstream in Yosys, including SR latch support.
Comment 16 Luke Kenneth Casson Leighton 2020-04-10 17:16:23 BST
(In reply to whitequark from comment #15)
> Cxxrtl is upstream in Yosys, including SR latch support.

fantastic, that's really appreciated.  we have *sigh* a bureaucratic
MoU to be signed (4 - now 5 - people waiting on EUR for that one).

not least because of all the other support that you've given, 
and recognising that this is really strategically crucial for us, if
you have time to write a short demo / unit test of the srlatch
i think we can easily justify increasing this one to EUR 300?

(although i'm sure it would be fairly trivial for you to write,
that's because you're most familiar with the code).
Comment 17 whitequark 2020-04-10 17:31:10 BST
Created attachment 50 [details]
SR latch demonstration
Comment 18 Jacob Lifshay 2020-04-10 18:27:19 BST
(In reply to whitequark from comment #17)
> Created attachment 50 [details]
> SR latch demonstration

For some reason, bugzilla doesn't send out notifications for creating attachments.

Thanks, whitequark, looks good to me!