The Dependency Matrices are huge and need SR NAND latches to get their size down. This means we need to be able to express them in nmigen, perform some form of equivalent simulation, and also have them passed through to yosys.
(The following comment is copied from the email I sent earlier, so that it is accessible publicly.) Of the things you mentioned, this includes only the SR latch issue. This issue is unfortunately quite involved. (n)Migen is designed to connect large islands of purely synchronous logic with a few async bridges. It does not, very deliberately, directly support arbitrary asynchronous primitives like the SR latch. So you can't just drop one into your design and have it work the same way normal logic works. However, you're in luck because better support for asynchronous signals was something I took into account when designing nMigen and both of the new simulators, pysim and cxxsim. To solve this issue, we'll need to work together to determine your specific use case. Do you want to use the exact same netlist for nMigen simulation and synthesis? If not, it means you can model whatever it is that contains the SR latches (I'm not sure what matrices you're referring to, here) using synchronous code, and replace them with true SR latches for synthesis. Of course, you will need to convince yourself these two designs are equivalent. If you *do* want to use the exact same netlist, there are a few options. Both the new nMigen simulator and cxxrtl use an architecture that, in principle, supports logic feedback loops. So you can model the SR latch using two ordinary NAND gates. If you have an instance of a SRNAND cell, you can provide a simulation model for this instance that uses two NAND gates, and it'll work. However, I must warn you that both for nMigen pysim and cxxrtl, this will come with a severe performance penalty of at least ~10x, possibly worse depending on your exact design. If you are planning to use fine SRNAND cells (one per bit rather than one per word), expect a further slowdown of the word size factor. Another issue is that neither pysim nor cxxsim provide any way to control the possible race conditions. 
That is, the SRNAND latch that is simulated in pysim or cxxsim will initialize to an indeterminate value, and chaining them together will lead to unpredictable results.
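A minimal pure-Python sketch of the two-NAND model and of the race mentioned above. It assumes active-low set/reset inputs and evaluates the two gates one after the other in a fixed order; `nand`, `settle` and `q_gate_first` are hypothetical names for illustration and are not part of nMigen, pysim or cxxrtl.

```python
def nand(a, b):
    """Two-input NAND on single bits."""
    return 0 if (a and b) else 1


def settle(s_n, r_n, q, q_n, q_gate_first=True, max_iters=8):
    """Evaluate two cross-coupled NAND gates in order until stable.

    s_n / r_n are the active-low set and reset inputs, q / q_n the
    current outputs. Returns the settled (q, q_n), or None if the
    outputs never stop changing within max_iters passes.
    """
    for _ in range(max_iters):
        old = (q, q_n)
        if q_gate_first:
            q = nand(s_n, q_n)
            q_n = nand(r_n, q)
        else:
            q_n = nand(r_n, q)
            q = nand(s_n, q_n)
        if (q, q_n) == old:
            return q, q_n
    return None


# Set, reset and hold behave the same whichever gate is evaluated first.
print(settle(0, 1, 0, 1), settle(1, 0, 1, 0), settle(1, 1, 1, 0))

# Hold from an inconsistent start (q == q_n): the outcome depends on order.
print(settle(1, 1, 0, 0, q_gate_first=True))
print(settle(1, 1, 0, 0, q_gate_first=False))
```

In this toy model a latch that starts from q == q_n with neither input asserted lands on whichever state the evaluation order favours, which is one way to picture the indeterminate initial value.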
(In reply to whitequark from comment #1)
> (The following comment is copied from the email I sent earlier, so that it is accessible publicly.)

appreciated, whitequark. summary (the rest is archive-suitable / context) is: we'd like to be able to do "exact" netlist (we already have semi-equivalent, and this is working).

> Of the things you mentioned, this includes only the SR latch issue.
>
> This issue is unfortunately quite involved. (n)Migen is designed to connect large islands of purely synchronous logic with a few async bridges. It does not, very deliberately, directly support arbitrary asynchronous primitives like the SR latch. So you can't just drop one into your design and have it work the same way normal logic works.

yes. very much aware that standard proprietary commercial tools completely dropped support for SR latches.

(reiterating this for the archives: however we really cannot use DFFs, here. that is 10 gates rather than 2, and the number of Cells needed is massive: one of the Matrices may need to be 128 x 30, with at least four maybe five latches per Cell. that's 192,000 gates *just for that Matrix* if we use DFFs. if we use SR Latches, it's 38,400 gates, which is tolerable)

> However, you're in luck because better support for asynchronous signals was something I took into account when designing nMigen and both of the new simulators, pysim and cxxsim.

whew :)

> To solve this issue, we'll need to work together to determine your specific use case. Do you want to use the exact same netlist for nMigen simulation and synthesis?

i honestly don't know: we're open to suggestions. (having read ahead) if we can do both (and use one for rapid prototyping and the other as a cross-check before moving to synthesis) that would be great. right now we have something that "works" however it uses DFFs, not SR latches (see latch.py link, below).
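The gate-count estimate quoted in this comment can be reproduced with a few lines of arithmetic. This is only a sketch: it takes the upper figure of five latches per Cell and the quoted 10-gate and 2-gate costs at face value, and `matrix_gates` is a hypothetical helper name.

```python
ROWS, COLS = 128, 30        # largest Matrix size quoted in the thread
LATCHES_PER_CELL = 5        # upper end of "four maybe five"
GATES_PER_DFF = 10          # per-latch cost quoted for a DFF
GATES_PER_SR_LATCH = 2      # two cross-coupled gates


def matrix_gates(gates_per_latch):
    """Total gates for one Matrix at the given cost per latch."""
    return ROWS * COLS * LATCHES_PER_CELL * gates_per_latch


print(matrix_gates(GATES_PER_DFF))       # DFF-based Matrix
print(matrix_gates(GATES_PER_SR_LATCH))  # SR-latch-based Matrix
```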
> If not, it means you can model whatever it is that contains the SR latches (I'm not sure what matrices you're referring to, here)

Out-of-Order Read/Write Hazard detection and avoidance matrices. see "Modifications to Dependency Cell" https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

* the DMs encode all registers in Unary (single bit activation). thus one "row" represents all registers used for a given FunctionUnit [and access to its ALU].
* each Cell thus records, in bit-level (unary) form, the fact that read/write for a given function (add) is needed, by raising the "Reg 5 needs READ" and "Reg 7 needs WRITE" SR Latches
* these now-active signals indicate to subsequent instructions "hello i have a read hazard, hello i have a write hazard" respectively. attempts to use those registers thus STOPs that FunctionUnit from executing.
* hazards get cleared out once results are written
* when there are zero hazards (in any given row), that instruction becomes free-and-clear to proceed.

it's actually incredibly simple. here's one for Register-to-FunctionUnit (FU being the "arbitrator" for access to an ALU pipeline), you have to drill down a bit (through DependencyRow) to get to the SR Latches themselves
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/fu_reg_matrix.py;h=06380434c8d7d20828d80c3d0e020161bcb2c2e4;hb=b01f6bc0ad28cda131beae33e5fe338daaf5e9ea

> using synchronous code,

funnily enough that's where we are right now:
https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/latch.py;h=7d6a1efe22c881585a626e397590337186f6ef1b;hb=HEAD#l41

> and replace them with true SR latches for synthesis. Of course, you will need to convince yourself these two designs are equivalent.

yes. fortunately, coriolis2 has gate-level simulation (which needs investigation) so we can triple-check.

> If you *do* want to use the exact same netlist, there are a few options.
> Both the new nMigen simulator and cxxrtl use an architecture that, in principle, supports logic feedback loops. So you can model the SR latch using two ordinary NAND gates. If you have an instance of a SRNAND cell, you can provide a simulation model for this instance that uses two NAND gates, and it'll work.

with the caveat that the current SRLatch class is actually a Reset-priority SR latch, i *believe* we already have the former, so this latter is really what we'd like to have. we can always flip in/out the current SR Latch for a properly asynchronous one (exact same netlist), run some (unbelievably slow) tests, then generate the actual netlist to be handed to yosys (and from there to coriolis2).

> However, I must warn you that both for nMigen pysim and cxxrtl, this will come with a severe performance penalty of at least ~10x, possibly worse depending on your exact design. If you are planning to use fine SRNAND cells (one per bit rather than one per word), expect a further slowdown of the word size factor.

interesting, because a multi-word design turned out to be necessary to get a speedup factor: some of the Matrices are so large (128 x 30 will be the largest) that we get seconds per clock in pysim if done as individual one-bit Cells, and with a 4 to 5 level hierarchy of Elaboratable classes it was intolerable. by removing one level of hierarchy, pysim ran in reasonable time. i found this generally to be true: the more levels of hierarchy, the slower pysim got. and due to the MASSIVE size of our designs, we need hierarchically-laid-out classes (well over 250 so far).

> Another issue is that neither pysim nor cxxsim provide any way to control the possible race conditions. That is, the SRNAND latch that is simulated in pysim or cxxsim will initialize to an indeterminate value, and chaining them together will lead to unpredictable results.
you'll like this: it's much "worse" than it looks :) we actually have a triple-connected ring of SRNAND latches, acting as a pseudo pipeline. however, it turns out that the set-reset logic which *enables* the three SRNANDs is very, very specifically organised such that only two of them are ever possible to have active at any one time, and thus we create a "revolving door" effect. those "protections" are all synchronous. so whilst the SRNAND cells are asynchronous and could hypothetically end up in "unknown", in practice this can *never* occur because the conditions which create "unknown" are specifically avoided (and avoided using synchronous logic). in case you were wondering: this is a proven design. it was used in the original CDC 6600, and also in the AMD Opteron Series. however, in the CDC 6600 they used transistors (as in: *actual* three-pronged transistors, the largest ever single order made for transistors in the world), and in the AMD Opteron Series they of course had the financial budget and resources of a multi-billion-dollar company and so could happily pay for custom silicon. so. conclusion, after all that (apologies), is back at the top.
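The hazard tracking outlined in the bullet points earlier in this comment can be pictured with a small pure-Python model. This is a deliberately simplified sketch and not the logic in fu_reg_matrix.py: the class `DependencyMatrix`, its methods and the exact hazard rule are assumptions made for illustration, with plain integers standing in for rows of SR latches.

```python
class DependencyMatrix:
    """Toy hazard tracker: one row per Function Unit, one bit per register."""

    def __init__(self, n_fus):
        # Unary encoding: bit r of a row stands for register r.
        self.rd_pend = [0] * n_fus  # "Reg r needs READ" latches
        self.wr_pend = [0] * n_fus  # "Reg r needs WRITE" latches

    def issue(self, fu, reads, write):
        """Set the latches for an instruction issued to Function Unit fu."""
        for r in reads:
            self.rd_pend[fu] |= 1 << r
        self.wr_pend[fu] |= 1 << write

    def blocked(self, fu, reads, write):
        """True if another FU's pending read/write conflicts with this one."""
        others_rd = others_wr = 0
        for other in range(len(self.rd_pend)):
            if other != fu:
                others_rd |= self.rd_pend[other]
                others_wr |= self.wr_pend[other]
        read_mask = 0
        for r in reads:
            read_mask |= 1 << r
        write_mask = 1 << write
        # Assumed rule: reading a register that is still to be written, or
        # writing one that is still to be read or written, is a hazard.
        return bool((read_mask & others_wr)
                    | (write_mask & (others_rd | others_wr)))

    def complete(self, fu):
        """Results written: reset every latch in this FU's row."""
        self.rd_pend[fu] = 0
        self.wr_pend[fu] = 0


dm = DependencyMatrix(n_fus=2)
dm.issue(0, reads=[5], write=7)              # "Reg 5 needs READ", "Reg 7 needs WRITE"
print(dm.blocked(1, reads=[7], write=3))     # wants to read r7 before it is written
dm.complete(0)
print(dm.blocked(1, reads=[7], write=3))     # row cleared, free to proceed
```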
OK, having read that I suggest we follow the standard procedure. The new pysim design should already handle this kind of logic loop just fine. I suggest you make a (coarse) SRNAND latch out of normal (coarse) NAND cells, test that it works correctly in your designs, and if it does not, file an MCVE over at the nmigen repository. Then I investigate and fix any issues. The same with cxxsim once it's ready. Regarding hierarchical design slowdown: this is inevitable in the current pysim architecture, but cxxsim can trivially use the Yosys flattening functionality and does not suffer from any hierarchy-related issues.
Ah, there's another option that you might find interesting. Yosys has a `$sr` coarse cell. If you're willing to forgo pysim completely and get cxxsim working for you, then I could implement support for that cell in cxxsim. It would probably be the least amount of work out of every discussed option, provided that you can get cxxsim working for you. Of course, you might want to look into cxxsim anyway given the massive performance improvements it promises (on fully synchronous designs, it's actually competitive with single-threaded Verilator generated code; on designs with feedback arcs performance will degrade by the above-mentioned factor of about 5-10x, but still quite fast.)
(In reply to whitequark from comment #4)
> Ah, there's another option that you might find interesting. Yosys has a `$sr` coarse cell.

yes. this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library:
http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

> If you're willing to forgo pysim completely and get cxxsim working for you, then I could implement support for that cell in cxxsim.

yes please.

> It would probably be the least amount of work out of every discussed option, provided that you can get cxxsim working for you.

ah then that's worthwhile investigating straight away. are there examples anywhere?

> Of course, you might want to look into cxxsim anyway

yes. we've a half million gate design. using e.g. cocotb (a python co-simulator which compiles verilog using verilator and then annotates and interacts with it from python) was on the cards.

out of interest would that be feasible using cxxsim (either as an addition to cocotb or as a separate project)?
> yes. this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library: http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

If your designs are already using $sr cells then the cxxsim approach would be ideal.

> ah then that's worthwhile investigating straight away. are there examples anywhere?

Yes, in the pull request introducing the backend (which is not yet merged, I'm adding some polish to it right now): https://github.com/YosysHQ/yosys/pull/1562. The examples are a bit sparse but I think the developers working on this project will have no issues figuring things out (or just ask me).

The general idea behind cxxsim is to be as straightforward a translation of RTLIL to C++ as possible. Of course, RTLIL is not imperative, so some extra logic is necessary.

This extra logic isn't something I invented myself, but rather I took the well-known VHDL two-phase simulator that provides deterministic* results regardless of evaluation order, even for combinatorial loops, and adapted it to RTLIL. Some VHDL developers call it "the crown jewel of VHDL" and I concur, it is an excellent algorithm.

* The "deterministic" in reference to this simulator means that the results are the same regardless of the order in which the (parallel) RTLIL is translated to (sequential) C++, taking that one step out of consideration when debugging your code. If your design is inherently racy, you will get nondeterministic results anyway.

> using e.g. cocotb (a python co-simulator which compiles verilog using verilator and then annotates and interacts with it from python) was on the cards.
> out of interest would that be feasible using cxxsim (either as an addition to cocotb or as a separate project)?

So, I'm working on two different patches, which are somewhat confusingly named "cxxrtl" and "cxxsim".

Cxxrtl is an addition to Yosys that allows it to emit C++ simulating virtually any valid RTLIL, similar to Verilator but more flexible.
(Verilator doesn't support asynchronous logic at all, cxxrtl does, at a fairly significant performance penalty if you have any in your design.) Cxxrtl is largely finished; it has been used for simulating medium-sized SoCs and I expect it would not present any issues for your codebase as long as you stick to synchronous logic only.

Cxxsim is an addition to nMigen that would do basically the same thing as cocotb, but present an interface identical to the existing pure-Python simulator. Cxxsim does not yet exist. It is however fairly high on my list of things to do.

For now I suggest that you simulate your code using a C++ testbench to see what sort of performance you get, and run simple acceptance tests to see if there are any glaring issues in cxxrtl that break your design. I think there is no point writing code to integrate cxxrtl with cocotb; cxxsim is about as much effort, and will let you use your existing testcases unmodified.
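The two-phase idea referred to in this comment can be sketched in pure Python as an evaluate step followed by a commit step, repeated until nothing changes. This is a toy illustration of the general technique, not cxxrtl's implementation; `simulate`, `gate_q` and `gate_q_n` are hypothetical names, and each signal is assumed to have exactly one driver.

```python
def simulate(processes, signals, max_deltas=16):
    """Run delta cycles until no signal changes.

    Evaluate: every process reads only the current values and returns
    the updates it wants. Commit: all updates are applied together.
    No process sees another's update within the same delta cycle, so
    the order of the process list cannot affect the result.
    Returns the settled signals, or None if they never settle.
    """
    current = dict(signals)
    for _ in range(max_deltas):
        pending = {}
        for proc in processes:
            pending.update(proc(current))          # evaluate phase
        if all(current[k] == v for k, v in pending.items()):
            return current                         # nothing changed: stable
        current.update(pending)                    # commit phase
    return None


# Two cross-coupled NAND gates with active-low set/reset inputs.
def gate_q(s):
    return {"q": int(not (s["s_n"] and s["q_n"]))}


def gate_q_n(s):
    return {"q_n": int(not (s["r_n"] and s["q"]))}


start = {"s_n": 0, "r_n": 1, "q": 0, "q_n": 1}     # assert "set"
print(simulate([gate_q, gate_q_n], start))
print(simulate([gate_q_n, gate_q], start))         # same result, other order

racy = {"s_n": 1, "r_n": 1, "q": 0, "q_n": 0}      # inconsistent start, hold
print(simulate([gate_q, gate_q_n], racy))          # never settles in this model
```

The feedback loop settles to the same state for either process order. The inherently racy starting point is not fixed by the scheme; this toy model simply reports that it never settles.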
(In reply to whitequark from comment #6)
> > yes. this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library: http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21
>
> If your designs are already using $sr cells then the cxxsim approach would be ideal.

brilliant. at a suitable time in the future we'd need a (quick) indicator on how to express $sr in nmigen

> > ah then that's worthwhile investigating straight away. are there examples anywhere?
>
> Yes, in the pull request introducing the backend (which is not yet merged, I'm adding some polish to it right now): https://github.com/YosysHQ/yosys/pull/1562. The examples are a bit sparse but I think the developers working on this project will have no issues figuring things out

correct :)

> (or just ask me).
>
> The general idea behind cxxsim is to be as straightforward a translation of RTLIL to C++ as possible. Of course, RTLIL is not imperative, so some extra logic is necessary.
>
> This extra logic isn't something I invented myself, but rather I took the well-known VHDL two-phase simulator that provides deterministic* results regardless of evaluation order, even for combinatorial loops, and adapted it to RTLIL. Some VHDL developers call it "the crown jewel of VHDL" and I concur, it is an excellent algorithm.

i heard rumour about this from elsewhere.

> > out of interest would that be feasible using cxxsim (either as an addition to cocotb or as a separate project)?
>
> So, I'm working on two different patches, which are somewhat confusingly named "cxxrtl" and "cxxsim".

:)

> For now I suggest that you simulate your code using a C++ testbench to see what sort of performance you get, and run simple acceptance tests to see if there are any glaring issues in cxxrtl that break your design.

willdo.

> I think there is no point writing code to integrate cxxrtl with cocotb; cxxsim is about as much effort, and will let you use your existing testcases unmodified.

fantastic.
> brilliant. at a suitable time in the future we'd need a (quick) indicator on how to express $sr in nmigen

assert width == len(set) and width == len(clr) and width == len(q)
m.submodules += Instance("$sr",
    p_WIDTH=width, p_SET_POLARITY=1, p_CLR_POLARITY=1,
    i_SET=set, i_CLR=clr, o_Q=q)
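For cross-checking testbench expectations, the behaviour of such a cell can be approximated in pure Python. This sketch assumes the semantics of the `$sr` model in the Yosys simulation library as I understand them (per bit, an asserted clear wins over an asserted set, otherwise the bit holds), so treat the priority rule as an assumption and check it against your Yosys version. `sr_cell` is a hypothetical helper, and its parameters only mirror the `WIDTH`, `SET_POLARITY` and `CLR_POLARITY` names used above.

```python
def sr_cell(q, set_, clr, width, set_polarity=1, clr_polarity=1):
    """Next Q of a width-bit level-sensitive set/reset cell.

    q, set_ and clr are integers used as bit vectors. A polarity of 1
    means the input is active-high, 0 means active-low.
    """
    mask = (1 << width) - 1
    pos_set = (set_ if set_polarity else ~set_) & mask
    pos_clr = (clr if clr_polarity else ~clr) & mask
    # Per bit: set raises the bit, clear lowers it and takes priority;
    # with neither asserted the previous value is kept.
    return ((q | pos_set) & ~pos_clr) & mask


q = sr_cell(0b0000, set_=0b0011, clr=0b0000, width=4)   # set bits 0 and 1
q = sr_cell(q, set_=0b0000, clr=0b0001, width=4)        # clear bit 0
print(bin(q))
```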
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to whitequark from comment #4)
> > Ah, there's another option that you might find interesting. Yosys has a `$sr` coarse cell.
>
> yes. this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library:
> http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21

Not really, I implemented the standard cells implementing the SR latches: a variant with NOR gates and one with NAND gates, and two drive strengths of each.
Yosys and synthesis, where the logic netlist coming from nMigen needs to be mapped to the cells in the library, is on a higher level than what I am doing.
(In reply to Staf Verhaegen from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > (In reply to whitequark from comment #4)
> > > Ah, there's another option that you might find interesting. Yosys has a `$sr` coarse cell.
> >
> > yes. this i _believe_ is what Staf has implemented for us, in the nsxlib Cell Library:
> > http://bugs.libre-riscv.org/show_bug.cgi?id=154#c21
>
> Not really, I implemented the standard cells implementing the SR latches: a variant with NOR gates and one with NAND gates, and two drive strengths of each.

i think that's what i meant.
https://gitlab.lip6.fr/vlsi-eda/alliance-check-toolkit/-/merge_requests/4/diffs

ok so these are called nsnrlatch and srlatch, and there are times-1 and times-4 variants.

> Yosys and synthesis, where the logic netlist coming from nMigen needs to be mapped to the cells in the library, is on a higher level than what I am doing.

my main reason to cross-ref what you kindly wrote for our use was to highlight that it had happened.

so let me try to understand:

* yosys $sr (and variants) will need to map to these cells (nsnrlatch_xN and srlatch_xN). will this need patches (or a plugin) into yosys?
* nmigen will need to map to $sr

i presume that given that yosys already has $sr, only nmigen will need augmenting to produce $sr?
> i presume that given that yosys already has $sr, only nmigen will need augmenting to produce $sr?

There's no augmenting necessary. I gave you the code above (the `Instance("$sr"...)`) that causes nMigen to emit a Yosys `$sr` cell.
(In reply to whitequark from comment #11)
> > i presume that given that yosys already has $sr, only nmigen will need augmenting to produce $sr?
>
> There's no augmenting necessary. I gave you the code above (the `Instance("$sr"...)` that causes nMigen to emit a Yosys `$sr` cell.

fantastic! oh wait - i missed comment 8 (doh - how??)
http://bugs.libre-riscv.org/show_bug.cgi?id=276#c8

ok so the simulation(s) would be what need to recognise that.

could you let me know what amount you'd be happy to receive as a donation from NLNet, for including $sr in cxxrtl?

also, i appreciate you suggested that we do the exploration, however that would mean that someone else (other than you) receives money for doing so :) as we'd very much like to support your work - and because we are a small team there is such a lot else to do - would you be willing to add a SRNAND latch example into cxxrtl for example?
(In reply to Luke Kenneth Casson Leighton from comment #12)
> could you let me know what amount you'd be happy to receive as a donation from NLNet, for including $sr in cxxrtl?

As it turns out, there's actually already support for `$dffsr` latch in cxxrtl, which is what you need but with an extra clock. So `$sr` support would be about ten lines of code I have to copy and paste from the `$dffsr` case.
(In reply to whitequark from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > could you let me know what amount you'd be happy to receive as a donation from NLNet, for including $sr in cxxrtl?
>
> As it turns out, there's actually already support for `$dffsr` latch in cxxrtl, which is what you need but with an extra clock. So `$sr` support would be about ten lines of code I have to copy and paste from the `$dffsr` case.

*snort*. funny. well, you're spending time here, which is important as it defines the scope, solves the problem, and, due to the critical importance for the project, easily justifies giving you something.
Cxxrtl is upstream in Yosys, including SR latch support.
(In reply to whitequark from comment #15)
> Cxxrtl is upstream in Yosys, including SR latch support.

fantastic, that's really appreciated. we have *sigh* a bureaucratic MoU to be signed (4 - now 5 - people waiting on EUR for that one).

not least because of all the other support that you've given, and recognising that this is really strategically crucial for us, if you have time to write a short demo / unit test of the srlatch i think we can easily justify increasing this one to EUR 300?

(although i'm sure it would be fairly trivial for you to write, that's because you're most familiar with the code).
Created attachment 50: SR latch demonstration
(In reply to whitequark from comment #17)
> Created attachment 50
> SR latch demonstration

For some reason, bugzilla doesn't send out notifications for creating attachments.

Thanks, whitequark, looks good to me!
very useful tutorial / demo on cxxrtl simulation https://tomverbeure.github.io/2020/08/08/CXXRTL-the-New-Yosys-Simulation-Backend.html