https://github.com/nmigen/nmigen/issues/439 - cxxsim fsm bug (DONE) cxxsim needs some adjustments after its first iteration encountered issues
whitequark, hi, i saw this extensive explanation: https://freenode.irclog.whitequark.org/nmigen/2020-09-02 if you feel that you need to do more than would be reasonably fairly covered by EUR 1500 please do not think that you have to limit decisions because of that. we can go up to EUR 3000 as that is the available budget within #362.
whitequark hi, apologies i am dealing woth a deadline and rtlil sim is sufficient for that. cesar reports that he tried running test_issuer.py with the latest revisions to cxxsim and it "hung" i.e. likely went into a combinatorial loop. libresoc has some unusual combinatorial bypass modules which may be triggering unanticipated behaviour, making it a very good way to catch errors. unfortunately the fact that it is such a massive design means that creating MVCEs is tricky (not helped by the fact that i am occupied with the deadline). cesar if you can help, you can do a couple of things: 1) run any of the test_pipe_caller.py unit tests in soc/fu/* 2) run all of the compunit tests in soc/fu/compunits/test/* 3) cut back test_issuer.py as i described in the list message by changing the number of ALUs (to zero) and also cutting out the tests that are added to it (at the bottom of test_issuer.py) you may not be able to cut out LDST entirely without an extensive overhaul of test_issuer.py if test_ldst_compunits.py works but a severely cut back test_issuer.py does not then we have narrowed things down considrrably. i will add some EUR for you to investigate this, cesar. whitequark this will not come from the amount agreed with you.
(In reply to Luke Kenneth Casson Leighton from comment #2) > cesar if you can help, you can do a couple of things: Sure. Will let you know what I can find.
(In reply to Cesar Strauss from comment #3) > (In reply to Luke Kenneth Casson Leighton from comment #2) > > > cesar if you can help, you can do a couple of things: > > Sure. Will let you know what I can find. star. there's at least 20 pipe_caller and compunit tests that can be used, to at least narrow down the area.
All non-broken pipe_caller and compunit tests seem to hang in cxxsim. I will try to see what they have in common, and investigate further. Interestingly, these seem to hang even in pysim: src/soc/fu/compunits/test/test_alu_compunit.py src/soc/fu/compunits/test/test_cr_compunit.py The following are broken for me, and were not tested: src/soc/fu/alu/test/test_pipe_caller.py src/soc/fu/branch/test/test_pipe_caller.py src/soc/fu/cr/test/test_pipe_caller.py src/soc/fu/compunits/test/test_div_compunit.py src/soc/fu/ldst/test/test_pipe_caller.py src/soc/fu/mul/test/test_pipe_caller.py src/soc/fu/shift_rot/test/test_pipe_caller.py
(In reply to Cesar Strauss from comment #5) > All non-broken pipe_caller and compunit tests seem to hang in cxxsim. > > I will try to see what they have in common, and investigate further. > > Interestingly, these seem to hang even in pysim: > > src/soc/fu/compunits/test/test_alu_compunit.py > src/soc/fu/compunits/test/test_cr_compunit.py damnit i've not been keeping these up-to-date, relying on test_issuer instead. this means that one of the operands needed is not being set up properly this leaves either the rd.req "unanswered" (with the obvious result that the compunit waits indefinitely for an operand that is never going to be supplied) or the wr.req "unanswered" (with the obvious result that the compunit waits indefinitely for a result to be read that is never going to be read). this is since creating the PowerDecoder2 "subsets". test_issuer.py uses its own completely separate (full) PowerDecoder2 called "simdec", but the */test_pipe_caller.py tests do not do that. they'll need to be updated.
ok cesar i found the issue, it is unique to the simulator which needs an assembly code ("asmcode") which is the comment field from the CSV files, to be able to interpret the assembly listings in the unit tests. can you try the tests again, they should now work.
(In reply to Luke Kenneth Casson Leighton from comment #7) > can you try the tests again, they should now work. Thanks, it helped a lot. Every */test_pipe_caller.py test is affected. I already found a single combination of inputs that reproduces the issue, for those cases. I will proceed with the reducing task.
(In reply to Cesar Strauss from comment #8) > (In reply to Luke Kenneth Casson Leighton from comment #7) > > can you try the tests again, they should now work. > > Thanks, it helped a lot. > > Every */test_pipe_caller.py test is affected. > > I already found a single combination of inputs that reproduces the issue, > for those cases. fantastic. and yet, compalu_multi is not? that narrows it down quite considerably, given that there's actually not a huge difference between them. > I will proceed with the reducing task. awesome.
(In reply to Luke Kenneth Casson Leighton from comment #9) > > fantastic. and yet, compalu_multi is not? that narrows it down quite > considerably, given that there's actually not a huge difference between them. I hadn't tried compalu_multi. It does seem to be affected. I had only tried alu_fsm.py, which is not affected.
(In reply to Cesar Strauss from comment #10) > I hadn't tried compalu_multi. It does seem to be affected. ok that's also good as it is way less code. > > I had only tried alu_fsm.py, which is not affected. which is even more interesting. the likely candidate then, taking a wild guess, is the use of the latchregister function. this creates a mux with a loop that is guaranteed 100% not to ever become a combinatorial loop however that requires manual analysis to verify.
Found it! https://github.com/nmigen/nmigen/issues/439#issuecomment-699641967 In a wild turn of events, the hang turned out to be triggered just by updating an input wider than 32 bits, in the simulation process. I guess we tended to use 32 bits or less in our own experiments, which resulted a lack of coverage for wider signals. This was only caught in higher-level tests, which used the full 64 bits of our processor design.
an intriguing but ultimately simple find :) it is quite normal to expect the remaining bits (MSBs) to be set to zero, and also for any bits that will not fit in the LHS to be thrown away. this for example is how all 1s are set: x.eq(~0) or x.eq(-1) where x is unsigned and way less than 64 bit.
(In reply to Luke Kenneth Casson Leighton from comment #13) > it is quite normal to expect the remaining bits (MSBs) to be set to zero, > and also for any bits that will not fit in the LHS to be thrown away. > > this for example is how all 1s are set: x.eq(~0) or x.eq(-1) where x is > unsigned and way less than 64 bit. Any non-zero update triggers the issue, provided that the signal being updated has a width greater than 32 bits. For instance, in my trials, alu_fsm.py and alu_hier.py were not affected. This is because they instance 8-bit and 16-bit ALUs in their unit tests. I can easily make them hang by instancing a 64-bit ALU instead. On the other hand, compalu_multi hangs because the imm_data field in the CompALUOpSubset Record is hardcoded to be 64 bits.
(In reply to Cesar Strauss from comment #14) > (In reply to Luke Kenneth Casson Leighton from comment #13) > > it is quite normal to expect the remaining bits (MSBs) to be set to zero, > > and also for any bits that will not fit in the LHS to be thrown away. > > > > this for example is how all 1s are set: x.eq(~0) or x.eq(-1) where x is > > unsigned and way less than 64 bit. > > Any non-zero update triggers the issue, provided that the signal being > updated has a width greater than 32 bits. interesting. that's a very important distinction that should go in the nmigen bugreport. i suspect it's likely down to the use of ctypes. it would be worthwhile specifically adding in very long signal unit tests (128, 256, 512 bit).
Whitequark fixed the hanging cxxsim issue in nMigen. Now that the tests no longer hang, I can see that the test_pipe_caller.py tests do give the same results as pysim. However, the compunit tests seem to wait for a result that never comes. I'll continue investigating.
(In reply to Cesar Strauss from comment #16) > Whitequark fixed the hanging cxxsim issue in nMigen. i saw that, fantastic, and thank you for narrowing it down. > Now that the tests no longer hang, I can see that the test_pipe_caller.py > tests do give the same results as pysim. excellent. > However, the compunit tests seem to wait for a result that never comes. and pysim works? > I'll continue investigating. ok
(In reply to Luke Kenneth Casson Leighton from comment #17) > > However, the compunit tests seem to wait for a result that never comes. > > and pysim works? Yes. I will proceed by comparing pysim and cxxsim traces for ~/src/soc/src/soc/experiment/test/test_compalu_multi.py
(In reply to Cesar Strauss from comment #18) > However, the compunit tests seem to wait for a result that never comes. This is the problem: self.shadown_i = Signal(name="cu_shadown_i", reset=1) On pysim, it keeps high, even if not driven by the simulation process. On cxxsim, it stays low. So, shadown_i is kept active, the result is never written, and the process keeps waiting for a result that never comes. Actually, if I explicitly set shadown_i high, most of the compunit tests begin to pass already.
(In reply to Cesar Strauss from comment #19) > (In reply to Cesar Strauss from comment #18) > > However, the compunit tests seem to wait for a result that never comes. > > This is the problem: > > self.shadown_i = Signal(name="cu_shadown_i", reset=1) > > On pysim, it keeps high, even if not driven by the simulation process. > > On cxxsim, it stays low. ah yes. > Actually, if I explicitly set shadown_i high, most of the compunit tests > begin to pass already. ahh very interesting. this might be optimised out (by yosys), or just forgot that reset arg can be nonzero. the signal is not driven yet because we have not yet added the shadow matrices.
waiting for merge, but not a blocker for this task: * https://github.com/YosysHQ/yosys/pull/2466 * https://github.com/YosysHQ/yosys/pull/2459