Bug 805 - gram randomly comes up in an unworkable condition
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Other
Importance: --- major
Assignee: Luke Kenneth Casson Leighton
Depends on: 811
Blocks: 813
Reported: 2022-04-09 21:07 BST by tpearson
Modified: 2023-08-29 22:46 BST
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
Description tpearson 2022-04-09 21:07:59 BST
The current implementation of gram is bistable, with a working state and a non-working state, randomly entered at FPGA clock tree reset.  If the FPGA enters the working state, memory accesses are reliable until another reset (or reprogram) is issued, at which point the device has a 50% chance to enter the non-working state.  Similarly, in the non-working state the device has a 50% chance of entering the working state on reset / reprogram.

After significant debugging effort, the root cause of this has been located.

In a nutshell, the DDR3 interface blocks require two aligned clocks, the ECLK (DDR clock) and SCLK (SDR clock).  In the working state, the transmitter blocks are aligned with the receiver blocks, and in the non-working state they are misaligned by 1T (1 ECLK period, or 1/2 SCLK period).

The ECP5 relies on a single master reset wire, shared among all DDR primitives in a specific logical controller, to synchronize the interface blocks at startup.  While this reset wire has been plumbed to the data I/O blocks, it is absent from the address / command blocks where it is hardwired by nmigen to 0.  This means the address / command generator can be up to 1/2 SCLK out of alignment with the rest of the system, and this unwanted phase delay is randomly applied at startup from analog stochastic behavior.
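
The bistability described above can be illustrated with a toy model.  This sketch is not gram or ECP5 code.  It assumes that each DDR block derives its SCLK phase from a divide-by-2 of ECLK, that the divider powers up as 0 or 1 with equal probability, and that asserting reset forces it to 0.  The function names and cycle counts are made up for illustration.

```python
import random


def sclk_phase_after(eclk_cycles, initial_state, reset_cycles=0):
    """Toy divide-by-2 of ECLK: return the divider state after eclk_cycles.

    For the first reset_cycles cycles the state is held at 0 (reset asserted);
    after that it toggles once per ECLK cycle.
    """
    state = initial_state
    for cycle in range(eclk_cycles):
        if cycle < reset_cycles:
            state = 0      # reset asserted: phase forced to a known value
        else:
            state ^= 1     # free-running: toggle on every ECLK cycle
    return state


def misaligned(data_init, cmd_init, cmd_has_reset,
               eclk_cycles=16, reset_cycles=4):
    """True if the address/command block ends up 1 ECLK off the data block.

    The data block always sees the shared reset.  The address/command block
    only sees it when cmd_has_reset is True; otherwise its reset is tied to 0.
    """
    data = sclk_phase_after(eclk_cycles, data_init, reset_cycles)
    cmd = sclk_phase_after(
        eclk_cycles, cmd_init, reset_cycles if cmd_has_reset else 0
    )
    return data != cmd


def misaligned_fraction(cmd_has_reset, trials=10_000, seed=0):
    """Fraction of simulated bring-ups that end up misaligned."""
    rng = random.Random(seed)
    bad = sum(
        misaligned(rng.randint(0, 1), rng.randint(0, 1), cmd_has_reset)
        for _ in range(trials)
    )
    return bad / trials


if __name__ == "__main__":
    print("reset tied to 0:   ", misaligned_fraction(cmd_has_reset=False))
    print("shared reset wired:", misaligned_fraction(cmd_has_reset=True))
```

Under these assumptions the fraction is close to 0.5 when the address / command reset is tied to 0, and exactly 0.0 when both blocks share the reset, which is consistent with the roughly 50% failure rate seen on hardware.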
Comment 1 Luke Kenneth Casson Leighton 2022-04-09 21:58:09 BST
(In reply to tpearson from comment #0)

> The ECP5 relies on a single master reset wire, shared among all DDR
> primitives in a specific logical controller, to synchronize the interface
> blocks at startup.  While this reset wire has been plumbed to the data I/O
> blocks, it is absent from the address / command blocks where it is hardwired
> by nmigen to 0. 

urrr.

do you happen to know of any other implementation that gets this right?
if i have something to work from i can take a look.
Comment 2 tpearson 2022-04-09 22:08:13 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to tpearson from comment #0)
> 
> > The ECP5 relies on a single master reset wire, shared among all DDR
> > primitives in a specific logical controller, to synchronize the interface
> > blocks at startup.  While this reset wire has been plumbed to the data I/O
> > blocks, it is absent from the address / command blocks where it is hardwired
> > by nmigen to 0. 
> 
> urrr.
> 
> do you happen to know of any other implementation that gets this right?
> if i have something to work from i can take a look.

LiteDRAM gets it right:

https://github.com/enjoy-digital/litedram/blob/15f7ba27138367f21832e5c00e7882db8a6fab54/litedram/phy/ecp5ddrphy.py#L229
Comment 3 Luke Kenneth Casson Leighton 2022-04-09 22:58:53 BST
(In reply to tpearson from comment #2)

> LiteDRAM gets it right:
> 
> https://github.com/enjoy-digital/litedram/blob/
> 15f7ba27138367f21832e5c00e7882db8a6fab54/litedram/phy/ecp5ddrphy.py#L229

got it.  i know what to do now.

https://gitlab.com/nmigen/nmigen/-/issues/2
Comment 4 Luke Kenneth Casson Leighton 2022-04-15 11:10:18 BST
part of the solution here is to take the rather drastic but necessary
step of altering the nmigen API by adding a reset line to the Pin
data structure.

there's really no other way to get down to the DDR Instances with
the reset signal needed.

whilst investigating i noticed that the assumption that the reset pads
are "straight" (xdr=1) is wrong: they're also supposed to be 4x phased
(xdr=4), which is quite fascinating, it means that rising and falling
edge *reset* lines do different things inside the DRAM IC.

i've now wired up all the DRAM pad resets to ResetSignal("dramsync")
which in theory should start to get stability.  an early check showed
that yes they were locking much more often.  making rst xdr=4 should
also help, it meant that 1/2 the IOpads were not being properly reset
at all (hm)