Bug 806 - Nest should be able to run at different clock rate than main CPU
Summary: Nest should be able to run at different clock rate than main CPU
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other
OS: Other
Importance: --- enhancement
Assignee: tpearson
URL:
Depends on: 807
Blocks: 813
Reported: 2022-04-10 03:26 BST by tpearson
Modified: 2022-08-29 01:31 BST
CC: 2 users

See Also:
NLnet milestone: NLNet.2019.10.043.Wishbone
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 249
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
tpearson={amount=2000,submitted=2022-04-28,paid=2022-06-14}


Description tpearson 2022-04-10 03:26:29 BST
The main CPU maximum frequency currently limits the overall speed of the SoC, including the "nest" (DRAM controller, HyperRAM, etc.) due to the use of a synchronous clock Wishbone interconnect.

The CPU and any synchronous peripherals (e.g. UART, other low-speed "southbridge" components) should be moved to their own clock domain on a dedicated Wishbone bus, then bridged in with an asynchronous converter to the high speed nest bus.

In ASIC a similar design will be needed, but in that case the CPU will be running faster than the nest, and a third even slower bus would be used for the southbridge.

Implementation is in progress; this bug report is a placeholder and a location to track progress.
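
A rough nmigen sketch of the intended split (the domain and module names here are illustrative, not the actual ls2 code):

    from nmigen import Elaboratable, Module, ClockDomain

    class TwoDomainSoC(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            # slow domain: CPU plus low-speed "southbridge" peripherals
            m.domains += ClockDomain("sync")      # e.g. 50 MHz
            # fast domain: the "nest" (DRAM controller, HyperRAM, ...)
            m.domains += ClockDomain("dramsync")  # e.g. 100 MHz
            # an asynchronous Wishbone bridge connects the two buses
            return m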
Comment 1 tpearson 2022-04-11 20:51:20 BST
Initial implementation available as of soc cfb86bb7 and ls2 2d7021ba (ddr3 branch).  Tested and apparently working at 50MHz core / 100MHz nest, although the nest doesn't currently meet timing at that frequency and will need further optimization / bus splits.

fw..DRAM init... initseq
done
MR profile: 00000B20 00000806 00000200 00000000
Rdly
p0: 01110011
Rdly
p1: 01110011
Auto calibrating... find mindone
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
Reloading built-in calibration profile...DRAM test...
done
Comment 2 Luke Kenneth Casson Leighton 2022-04-11 22:28:14 BST
@@ -276,6 +278,8 @@ class DDR3SoC(SoC, Elaboratable):
         #       |     |
         #       +--+--+
         #          |
+        #      WBAsyncBridge
+        #          |
         #      64to32DownCvt
         #          |


ah, i was expecting / thinking that should
go explicitly on the DRAM controller.

speed of other peripherals does not matter (at all)
and having more running at 100 mhz (64to32, uart,
arbiter) actually risks more combinatorial gate
delay and not making 100mhz.

ultimately the async bridge is a temporary hack that
has to go, pending Async FIFOs going in the actual
DRAM controller at the DFII level, or even connecting
directly to the PHY

also, doing things that way is a much less
disruptive patch.
Comment 3 Luke Kenneth Casson Leighton 2022-04-11 22:46:19 BST
irony is, if the core were madly fast, Async on all the peripherals
would be perfect.

however the WB64to32, because it is specifically designed with a synchronous
one-clock delay, serves a double role of "protecting" the core from
peripheral combinatorial delay.
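
a minimal sketch of that double role (signal names hypothetical):
registering the response means downstream combinatorial delay
terminates at a flop instead of reaching the core:

    from nmigen import Module, Signal

    m = Module()
    periph_ack = Signal()  # arrives late, after long combinatorial paths
    core_ack = Signal()    # what the core sees
    m.d.sync += core_ack.eq(periph_ack)  # one-clock delay cuts the path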
Comment 4 Luke Kenneth Casson Leighton 2022-04-11 22:53:06 BST
https://git.libre-soc.org/?p=ls2.git;a=blob;f=src/ecp5_crg.py;h=36c8f1dd4f5ecde53a30f8b949862abfdb828117;hb=2d7021ba09c3e08e1780bf73c26deeaccf689221


see dramsync at the bottom.

the dramsync speed is supposed to be passed in as a
parameter, dram_freq (not a new frequency
and a new domain called "cpu")

the plan i had in mind was: if dram_freq is
None, then dramsync.eq(sync), making them the same;
otherwise add a new call

not

 240         else:
 241             pll.create_clkout(ClockSignal("cpu"), self.core_clk_freq)

this

 240         else:
 241             pll.create_clkout(ClockSignal("dramsync"), self.dram_clk_freq)

remove the cpu freq, remove the replacement of clk_freq with core_freq:
a less disruptive patch that adds one line, "dram_freq"
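
a sketch of that plan in one place (hypothetical helper, not committed
code; pll.create_clkout as used in ecp5_crg.py above):

    def create_dram_clock(m, pll, dram_freq):
        if dram_freq is None:
            # same speed: alias dramsync onto sync
            m.d.comb += ClockSignal("dramsync").eq(ClockSignal("sync"))
        else:
            # separate, faster nest clock from the PLL
            pll.create_clkout(ClockSignal("dramsync"), dram_freq)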
Comment 5 tpearson 2022-04-12 02:13:45 BST
Gotcha.

Bear in mind DRAM isn't the only peripheral that benefits from a clock boost -- for example Tercel has a maximum SPI speed of 1/2 the Wishbone bus clock, and the Flash devices we use on nearly all of the POWER products and many of the ECP5 boards can actually run at 100MHz SPI clock...
Comment 6 tpearson 2022-04-12 02:18:31 BST
I still want us to stop thinking of the CPU and fabric all running at the same clock speed -- any real-world ASIC will not have that "feature"...

I can move the arbiter after the downconverters, but I'll probably end up creating a couple of different peripheral clock domains because both Tercel and the DRAM controller can use the higher clock speeds for higher bandwidth.
Comment 7 Luke Kenneth Casson Leighton 2022-04-12 02:41:15 BST
(In reply to tpearson from comment #5)
> Gotcha.
> 
> Bear in mind DRAM isn't the only peripheral that benefits from a clock boost
> -- for example Tercel has a maximum SPI speed of 1/2 the Wishbone bus clock,
> and the Flash devices we use on nearly all of the POWER products and many of
> the ECP5 boards can actually run at 100MHz SPI clock...

i know: the problem is that most wishbone peripherals are only WB 3 (classic)
compliant. you *cannot* connect WB 3 (classic) to WB 4 (pipeline, with stall)
together without doing the "stall=stb&~ack" trick.

unfortunately (i checked), the alex forencich async-wb-bridge is *WB 3*
compliant *only*.  no, burst-mode is *not* the same as "WB4 pipeline" mode.

neither is GRAM, or uart16550, or tercel, or in fact anything at all that
i've been able to find!!

not litex, not nmigen-soc (until i fixed it), nothing.  they're *all* WB 3
(classic) compliant (or in the case of nmigen-soc, broken until i fixed it)

paul designed microwatt i/d-cache to be WB 4.0 compliant pipelined.
that means you either *have* to connect it downstream to WB 4.0 compliant
(pipelined, w/stall) or you have to "fake" it with the stall=stb&~ack trick.

but, the other way round is *not* possible - WB3 classic master can *not* be
connected to WB 4.0 pipeline slave.  and unfortunately, the AsyncWBBridge
is WB3-classic-master to WB3-classic-slave.

this is the root cause of the problems that you're seeing.

the "solution" is to move the AsyncWBBridge to directly in front of the
GRAM controller.

it will need to have a wb.Interface Record that does contain a "stall"
signal on the slave side, which is then "faked" just like the way that
drambone.bus.stall is currently faked:

 521  # grrr, same problem with drambone: not WB4-pipe compliant
 522  comb += drambone.bus.stall.eq(drambone.bus.cyc & ~drambone.bus.ack)

that then needs to be REMOVED because now the AsyncWBBridge (which is
WB 3 classic only) will be connected to *drambone* which is *also*
WB 3 classic only.

if you think of it in terms of putting ASyncWBBridge into the actual
gram controller itself (presenting the exact same interface), then
the ls2.py code does not change (at all) except for the clock domains.

(In reply to tpearson from comment #6)

> I still want us to stop thinking of the CPU and fabric all running at the
> same clock speed -- any real-world ASIC will not have that "feature"...

i know.  however that will require AsyncWBBridge to be upgraded to
add "stall" signalling, because of above: once something drops down
to WB 3 classic you cannot go back "up" to WB 4 pipeline mode.

to use AsyncWBBridge as it stands, it would be necessary to propagate
the "stall=stb&~ack" trick right the way up to I-Cache and D-Cache,
instantly halving transfer speeds.
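
for reference, the trick itself is a one-line combinatorial fake
(sketch only; bus names hypothetical, and note the drambone variant
quoted above uses cyc rather than stb):

    def fake_stall(m, bus):
        # bus: a wishbone Interface created with features={'stall'};
        # stall deasserts exactly when the WB3 slave can take a cycle,
        # limiting a WB4-pipe master to one outstanding request
        m.d.comb += bus.stall.eq(bus.stb & ~bus.ack)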

> I can move the arbiter after the downconverters, but I'll probably end up
> creating a couple of different peripheral clock domains because both Tercel
> and the DRAM controller can use the higher clock speeds for higher bandwidth.

tercel will need its own separate ASyncWBBridge for now.
Comment 8 Luke Kenneth Casson Leighton 2022-04-12 12:49:18 BST
https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=8d8b7e7e92c2728e484494fbd70b5500394b3f64


        #   +---decoder----+--------+---------+-------+--------+ |
        #   |      |       |        |         |       |        | |
        #  uart  XICS    CSRs   AsyncBridge XIP SPI HyperRAM   EthMAC
        #                           |
        #                         DRAM
Comment 9 tpearson 2022-04-14 02:00:47 BST
Bridges added in commits:

gram: 6f433acb
soc:  58393ccb
ls2:  9545d27f
Comment 10 Luke Kenneth Casson Leighton 2022-04-14 02:27:27 BST
(In reply to tpearson from comment #9)
> Bridges added in commits:
> 
> gram: 6f433acb

needs to be reverted (force master push, this
change should not have been made, DomainRenamer
is already doing everything)

> soc:  58393ccb

all good

> ls2:  9545d27f

needs to be moved to branch, restore master
to prior commit.
Comment 11 tpearson 2022-04-14 02:31:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to tpearson from comment #9)
> > Bridges added in commits:
> > 
> > gram: 6f433acb
> 
> needs to be reverted (force master push, this
> change should not have been made, DomainRenamer
> is already doing everything)

OK, done.

> > soc:  58393ccb
> 
> all good
> 
> > ls2:  9545d27f
> 
> needs to be moved to branch, restore master
> to prior commit.

Done, moved to async branch.
Comment 12 Luke Kenneth Casson Leighton 2022-04-14 10:32:41 BST
rrright.  i think i know what's going on here.

* the original crg has dramsync hard-coded to sync.  this is nice and simple

* dramsync therefore, being an alias for sync, is misleading

* DRAM (DDR3) pads need the 4x IO spec (xdr=4) which is why you get
  platform.request(xdr={'dqs': 4})

* this causes the 4-phase clock thing to be REQUIRED - eclk, sclk -
  to be set up by the PLL, which in turn needs a 2x clock to be set
  up i.e. an Instance("ECLKSYNCB") and an Instance("CLKDIVF")

* in the current ecp5_crg.py - totally undocumented and without any comments -
  this is done for the ***SYNC*** domain ***NOT*** the ***dramsync*** domain

to move DDR3 to its own synchronous domain, a function will be needed
which calls pll.create_clkout, that does the:

* creation of an ECLKSYNCB,
* creation of a CLKDIVF,
* creation of a {%s}2x_unbuf,
* creation of a {%s}2x
* creation of a {%s} domain

and wires them all together in a non-hardwired-way.

   def do_2x_phase_clock_create(name, freq):

        # create the buffered/unbuffered 2x (e.g. 200Mhz) domains
        # and the 1x domain, from the PLL
        cd_2x = ClockDomain("%s2x" % name, local=False)
        cd_2x_unbuf = ClockDomain("%s2x_unbuf" % name,
                                      local=False, reset_less=True)
        cd_ = ClockDomain("%s" % name, local=False)

        # create PLL clock at 2x freq, then buffer it through ECLKSYNCB
        pll.create_clkout(ClockSignal("%s2x_unbuf" % name), 2*freq)
        m.submodules += Instance("ECLKSYNCB",
                i_ECLKI = ClockSignal("%s2x_unbuf" % name),
                i_STOP  = 0,
                o_ECLKO = ClockSignal("%s2x" % name))
        m.domains += cd_2x_unbuf
        m.domains += cd_2x
        m.domains += cd_

        # generate the 1x (e.g. 100Mhz) clock from the 2x via CLKDIVF
        m.submodules += Instance("CLKDIVF",
            p_DIV="2.0",
            i_ALIGNWD=0,
            i_CLKI=ClockSignal("%s2x" % name),
            i_RST=0,
            o_CDIVX=ClockSignal("%s" % name))

and then call that **TWICE**

* once for "sync"
* once for "dramsync"

now, actually, bizarrely, if there are no other peripherals that are
going to be using 4x xdr in the "sync" domain, i suspect it will *not*
be necessary to call do_2x_phase_clock_create("sync") but instead just
do a straight, boring, 1x (just like the "init" domain)
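
hypothetical usage, assuming the freq argument shown in the sketch
above (with "sync" possibly dropping to a plain 1x clkout per the
previous paragraph):

    do_2x_phase_clock_create("sync", clk_freq)           # cpu/peripherals
    do_2x_phase_clock_create("dramsync", dram_clk_freq)  # ddr3 nest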

i'll add that function and commit it, along with some comments.
Comment 14 Luke Kenneth Casson Leighton 2022-04-14 11:52:02 BST
this

    https://git.libre-soc.org/?p=gram.git;a=commitdiff;h=dedba0951797896c4f6c44238ebffc7e4472f0f2

--- a/gram/phy/ecp5ddrphy.py
+++ b/gram/phy/ecp5ddrphy.py
@@ -40,7 +40,7 @@ class ECP5DDRPHYInit(Elaboratable):
         lock = Signal()
         lock_d = Signal()
         m.submodules += Instance("DDRDLLA",
-            i_CLK=ClockSignal("sync2x"),
+            i_CLK=ClockSignal("dramsync2x"),
             i_RST=ResetSignal("init"),
             i_UDDCNTLN=~update,
             i_FREEZE=freeze,

goes with a corresponding/matching this:

             drs = DomainRenamer({"sync": "dramsync",
                                  "sync2x": "dramsync2x"})

and this:

        # and a dram 2x sigh
        cd_dramsync2x = ClockDomain("dramsync2x", local=False)
        m.domains += cd_dramsync2x
        m.d.comb += ClockSignal("dramsync2x").eq(ClockSignal("sync2x"))

which should have bloody well been added right from the start. sigh.

anyway.

i've now added this:

https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=ea438cb108e627de4c1fd1bf96e273fc8659a2b9

+        if self.dram_clk_freq is not None:
+            self.phase2_domain(m, pll, "dramsync", self.dram_clk_freq)
+        else:
+            # alias dramsync and dramsync2x to sync and sync2x

which means that the ASyncBridge can be added with an optional
parameter, in a way that doesn't break things.
Comment 15 Luke Kenneth Casson Leighton 2022-04-14 12:49:19 BST
no. has to be done in the parent because it is only the parent
that knows if the WB4-pipe is connected to a WB3-classic.

burying the WB3-to-4-botch inside the peripheral will only
make using it harder due to undocumented side-effects.
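
a sketch of "done in the parent" (attribute names taken from the
diff below; usage hypothetical):

    bridge = WBAsyncBridge(master_features={'stall'})
    m.submodules.bridge = bridge
    # only the parent knows a WB4-pipe master is attached,
    # so only the parent applies the stall fake:
    m.d.comb += bridge.master_bus.stall.eq(
                    bridge.master_bus.cyc & ~bridge.master_bus.ack)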


diff --git a/src/soc/bus/wb_async.py b/src/soc/bus/wb_async.py
index 60844e56055ea1e2402ee92fc9aa3754fbe41325..6775427fd300d4e2113122ba3c9e5354332458fc 100644 (file)
--- a/src/soc/bus/wb_async.py
+++ b/src/soc/bus/wb_async.py
@@ -135,16 +135,6 @@ class WBAsyncBridge(Elaboratable):
                             i_wbs_rty_i=slave_rty
                             );
 
-        # Synthesize STALL signal for master port
-        if hasattr(self.master_bus, "stall"):
-            comb += self.master_bus.stall.eq(self.master_bus.cyc & ~self.master_bus.ack)
-
-        # Convert incoming slave STALL signal to a format that the async bridge understands...
-        if hasattr(self.slave_bus, "stall"):
-            comb += slave_ack.eq(self.slave_bus.ack & ~self.slave_bus.stall)
-        else:
-            comb += slave_ack.eq(self.slave_bus.ack)
-
         # Wire unused signals to 0
         comb += slave_err.eq(0)
         comb += slave_rty.eq(0)
Comment 16 Luke Kenneth Casson Leighton 2022-04-14 13:11:32 BST
not quite.  

https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=b586cc83eea22270d7914e7315565118c18fbf75

there are 4 CLKDIVFs available and 10 ECLKSYNCBs
so it is possible to have separate spi, dram, and hyperram
clock domains

Info:                CLKDIVF:     1/    4    25%
Info:              ECLKSYNCB:     1/   10    10%

i'd prefer they were done one at a time (there is a project
"minimum patch" rule); let's focus on DRAM first

dram_clk_freq.  also please keep to under 80 chars

+                self.ddrphy = drs(ECP5DDRPHY(ddr_pins,
+                                  sys_clk_freq=memory_clk_freq))

ah... yyeah that's ok. for now.


yes, annoyingly:

-            drambone = gramWishbone(dramcore, features={'stall'})
+            drambone = gramWishbone(dramcore)

needing two Asyncs here is annoying, removing the features={stall}
is the right thing to do, it's the WBAsyncBridge that needs it.

yes:

+            self.drambone_async_br = WBAsyncBridge(
+                                         master_features={'stall'})

except stall=cyc&~ack has to be applied (i took it out
of WBAsyncBridge because it will quickly become hell:
unintended consequences)

@@ -470,6 +557,7 @@ class DDR3SoC(SoC, Elaboratable):
         self.memory_map = self._decoder.bus.memory_map
 
         self.clk_freq = clk_freq
+        self.memory_clk_freq = memory_clk_freq
         self.fpga = fpga

dram_clk_freq.  then, later, spi_clk_freq.  if they happen to be
made the same they can probably be aliased. TODO on that one.

[EDIT: no, i have to get on with the linux bring-up
y'know what? i'll pull this over.  there's great bits of code here
in easy-to-cut-paste chunks]
Comment 17 tpearson 2022-04-14 17:56:50 BST
(In reply to Luke Kenneth Casson Leighton from comment #16)
> dram_clk_freq.  then, later, spi_clk_freq.  if they happen to be
> made the same they can probably be aliased. TODO on that one.
> 
> [EDIT: no, i have to get on with the linux bring-up
> y'know what? i'll pull this over.  there's great bits of code here
> in easy-to-cut-paste chunks]

OK, so as I understand things you'll pull the bits you want for now, and when that's done I'll go back and verify the overall clock trees etc. for DRAM?

We do need to be careful here with clock domains: there are a fixed number of clock routing resources (device dependent), and it's very easy to back oneself into an unnecessary corner, halting future development by consuming too many of them early on...

There's one rather obvious option that hasn't been applied here either -- what about simply parameterizing the DRAM clock names at module instantiation?  That's basically what I am doing for the WBASyncBridge clock domain connections.
Comment 18 tpearson 2022-04-14 18:22:45 BST
So, backing up to try to de-muddy the waters a bit...

The ECP5 DDR3 controllers require very specific -- and *board-specific* -- clock and reset wiring between various hard IP blocks to function.  If this wiring is wrong, *best case* you'll get an unroutable design; worst case you'll get a conditionally stable design that works on some boots but never reliably on *every* boot.

As an example, here's the Arctic Tern clock wiring.  It's even more complex and yes it's absolutely mandatory to have it wired that way for *that specific DRAM configuration* on the board (the Lattice VIP board requires similar clock trees):

https://github.com/litex-hub/litex-boards/blob/5aab6f01b6d2044cb698e936bff11491f0d52584/litex_boards/targets/rcs_arctic_tern_bmc_card.py#L68

Normally, since the number of global clock routes is very small (IIRC 8 or less on ECP5), the design would use the slowest clock (a.k.a. "dramsync") that the board-specific wiring of the DDR3 hard IP generates, thus saving a single global clock route and avoiding async bridges entirely.

However, I think the fact that we have an async bridge available affords us the opportunity to move the DDR3 clock generator into its own board-specific module that is then wired up to the DDR3 controller, similar to how specific RAM devices are also "parameterized" via modules.  The caveat is that once that's done *there is no going back* -- *everything* talking to DDR3 must always go through the bridge, since it's purposefully been moved off the main synchronous clock domain.

I can do the changes, but need to know if the above makes sense and you're OK with the general design.
Comment 19 Luke Kenneth Casson Leighton 2022-04-14 19:30:32 BST
(In reply to tpearson from comment #17)

> OK, so as I understand things you'll pull the bits you want for now, and
> when that's done I'll go back and verify the overall clock trees etc. for
> DRAM?

no, i said no: [EDIT: no, i have to get on with the linux bring-up]

as in: i was going to, then decided against it.

> There's one rather obvious option that hasn't been applied here either --
> what about simply parameterizing the DRAM clock names at module
> instantiation?

this is precisely what DomainRenamer is designed to make
unnecessary.  everything can be added to one domain
(e.g. sync) and then, at instantiation time, that domain is
*renamed* to another domain, as the *instantiator* sees fit.

this is why we have this:

    drs = DomainRenamer({'sync': 'dramsync'})
    self.ddrphy = drs(ECP5DDRPHY(ddr_pins, sys_clk_freq=clk_freq))

what that is doing is taking absolutely everything that is
    "m.d.sync +="
and dropping it instead into:
    "m.d.dramsync"

and the only reason everything works at the moment is because, back
in ECP5CRG, this is done:

    "m.d.comb += ClockSignal("dramsync").eq(ClockSignal("sync"))


DomainRenamer() is what makes the global/search/replace "m.d.sync"
with "m.d.memory" (or anything else) completely redundant.

it can be done, instead, with:

    drs = DomainRenamer({"sync": "memory"})


>  That's basically what I am doing for the WBASyncBridge clock
> domain connections.

that's different, that's at the level that's controlling the renames.
Comment 20 Luke Kenneth Casson Leighton 2022-04-14 19:42:47 BST
(In reply to tpearson from comment #17)

> There's one rather obvious option that hasn't been applied here either --
> what about simply parameterizing the DRAM clock names at module
> instantiation?

can you explain (with an illustration) what you mean "parameterising
the DRAM clock names"?
Comment 21 tpearson 2022-04-14 19:55:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to tpearson from comment #17)
> 
> > There's one rather obvious option that hasn't been applied here either --
> > what about simply parameterizing the DRAM clock names at module
> > instantiation?
> 
> can you explain (with an illustration) what you mean "parameterising
> the DRAM clock names"?

On deeper investigation, it looks like the DomainRenamer may do what we need here.

However, the gram tree right now contains a bunch of instances of "dramsync".  From where I sit we need to change those back to "sync" so that the renamer is consistent, it's just plain confusing at the moment and I think this was the origin of some of my misconceptions...
Comment 22 tpearson 2022-04-14 20:00:15 BST
> However, the gram tree right now contains a bunch of instances of
> "dramsync".  From where I sit we need to change those back to "sync" so that
> the renamer is consistent, it's just plain confusing at the moment and I
> think this was the origin of some of my misconceptions...

EDIT: looks like dedba095 just made them at least internally consistent.  If we're good with those names, I'll add the renamer to get the DRAM on the memory clock domain and update the async branch...
Comment 23 Luke Kenneth Casson Leighton 2022-04-14 21:49:42 BST
(In reply to tpearson from comment #21)

> On deeper investigation, it looks like the DomainRenamer may do what we need
> here.

wheww :)
 
> However, the gram tree right now contains a bunch of instances of
> "dramsync".  From where I sit we need to change those back to "sync" so that
> the renamer is consistent, it's just plain confusing at the moment and I
> think this was the origin of some of my misconceptions...

well... ultimately, they should in fact be different, because when
the ASync-ness is moved down into GRAM *itself*, it would be *GRAM*
that... ah yes and it would also use DomainRenamer :)

wark-wark

let me try that again.  there are 3 components in GRAM

* grambone (WB-to-DFII)
* gramcore (DFII-to-PHY)
* gramphy  (PHY-to-IOpads of the ECP5)

we're going to put *all* of those into a 100 mhz clock domain
called dramsync, with a 2x domain to run the IOPads on 4-phase
(xdr=4)

then, ASyncWBBridge will front the cross-over from 40mhz (whatever)
in the "sync" domain to 100mhz (whatever) of GRAM.

what *actually* needs to happen is:

* grambone runs in the *nest* clock domain (aka the cpu)
* gramcore needs to have its *OWN* ASync FIFOs (not wishbone, at all)
           these FIFOs are at the *DFI command* level
* gramphy  stays at it is (100 mhz, dramsync)

something like that.  it's not straightforward, at all, but it basically
reduces the amount of crap that has to be run in GRAM at the highest
clock frequency.
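
a hedged sketch of that eventual DFI-level crossing (width and depth
illustrative only):

    from nmigen import Module
    from nmigen.lib.fifo import AsyncFIFO

    m = Module()
    # one flattened DFI command per entry, crossing from the nest/cpu
    # domain into the dram domain -- no Wishbone bridge involved
    # (assumes the "sync" and "dramsync" domains exist in the design)
    dfi_fifo = AsyncFIFO(width=64, depth=8,
                         w_domain="sync", r_domain="dramsync")
    m.submodules.dfi_fifo = dfi_fifo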

if the only thing running in GRAM at the higher clock frequency is the
gramphy and that's connected directly to Async FIFOs i see no reason why
attempts should not be made to run the PHY at 150-200 mhz.

but

i am getting way ahead.

right now the primary task is to decouple the "entirety-of-GRAM" to
get it, as a single job-lot, to run at 100 mhz.
Comment 24 Luke Kenneth Casson Leighton 2022-04-16 18:17:23 BST
preliminary simulation using runsimsoc2.sh iverilog shows successful
communication on at least one of the WBAsyncBridges.  it may even be
possible to push the DDR3 Controller to 200 mhz which is apparently
the limit for the ECP5, but it will be fun finding out.