The main CPU's maximum frequency currently limits the overall speed of the SoC, including the "nest" (DRAM controller, HyperRAM, etc.), because the Wishbone interconnect uses a single synchronous clock. The CPU and any synchronous peripherals (e.g. UART and other low-speed "southbridge" components) should be moved to their own clock domain on a dedicated Wishbone bus, then bridged in to the high-speed nest bus with an asynchronous converter. In an ASIC a similar design will be needed, but in that case the CPU will be running faster than the nest, and a third, even slower, bus would be used for the southbridge. Implementation is in progress; this bug report is a placeholder and a location to track progress.
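as a rough illustration of the intended split (the domain names and FIFO width here are assumptions, not the actual ls2 code), the clock-domain-crossing boundary looks something like this in nmigen - the real design uses a Wishbone async bridge rather than a bare FIFO:

    from nmigen import Elaboratable, Module, ClockDomain
    from nmigen.lib.fifo import AsyncFIFO

    class TwoDomainSoC(Elaboratable):
        """minimal sketch: slow "cpu" domain (core + southbridge) bridged
           to the fast default "sync" domain (the nest: DRAM, HyperRAM)"""
        def elaborate(self, platform):
            m = Module()
            m.domains += ClockDomain("cpu")   # slow core/southbridge clock
            # requests cross cpu -> nest through an async FIFO; a real
            # Wishbone bridge needs a second FIFO for responses too
            m.submodules.req_fifo = AsyncFIFO(width=64, depth=4,
                                              w_domain="cpu",
                                              r_domain="sync")
            return m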
Initial implementation available as of soc cfb86bb7 and ls2 2d7021ba (ddr3 branch). Tested and apparently working at 50MHz core / 100MHz nest, although the nest doesn't currently meet timing at that frequency and will need further optimization / bus splits.

    fw..DRAM init... initseq done
    MR profile: 00000B20 00000806 00000200 00000000
    Rdly p0: 01110011
    Rdly p1: 01110011
    Auto calibrating... find mindone
    Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
    Reloading built-in calibration profile...DRAM test... done
    @@ -276,6 +278,8 @@ class DDR3SoC(SoC, Elaboratable):
             #                 |  |
             #                 +--+--+
             #                    |
    +        #             WBAsyncBridge
    +        #                    |
             #             64to32DownCvt
             #                    |

ah, i was expecting / thinking that should go explicitly on the DRAM controller. the speed of the other peripherals does not matter (at all), and having more running at 100 mhz (64to32, uart, arbiter) actually risks more combinatorial gate delay and not making 100mhz. ultimately the async bridge is a temporary hack that has to go, pending Async FIFOs going in the actual DRAM controller at the DFII level, or even connecting directly to the PHY. also, doing things that way is a much less disruptive patch.
the irony is that if the core were madly fast, Async on all the peripherals would be perfect. however the WB64to32, because it is specifically designed with a synchronous one-clock delay, serves a double role of "protecting" the core from peripheral combinatorial delay.
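a tiny sketch of that "protection" effect (the signal names are made up): a registered boundary means the peripheral-side combinatorial logic starts a fresh timing path, at the cost of one clock of latency.

    from nmigen import Elaboratable, Module, Signal

    class RegisteredBoundary(Elaboratable):
        """one register between core and peripheral: the core-side timing
           path ends here instead of continuing into peripheral logic"""
        def __init__(self, width=32):
            self.req_in = Signal(width)   # from core-side combinatorial logic
            self.req_out = Signal(width)  # into peripheral-side logic
        def elaborate(self, platform):
            m = Module()
            # synchronous one-clock delay: breaks the combinatorial path
            m.d.sync += self.req_out.eq(self.req_in)
            return m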
https://git.libre-soc.org/?p=ls2.git;a=blob;f=src/ecp5_crg.py;h=36c8f1dd4f5ecde53a30f8b949862abfdb828117;hb=2d7021ba09c3e08e1780bf73c26deeaccf689221

see dramsync at the bottom. the dramsync speed is supposed to be passed in as a parameter, dram_freq (not a new frequency and a new domain called "cpu"). the plan i had in mind was: if dram_freq is None then dramsync.eq(sync), making them the same; otherwise add a new pll.create_clkout call. i.e. not this:

    240         else:
    241             pll.create_clkout(ClockSignal("cpu"), self.core_clk_freq)

but this:

    240         else:
    241             pll.create_clkout(ClockSignal("dramsync"), self.dram_clk_freq)

remove cpu freq, remove the replacement of clk_freq with core_freq. less disruptive patch: add one parameter, "dram_freq".
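in other words, something like this (a sketch only: the function name is an assumption, pll.create_clkout being the helper from the quoted ecp5_crg.py):

    from nmigen import ClockSignal

    def create_dram_clock(m, pll, dram_freq):
        """one new optional parameter, dram_freq, defaulting to None"""
        if dram_freq is None:
            # no separate DRAM frequency requested: dramsync is just
            # an alias for sync, exactly as before
            m.d.comb += ClockSignal("dramsync").eq(ClockSignal("sync"))
        else:
            # separate DRAM clock domain at the requested frequency
            pll.create_clkout(ClockSignal("dramsync"), dram_freq)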
Gotcha. Bear in mind DRAM isn't the only peripheral that benefits from a clock boost -- for example Tercel has a maximum SPI speed of 1/2 the Wishbone bus clock, and the Flash devices we use on nearly all of the POWER products and many of the ECP5 boards can actually run at 100MHz SPI clock...
I still want us to stop thinking of the CPU and fabric all running at the same clock speed -- any real-world ASIC will not have that "feature"... I can move the arbiter after the downconverters, but I'll probably end up creating a couple of different peripheral clock domains because both Tercel and the DRAM controller can use the higher clock speeds for higher bandwidth.
(In reply to tpearson from comment #5)
> Gotcha.
>
> Bear in mind DRAM isn't the only peripheral that benefits from a clock boost
> -- for example Tercel has a maximum SPI speed of 1/2 the Wishbone bus clock,
> and the Flash devices we use on nearly all of the POWER products and many of
> the ECP5 boards can actually run at 100MHz SPI clock...

i know: the problem is that most wishbone peripherals are only WB 3 (classic) compliant. you *cannot* connect WB 3 (classic) to WB 4 (pipeline, with stall) without doing the "stall=stb&~ack" trick.

unfortunately (i checked), the alex forencich async-wb-bridge is *WB 3* compliant *only*. no, burst-mode is *not* the same as "WB4 pipeline" mode. neither is GRAM, or uart16550, or tercel, or in fact anything at all that i've been able to find!! not litex, not nmigen-soc (until i fixed it), nothing. they're *all* WB 3 (classic) compliant (or, in the case of nmigen-soc, broken until i fixed it).

paul designed the microwatt i/d-cache to be WB 4.0 compliant, pipelined. that means you either *have* to connect it downstream to something WB 4.0 compliant (pipelined, w/stall) or you have to "fake" it with the stall=stb&~ack trick. but, the other way round is *not* possible - a WB3 classic master can *not* be connected to a WB 4.0 pipeline slave. and unfortunately, the AsyncWBBridge is WB3-classic-master to WB3-classic-slave. this is the root cause of the problems that you're seeing.

the "solution" is to move the AsyncWBBridge to directly in front of the GRAM controller. it will need to have a wb.Interface Record that does contain a "stall" signal on the slave side, which is then "faked" just like the way that drambone.bus.stall is currently faked:

    521         # grrr, same problem with drambone: not WB4-pipe compliant
    522         comb += drambone.bus.stall.eq(drambone.bus.cyc & ~drambone.bus.ack)

that then needs to be REMOVED because now the AsyncWBBridge (which is WB 3 classic only) will be connected to *drambone* which is *also* WB 3 classic only.

if you think of it in terms of putting ASyncWBBridge into the actual gram controller itself (presenting the exact same interface), then the ls2.py code does not change (at all) except for the clock domains.

(In reply to tpearson from comment #6)
> I still want us to stop thinking of the CPU and fabric all running at the
> same clock speed -- any real-world ASIC will not have that "feature"...

i know. however that will require AsyncWBBridge to be upgraded to add "stall" signalling, because of the above: once something drops down to WB 3 classic you cannot go back "up" to WB 4 pipeline mode. to use AsyncWBBridge as it stands, it would be necessary to propagate the "stall=stb&~ack" trick right the way up to the I-Cache and D-Cache, instantly halving transfer speeds.

> I can move the arbiter after the downconverters, but I'll probably end up
> creating a couple of different peripheral clock domains because both Tercel
> and the DRAM controller can use the higher clock speeds for higher bandwidth.

tercel will need its own separate ASyncWBBridge for now.
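for illustration, a minimal nmigen sketch (a hypothetical class, using nmigen-soc's wishbone.Interface) of the "fake stall" trick that lets a WB4-pipeline master talk to a WB3-classic slave - at the cost of preventing the master from pipelining requests:

    from nmigen import Elaboratable, Module
    from nmigen_soc import wishbone

    class WB4ToWB3Shim(Elaboratable):
        """presents a WB4 (pipelined, w/stall) slave interface on behalf
           of a WB3 (classic) slave by faking stall, which limits the
           master to one outstanding transaction (halving transfer rate)"""
        def __init__(self, wb3_bus):
            self.wb3_bus = wb3_bus
            self.bus = wishbone.Interface(addr_width=wb3_bus.addr_width,
                                          data_width=wb3_bus.data_width,
                                          features={"stall"})
        def elaborate(self, platform):
            m = Module()
            comb = m.d.comb
            # pass everything straight through...
            comb += [self.wb3_bus.cyc.eq(self.bus.cyc),
                     self.wb3_bus.stb.eq(self.bus.stb),
                     self.wb3_bus.we.eq(self.bus.we),
                     self.wb3_bus.adr.eq(self.bus.adr),
                     self.wb3_bus.dat_w.eq(self.bus.dat_w),
                     self.bus.dat_r.eq(self.wb3_bus.dat_r),
                     self.bus.ack.eq(self.wb3_bus.ack)]
            # ...except stall, which WB3 classic does not have: fake it,
            # so a second stb is never issued before the first ack
            comb += self.bus.stall.eq(self.bus.cyc & ~self.bus.ack)
            return m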
https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=8d8b7e7e92c2728e484494fbd70b5500394b3f64

    #      |
    #   +---decoder----+--------+---------+-------+--------+
    #   |      |       |        |         |       |        |
    # uart   XICS    CSRs  AsyncBridge  XIP SPI HyperRAM EthMAC
    #                           |
    #                          DRAM
Bridges added in commits:

gram: 6f433acb
soc: 58393ccb
ls2: 9545d27f
(In reply to tpearson from comment #9)
> Bridges added in commits:
>
> gram: 6f433acb

needs to be reverted (force master push, this change should not have been made, DomainRenamer is already doing everything)

> soc: 58393ccb

all good

> ls2: 9545d27f

needs to be moved to a branch; restore master to the prior commit.
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to tpearson from comment #9)
> > Bridges added in commits:
> >
> > gram: 6f433acb
>
> needs to be reverted (force master push, this
> change should not have been made, DomainRenamer
> is already doing everything)

OK, done.

> > soc: 58393ccb
>
> all good
>
> > ls2: 9545d27f
>
> needs to be moved to branch, restore master
> to prior commit.

Done, moved to async branch.
rrright. i think i know what's going on here.

* the original crg has dramsync hard-coded to sync. this is nice and simple
* dramsync therefore, being an alias for sync, is misleading
* DRAM (DDR3) pads need the 4x IO spec (xdr=4) which is why you get
  platform.request(xdr={'dqs': 4})
* this causes the 4-phase clock thing to be REQUIRED - eclk, sclk - to be
  set up by the PLL, which in turn needs a 2x clock to be set up, i.e. an
  Instance("ECLKSYNCB") and an Instance("CLKDIVF")
* in the current ecp5_crg.py - totally undocumented and without any
  comments - this is done for the ***SYNC*** domain ***NOT*** the
  ***dramsync*** domain

to move DDR3 to its own synchronous domain, a function will be needed which calls pll.create_clkout, and does the:

* creation of an ECLKSYNCB,
* creation of a CLKDIVF,
* creation of a {%s}2x_unbuf,
* creation of a {%s}2x
* creation of a {%s} domain

and wires them all together in a non-hardwired way.

    def do_2x_phase_clock_create(m, pll, name, freq):
        # generate {name}2x (e.g. 200Mhz) and {name} (e.g. 100Mhz) from extclk
        cd_2x = ClockDomain("%s2x" % name, local=False)
        cd_2x_unbuf = ClockDomain("%s2x_unbuf" % name, local=False,
                                  reset_less=True)
        cd_ = ClockDomain("%s" % name, local=False)
        m.domains += [cd_2x_unbuf, cd_2x, cd_]
        # create the PLL clock at 2x the target frequency
        pll.create_clkout(ClockSignal("%s2x_unbuf" % name), 2*freq)
        m.submodules += Instance("ECLKSYNCB",
                                 i_ECLKI=ClockSignal("%s2x_unbuf" % name),
                                 i_STOP=0,
                                 o_ECLKO=ClockSignal("%s2x" % name))
        # generate {name} (1x) from {name}2x
        m.submodules += Instance("CLKDIVF",
                                 p_DIV="2.0",
                                 i_ALIGNWD=0,
                                 i_CLKI=ClockSignal("%s2x" % name),
                                 i_RST=0,
                                 o_CDIVX=ClockSignal("%s" % name))

and then call that **TWICE**

* once for "sync"
* once for "dramsync"

now, actually, bizarrely, if there are no other peripherals that are going to be using 4x xdr in the "sync" domain, i suspect it will *not* be necessary to call do_2x_phase_clock_create for "sync" but instead just do a straight, boring, 1x (just like the "init" domain).

i'll add that function and commit it, along with some comments.
like this:

https://git.libre-soc.org/?p=gram.git;a=blob;f=gram/simulation/crg.py;h=8f9a78be073c24776a01c301cfa308f025860531;hb=8f6a40bb418e346a78af5a83a5bc20f0ed538d57#l230

except done as a function, not hard-coded.

https://git.libre-soc.org/?p=ls2.git;a=blob;f=src/ecp5_crg.py;h=1740169befd5ffa2d8444a8ef59beaced7679eb6;hb=55b5acc9cbe8c67b4b64ab4ba145b93347020e04#l176
this:

https://git.libre-soc.org/?p=gram.git;a=commitdiff;h=dedba0951797896c4f6c44238ebffc7e4472f0f2

    --- a/gram/phy/ecp5ddrphy.py
    +++ b/gram/phy/ecp5ddrphy.py
    @@ -40,7 +40,7 @@ class ECP5DDRPHYInit(Elaboratable):
             lock = Signal()
             lock_d = Signal()
             m.submodules += Instance("DDRDLLA",
    -            i_CLK=ClockSignal("sync2x"),
    +            i_CLK=ClockSignal("dramsync2x"),
                 i_RST=ResetSignal("init"),
                 i_UDDCNTLN=~update,
                 i_FREEZE=freeze,

goes with a corresponding/matching this:

    drs = DomainRenamer({"sync": "dramsync", "sync2x": "dramsync2x"})

and this:

    # and a dram 2x sigh
    cd_dramsync2x = ClockDomain("dramsync2x", local=False)
    m.domains += cd_dramsync2x
    m.d.comb += ClockSignal("dramsync2x").eq(ClockSignal("sync2x"))

which should have bloody well been added right from the start. sigh. anyway, i've now added this:

https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=ea438cb108e627de4c1fd1bf96e273fc8659a2b9

    +        if self.dram_clk_freq is not None:
    +            self.phase2_domain(m, pll, "dramsync", self.dram_clk_freq)
    +        else:
    +            # alias dramsync and dramsync2x to sync and sync2x

which means that the ASyncBridge can be added with an optional parameter, in a way that doesn't break things.
no. it has to be done in the parent, because it is only the parent that knows if the WB4-pipe is connected to a WB3-classic. burying the WB3-to-4-botch inside the peripheral will only make using it harder, due to undocumented side-effects.

    diff --git a/src/soc/bus/wb_async.py b/src/soc/bus/wb_async.py
    index 60844e56055ea1e2402ee92fc9aa3754fbe41325..6775427fd300d4e2113122ba3c9e5354332458fc 100644
    --- a/src/soc/bus/wb_async.py
    +++ b/src/soc/bus/wb_async.py
    @@ -135,16 +135,6 @@ class WBAsyncBridge(Elaboratable):
                 i_wbs_rty_i=slave_rty
             )

    -        # Synthesize STALL signal for master port
    -        if hasattr(self.master_bus, "stall"):
    -            comb += self.master_bus.stall.eq(self.master_bus.cyc & ~self.master_bus.ack)
    -
    -        # Convert incoming slave STALL signal to a format that the async bridge understands...
    -        if hasattr(self.slave_bus, "stall"):
    -            comb += slave_ack.eq(self.slave_bus.ack & ~self.slave_bus.stall)
    -        else:
    -            comb += slave_ack.eq(self.slave_bus.ack)
    -
             # Wire unused signals to 0
             comb += slave_err.eq(0)
             comb += slave_rty.eq(0)
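to illustrate "done in the parent" (a sketch only; the names are taken from the quoted wb_async.py and ls2.py, and the hookup of the remaining bus signals is elided):

    # the parent instantiates the bridge, then - because *it* knows the
    # master side faces a WB4-pipe bus and the slave side faces a
    # WB3-classic peripheral - applies the stall fake itself, outside
    # the now-neutral WBAsyncBridge:
    bridge = WBAsyncBridge(master_features={'stall'})
    m.submodules.bridge = bridge
    comb += bridge.master_bus.stall.eq(bridge.master_bus.cyc &
                                       ~bridge.master_bus.ack)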
not quite.

https://git.libre-soc.org/?p=ls2.git;a=commitdiff;h=b586cc83eea22270d7914e7315565118c18fbf75

there are 4 CLKDIVFs available and 10 ECLKSYNCBs, so it is possible to have a separate spi clock domain from the dram domain from the hyperram domain:

    Info:     CLKDIVF:     1/    4    25%
    Info:   ECLKSYNCB:     1/   10    10%

i'd prefer they were done one at a time (there is a project "minimum patch" rule), let's focus on DRAM first: dram_clk_freq. also please keep to under 80 chars:

    +        self.ddrphy = drs(ECP5DDRPHY(ddr_pins, sys_clk_freq=memory_clk_freq))

ah... yyeah, that's ok, for now. yes, annoyingly:

    -        drambone = gramWishbone(dramcore, features={'stall'})
    +        drambone = gramWishbone(dramcore)

needing two Asyncs here is annoying. removing the features={'stall'} is the right thing to do, it's the WBAsyncBridge that needs it. yes:

    +        self.drambone_async_br = WBAsyncBridge(
    +                                     master_features={'stall'})

except stall=cyc&~ack has to be applied (i took it out of WBAsyncBridge because it will quickly become hell: unintended consequences)

    @@ -470,6 +557,7 @@ class DDR3SoC(SoC, Elaboratable):
             self.memory_map = self._decoder.bus.memory_map
             self.clk_freq = clk_freq
    +        self.memory_clk_freq = memory_clk_freq
             self.fpga = fpga

dram_clk_freq. then, later, spi_clk_freq. if they happen to be made the same they can probably be aliased. TODO on that one.

[EDIT: no, i have to get on with the linux bring-up. y'know what? i'll pull this over. there's great bits of code here in easy-to-cut-paste chunks]
(In reply to Luke Kenneth Casson Leighton from comment #16)
> dram_clk_freq. then, later, spi_clk_freq. if they happen to be
> made the same they can probably be aliased. TODO on that one.
>
> [EDIT: no, i have to get on with the linux bring-up
> y'know what? i'll pull this over. there's great bits of code here
> in easy-to-cut-paste chunks]

OK, so as I understand things you'll pull the bits you want for now, and when that's done I'll go back and verify the overall clock trees etc. for DRAM?

We do need to be careful here with clock domains; there are a fixed number of clock routing resources (device dependent) and it's very easy to get oneself into an unnecessary corner, halting future development, by consuming too many of them early on...

There's one rather obvious option that hasn't been applied here either -- what about simply parameterizing the DRAM clock names at module instantiation? That's basically what I am doing for the WBAsyncBridge clock domain connections.
So, backing up to try to de-muddy the waters a bit...

The ECP5 DDR3 controllers require very specific -- and *board-specific* -- clock and reset wiring between various hard IP blocks to function. If this wiring is wrong, *best case* you'll get an unroutable design; worst case you'll get a conditionally stable design that will never work properly for *every* boot, but will work for some boots.

As an example, here's the Arctic Tern clock wiring. It's even more complex, and yes, it's absolutely mandatory to have it wired that way for *that specific DRAM configuration* on the board (the Lattice VIP board requires similar clock trees):

https://github.com/litex-hub/litex-boards/blob/5aab6f01b6d2044cb698e936bff11491f0d52584/litex_boards/targets/rcs_arctic_tern_bmc_card.py#L68

Normally, since the number of global clock routes is very small (IIRC 8 or less on ECP5), the design would use the slowest clock (a.k.a. "dramsync") that the board-specific wiring of the DDR3 hard IP generates, thus saving a single global clock route and avoiding async bridges entirely.

However, I think the fact that we have an async bridge available affords us the opportunity to move the DDR3 clock generator into its own board-specific module that is then wired up to the DDR3 controller, similar to how specific RAM devices are also "parameterized" via modules. The caveat is that once that's done *there is no going back* -- *everything* talking to DDR3 must always go through the bridge, since it's purposefully been moved off the main synchronous clock domain.

I can do the changes, but need to know if the above makes sense and you're OK with the general design.
(In reply to tpearson from comment #17)
> OK, so as I understand things you'll pull the bits you want for now, and
> when that's done I'll go back and verify the overall clock trees etc. for
> DRAM?

no, i said no: [EDIT: no, i have to get on with the linux bring-up] as in: i was going to, then decided against it.

> There's one rather obvious option that hasn't been applied here either --
> what about simply parameterizing the DRAM clock names at module
> instantiation?

this is precisely what DomainRenamer is designed exactly not to need you to have to do. everything can be added to one domain (e.g. sync) followed by, at instantiation time, the *renaming* of that domain to another domain, as the *instantiator* sees fit. this is why we have this:

    drs = DomainRenamer({'sync': 'dramsync'})
    self.ddrphy = drs(ECP5DDRPHY(ddr_pins, sys_clk_freq=clk_freq))

what that is doing is taking absolutely everything that is "m.d.sync +=" and dropping it instead into "m.d.dramsync +=". and the only reason everything works at the moment is because, back in ECP5CRG, this is done:

    m.d.comb += ClockSignal("dramsync").eq(ClockSignal("sync"))

DomainRenamer() is what makes a global/search/replace of "m.d.sync" with "m.d.memory" (or anything else) completely redundant. it can be done, instead, with:

    drs = DomainRenamer({"sync": "memory"})

> That's basically what I am doing for the WBAsyncBridge clock
> domain connections.

that's different, that's at the level that's controlling the renames.
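a self-contained example of the mechanism (the module names here are made up; DomainRenamer itself is nmigen's):

    from nmigen import Elaboratable, Module, Signal, ClockDomain, DomainRenamer

    class Counter(Elaboratable):
        def __init__(self):
            self.count = Signal(8)
        def elaborate(self, platform):
            m = Module()
            m.d.sync += self.count.eq(self.count + 1)  # written as "sync"
            return m

    class Top(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            m.domains += ClockDomain("dramsync")
            # the *instantiator* decides the real domain: every m.d.sync
            # inside Counter becomes m.d.dramsync, Counter is untouched
            drs = DomainRenamer({"sync": "dramsync"})
            m.submodules.ctr = drs(Counter())
            return m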
(In reply to tpearson from comment #17)
> There's one rather obvious option that hasn't been applied here either --
> what about simply parameterizing the DRAM clock names at module
> instantiation?

can you explain (with an illustration) what you mean by "parameterising the DRAM clock names"?
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to tpearson from comment #17)
> >
> > There's one rather obvious option that hasn't been applied here either --
> > what about simply parameterizing the DRAM clock names at module
> > instantiation?
>
> can you explain (with an illustration) what you mean by "parameterising
> the DRAM clock names"?

On deeper investigation, it looks like the DomainRenamer may do what we need here.

However, the gram tree right now contains a bunch of instances of "dramsync". From where I sit we need to change those back to "sync" so that the renamer is consistent; it's just plain confusing at the moment, and I think this was the origin of some of my misconceptions...
> However, the gram tree right now contains a bunch of instances of
> "dramsync". From where I sit we need to change those back to "sync" so that
> the renamer is consistent, it's just plain confusing at the moment and I
> think this was the origin of some of my misconceptions...

EDIT: looks like dedba095 just made them at least internally consistent. If we're good with those names, I'll add the renamer to get the DRAM on the memory clock domain and update the async branch...
(In reply to tpearson from comment #21)
> On deeper investigation, it looks like the DomainRenamer may do what we need
> here.

wheww :)

> However, the gram tree right now contains a bunch of instances of
> "dramsync". From where I sit we need to change those back to "sync" so that
> the renamer is consistent, it's just plain confusing at the moment and I
> think this was the origin of some of my misconceptions...

well... ultimately, they should in fact be different, because when the ASync-ness is moved down into GRAM *itself*, it would be *GRAM* that... ah yes, and it would also use DomainRenamer :) wark-wark, let me try that again.

there are 3 components in GRAM:

* grambone (WB-to-DFII)
* gramcore (DFII-to-PHY)
* gramphy (PHY-to-IOpads of the ECP5)

we're going to put *all* of those into a 100 mhz clock domain called dramsync, with a 2x domain to run the IOPads on 4-phase (xdr=4). then, ASyncWBBridge will front the cross-over from 40mhz (whatever) in the "sync" domain to 100mhz (whatever) of GRAM.

what *actually* needs to happen is:

* grambone runs in the *nest* clock domain (aka the cpu)
* gramcore needs to have its *OWN* ASync FIFOs (not wishbone, at all).
  these FIFOs are at the *DFI command* level (see the rough sketch below)
* gramphy stays as it is (100 mhz, dramsync)

something like that. it's not straightforward, at all, but it basically reduces the amount of crap that has to be run in GRAM at the highest clock frequency. if the only thing running in GRAM at the higher clock frequency is the gramphy, and that's connected directly to Async FIFOs, i see no reason why attempts should not be made to run the PHY at 150-200 mhz. but i am getting way ahead. right now the primary task is to decouple the "entirety-of-GRAM" to get it, as a single job-lot, to run at 100 mhz.
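a rough sketch of the eventual DFI-level idea (the field widths and layout here are pure assumptions for illustration, not gram's actual DFI records):

    from nmigen import Elaboratable, Module, Record
    from nmigen.lib.fifo import AsyncFIFO

    # an illustrative DFI-ish command: address/bank plus control strobes
    dfi_cmd_layout = [("address", 14), ("bank", 3),
                      ("cas_n", 1), ("ras_n", 1), ("we_n", 1), ("cs_n", 1)]

    class DFICmdCrossing(Elaboratable):
        """async FIFO at the DFI command level: gramcore writes in the
           nest ("sync") domain, gramphy reads in "dramsync"."""
        def __init__(self):
            self.cmd_in = Record(dfi_cmd_layout)   # from gramcore (nest)
            self.cmd_out = Record(dfi_cmd_layout)  # to gramphy (dramsync)
        def elaborate(self, platform):
            m = Module()
            fifo = AsyncFIFO(width=len(self.cmd_in), depth=4,
                             w_domain="sync", r_domain="dramsync")
            m.submodules.fifo = fifo
            # valid/ready (w_en/r_en) handshaking elided for brevity
            m.d.comb += [fifo.w_data.eq(self.cmd_in),
                         self.cmd_out.eq(fifo.r_data)]
            return m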
preliminary simulation using runsimsoc2.sh (iverilog) shows successful communication on at least one of the WBAsyncBridges. it may even be possible to push the DDR3 Controller to 200 mhz, which is apparently the limit for the ECP5, but it will be fun finding out.