Bug 199 - Layout using coriolis2 main core, 180nm
Summary: Layout using coriolis2 main core, 180nm
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Hardware Layout
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jean-Paul Chaput
URL:
Depends on: 528 200 502 506 507 508 521 526 620
Blocks: 138 204
Reported: 2020-03-02 16:57 GMT by Luke Kenneth Casson Leighton
Modified: 2022-09-01 20:10 BST

See Also:
NLnet milestone: NLNet.2019.02.029.Coriolis2
total budget (EUR) for completion of task and all subtasks: 9000
budget (EUR) for this task, excluding subtasks' budget: 650
parent task for budget allocation: 138
child tasks for budget allocation: 490 502 506 507 508 521 620
The table of payments (in EUR) for this task; TOML format:
red = { amount = 650, submitted = 2022-08-26, paid = 2022-08-31 }


Attachments
Patch to create experiments9 (140.89 KB, application/x-bzip)
2020-06-30 09:22 BST, Jean-Paul Chaput
Makefile for nmutil with install dir (298 bytes, text/plain)
2020-06-30 12:35 BST, Jean-Paul Chaput
staggered pipeline quick drawing (153.28 KB, image/jpeg)
2020-08-02 22:50 BST, Luke Kenneth Casson Leighton
zoom-in on dense areas of routing (78.01 KB, image/png)
2020-09-21 11:08 BST, Luke Kenneth Casson Leighton
how jtag is organised (673.58 KB, image/png)
2020-09-30 17:43 BST, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2020-03-02 16:57:18 GMT
do layout for single-core 180nm ASIC including 1st level cache.
also peripherals: minimum priority is SDRAM 32 bit, 16550 UART, JTAG and SPI. secondary priorities are 64 bit SDRAM, GPIO, PWM, EINT, QSPI, SDMMC, RGBTTL, I2C and the pinmux.
package is to be QFP, maximum around 200 pins only including power and ground.

https://libre-soc.org/3d_gpu/layouts/coriolis2_180nm/
Comment 2 Luke Kenneth Casson Leighton 2020-06-25 16:28:25 BST
here's the diagram and page containing notes:
https://libre-soc.org/3d_gpu/layouts/coriolis2_180nm/

the first thing to note: there are quite a lot of Register Files
and there are quite a lot of Function Units.  therefore, as
there are quite a lot of unique Register File Ports, there are
also quite a lot of PriorityPickers (exactly one PP for *each* port).

i would recommend that every Function Unit's inputs and outputs
be on the same "side", because those inputs and outputs ultimately
have to go to the Regfiles, which is a single location.

the only exception to this is the LDSTCompUnit, which has the
additional connectivity to L0CacheBuffer, through which access
to Memory is attained.

LDSTCompUnit can have the memory access on the opposite side of
the registers.
Comment 3 Luke Kenneth Casson Leighton 2020-06-25 17:32:57 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)

> i would recommend that every Function Unit's inputs and outputs
> be on the same "side", because those inputs and outputs ultimately
> have to go to the Regfiles, which is a single location.

 
so, the data which goes through the pipeline will "loop back".

when this is expanded and there are 15 to 28 Function Units, i would recommend having very "narrow" Function Units that have half of their pipeline stages going one direction, turn round, then half of them come back again.

when we include DIV and MUL this will almost certainly need to be done.  these will be very big Function Units as large as all other Function Units combined.

at that point a kind of "tree" will be needed that fans out the "thin" Function Units like branches, all of them leading back to the point where the PriorityPickers are.

the Regfile Broadcast Buses will be quite large (a lot of wires back and forth).  i do not yet have a handle on the relative sizes, here.

however from the internals, we have INTRegs which is 3R2W and whilst the data is 64 bit the "address" remember is in *unary* so is 32 bits, one bit for each register.

therefore there are 96 x 5 wires going in and out of the INT regfile.

* for the RA Read Port the 64 bit data fans out to i think 5 Function Units.
* likewise for RB
* for the RT Write Port i think it is 3 fan-in

basically these relationships between Regfiles and Function Units are multiple fan-in and multiple fan-out each being Broadcast Buses, each Bus being managed by a PriorityPicker.
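
As a rough check of the figures in this comment, the wire count can be recomputed. This is only a sketch: the constant and function names are made up for illustration, and it assumes each of the five ports carries the full 64-bit data plus a 32-bit unary (one-hot) register select.

```python
# Rough wire count for the INT regfile described above (3R2W).
# All names here are illustrative, not taken from the Libre-SOC source.
DATA_WIDTH = 64    # bits of data per port
NUM_REGS = 32      # unary "address": one select bit per register
READ_PORTS = 3
WRITE_PORTS = 2


def unary_select(regnum, num_regs=NUM_REGS):
    """Return the one-hot (unary) select word for a register number."""
    if not 0 <= regnum < num_regs:
        raise ValueError("register number out of range")
    return 1 << regnum


# data plus unary select, per port
wires_per_port = DATA_WIDTH + NUM_REGS
# all read and write ports together
total_wires = wires_per_port * (READ_PORTS + WRITE_PORTS)

print(wires_per_port, total_wires)
```

The unary encoding is what makes the "address" as wide as the number of registers: selecting register 3 sets bit 3 only, rather than sending the binary value 3 on five wires.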
Comment 4 Jean-Paul Chaput 2020-06-26 16:41:56 BST
Coriolis commit b48f9b4 fixes:

* soclayout/experiments6, we can generate the fpmul64 example without
  using Yosys flatten, the vst should now be correct. The synthesis
  gives ~21K gates and the P&R takes a little above 3 minutes.
  So perfectly manageable.

* soclayout/experiment9, invalid syntax in port map (no right hand
  signal...).

Additional commit f3dd4bc fixes:

* Incomplete hierarchical save (Cumulus rsave plugin).

About the size of test_issuer:

It appears that most of the cells are in the "mem" module, that is
93% of them (for a total of 844321). It seems wrong to me.
IMHO, two possibilities here:

1. The real complexity of the "mem" module was drastically
   underestimated.

2. The way the nMigen code of "mem" is written tricks Yosys into
   doing very unoptimized things.

Note that with such an imbalance in the size of the modules / FU,
a realistic placement makes little sense.

Anyway, I strongly suggest a review of that module to, at least,
understand and justify such a size.
Comment 5 Luke Kenneth Casson Leighton 2020-06-26 16:49:18 BST
(In reply to Jean-Paul.Chaput from comment #4)
> Coriolis commit b48f9b4 fixes:
> 
> * soclayout/experiments6, we can generate the fpmul64 example without
>   using Yosys flatten, the vst should now be correct. The synthesis
>   gives ~21K gates and the P&R takes a little above 3 minutes.
>   So perfectly manageable.

fantastic.

> 
> * soclayout/experiment9, invalid syntax in port map (no right hand
>   signal...).

hmmm... will take a look later, am in the middle of sorting out other
memory-bus stuff.
 
> Additional commit f3dd4bc fixes:
> 
> * Incomplete hierarchical save (Cumulus rsave plugin).
> 
> About the size of test_issuer:
> 
> It appears that most of the cells are in the "mem" module, that is
> 93% of them (for a total of 844321). It seems wrong to me.

yep, it is initialised with 262144 bits.  that means that somewhere the "mem" instance is being passed an address range of (1<<18) where it should only be around 1<<6 for these purposes.

i'll take a look now.

l.
Comment 6 Luke Kenneth Casson Leighton 2020-06-26 17:21:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)

> yep, it is initialised with 262144 bits.  that means that somewhere the
> "mem" instance is being passed an address range of (1<<18) where it should
> only be around 1<<6 for these purposes.
> 
> i'll take a look now.

ok that's down to a more sane (hard-coded) 32 entries so the initialisation
is still large (32*64 bits) however it's not 2^18 bits.

btw you'll need to git pull on nmutil as well as soc, there's some modifications
to the RecordObject class (which will give different names to some signals,
however those names now include the parent object name)
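
The sizes mentioned in comments 5 and 6 can be compared with a little arithmetic. This is just a sanity-check sketch; the variable names are illustrative and nothing here is taken from the actual "mem" module.

```python
# Sizes quoted in the discussion of the "mem" module.
oversized = 1 << 18        # what the instance was accidentally given
intended = 1 << 6          # roughly what was wanted for these purposes
hardcoded_bits = 32 * 64   # 32 entries of 64 bits after the fix

print(oversized)                     # the "262144" quoted above
print(intended)
print(hardcoded_bits)
print(oversized // hardcoded_bits)   # how much smaller the fixed memory is
```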
Comment 7 Jean-Paul Chaput 2020-06-30 09:22:04 BST
Created attachment 71 [details]
Patch to create experiments9

As port 922 is filtered by my ISP while I am on vacation, I am providing the
commit directly as a patch here.

This is very preliminary work. The router does not complete yet, it reaches
only 99.9%. The coriolis2/settings.py contains some parameter variations
to tweak the router and compare the different outcomes. I will use it to
find geometric cases where the router makes bad decisions and correct them.

A note about nMigen and a potential annoying bug at installation. nMigen
is listed as a dependency so setup tools will try to install it.
But if it is not installed *prior* to the Libre-SOC repositories, the
m-lab version will be pulled from the Python archives. Which is not
what we need. And if you don't install in system directories, like I
do, you even get two versions... One good and one bad...
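
One way to see which copy of a package Python would actually pick up is to ask the import system where it finds the module. The sketch below uses only the standard library; the helper name `module_origin` is made up for illustration.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be imported from, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # a missing parent package or a malformed module is treated as absent
        return None
    return None if spec is None else spec.origin


# Shows the path of whichever nmigen comes first on sys.path (None if absent).
print(module_origin("nmigen"))
```

If the printed path points into a pip-managed site-packages directory rather than the git checkout, the wrong copy is probably shadowing the intended one.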
Comment 8 Luke Kenneth Casson Leighton 2020-06-30 10:09:54 BST
(In reply to Jean-Paul.Chaput from comment #7)
> Created attachment 71 [details]
> Patch to create experiments9

got it.

> As port 922 is filtered by my ISP while I am on vacation, I am providing the
> commit directly as a patch here.

applied and pushed, thank you jean-paul
 
> This is very preliminary work. The router does not complete yet, it reaches
> only 99.9%. The coriolis2/settings.py contains some parameter variations
> to tweak the router and compare the different outcomes. I will use it to
> find geometric cases where the router makes bad decisions and correct them.

i will run it later and see how it looks.  i am fascinated to see how far
it gets.

> 
> A note about nMigen and a potential annoying bug at installation. nMigen
> is listed as a dependency so setup tools will try to install it.
> But if it is not installed *prior* to the Libre-SOC repositories, the
> m-lab version will be pulled from the Python archives.

yes.  this is normal because of the reliance on pip3 (via setuptools)
one solution: we remove all dependencies and expect people to install them 
manually (by way of a script / Makefile).
Comment 9 Jean-Paul Chaput 2020-06-30 11:56:53 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)

> i will run it later and see how it looks.  i am fascinated to see how far
> it gets.

  Far. It's almost OK. I did it with only 5% of space margin and
  without using METAL6. And, if you run in graphic mode, it's
  fascinating how the placer seems to rediscover the blocks...
  I will add a picture as an attachment.


> > A note about nMigen and a potential annoying bug at installation. nMigen
> > is listed as a dependency so setup tools will try to install it.
> > But if it is not installed *prior* to the Libre-SOC repositories, the
> > m-lab version will be pulled from the Python archives.
> 
> yes.  this is normal because of the reliance on pip3 (via setuptools)
> one solution: we remove all dependencies and expect people to install them 
> manually (by way of a script / Makefile).

  I just wanted to hint at this potential problem so people don't
  lose too much time the next time they encounter it. I slightly modified
  the top Makefile so I can install in a non-system directory,
  that is, the Coriolis install tree. This way I keep a system-like
  directory tree but requiring only user permission.
    As a system administrator, I'm very very reluctant to directly
  install things in the system tree as root. Because after some
  time you lose track of what has been installed or not.
  So only packaged things (rpm, deb) should go there as the packager
  keeps track for you.
Comment 10 Luke Kenneth Casson Leighton 2020-06-30 12:01:05 BST
(In reply to Jean-Paul.Chaput from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> 
> > i will run it later and see how it looks.  i am fascinated to see how far
> > it gets.
> 
>   Far. It's almost OK. I did it with only 5% of space margin and
>   without using METAL6. And, if you run in graphic mode, it's
>   fascinating how the placer seems to rediscover the blocks...
>   I will add a picture as an attachment.

missed :)
 
> > > A note about nMigen and a potential annoying bug at installation. nMigen
> > > is listed as a dependency so setup tools will try to install it.
> > > But if it is not installed *prior* to the Libre-SOC repositories, the
> > > m-lab version will be pulled from the Python archives.
> > 
> > yes.  this is normal because of the reliance on pip3 (via setuptools)
> > one solution: we remove all dependencies and expect people to install them 
> > manually (by way of a script / Makefile).
> 
>   I just wanted to hint at this potential problem so people don't
>   lose too much time the next time they encounter it. I slightly modified
>   the top Makefile so I can install in a non-system directory,

can you send me that so i can take a look?

btw i tried fpmul64 - experiment6 - and yosys is locked up solid 100% CPU
indefinitely in "clean".  which is particularly odd given that it worked
perfectly well last time i tried it.  which admittedly was with "flatten".
Comment 11 Jean-Paul Chaput 2020-06-30 12:31:49 BST
(In reply to Jean-Paul.Chaput from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> 
> > i will run it later and see how it looks.  i am fascinated to see how far
> > it gets.
> 
>   Far. It's almost OK. I did it with only 5% of space margin and
>   without using METAL6. And, if you run in graphic mode, it's
>   fascinating how the placer seems to rediscover the blocks...
>   I will add a picture as an attachment.

    Zut! pdf file is too big (1.1M).
Comment 12 Jean-Paul Chaput 2020-06-30 12:35:58 BST
Created attachment 72 [details]
Makefile for nmutil with install dir

Very basic patch to set where to install. The install dir may be guessed in a much smarter way...
Comment 13 Luke Kenneth Casson Leighton 2020-06-30 13:55:55 BST
(In reply to Jean-Paul.Chaput from comment #12)
> Created attachment 72 [details]
> Makefile for nmutil with install dir
> 
> Very basic patch to set where to install. The install dir may be guessed in
> a much smarter way...

oh ok i get it.  just python3 setup.py develop --install-dir={somewhere}.
ok that makes sense.

i _believe_ this may be what virtualenv does in a transparent way
(except not everyone loves virtualenv)
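
For reference, the command described in this comment could be assembled as below. This is a sketch only: the helper names are invented, nothing is executed, and it assumes (as setuptools generally expects for such installs) that the target directory also has to be on PYTHONPATH.

```python
import os
import sys


def develop_command(install_dir):
    """Build the argv for an in-place ("develop") install into install_dir."""
    return [sys.executable, "setup.py", "develop",
            "--install-dir={}".format(install_dir)]


def develop_env(install_dir, base_env=None):
    """Copy the environment, putting install_dir first on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    old = env.get("PYTHONPATH")
    env["PYTHONPATH"] = install_dir if not old else install_dir + os.pathsep + old
    return env


# {somewhere} is left as a placeholder, exactly as in the comment above.
print(" ".join(develop_command("{somewhere}")))
```

The two results could then be handed to `subprocess.run(cmd, env=env, cwd=...)` from within each repository checkout.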
Comment 14 Jean-Paul Chaput 2020-06-30 14:04:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> i _believe_ this may be what virtualenv does in a transparent way
> (except not everyone loves virtualenv)

  The less layers of install tools, the better (so I can patch them
  more easily) (IMHO).
Comment 15 Luke Kenneth Casson Leighton 2020-07-16 01:25:37 BST
jean-paul, i see you're back from holiday in the rainy lovely beaches.

i have pushed a couple of updates to test_issuer.il one of which added (then removed) the div unit.  i also, back in issuer.py, provided an option in the code to add pipeline types, so mul can be added etc. by changing one line.

you will need to git pull all soc repositories however *do not* update nmigen right now as there are issues outstanding with it.

i would if there is time very much like to do at least a top level hierarchical layout, regardless but also because there will be space unused.

the reason is that when it comes to showing people the layout, it is possible to point and say, "this is the Logical pipeline" and so on.

to help with that, i would like to be able to set the width but not height or height but not width when doing the area calculation.

what can then be done is:

* run all pipeline layouts with the exact same height (large height)

* get a series of varied widths back for each pipeline (some of them will be very thin, some like MUL will be fat)

* lay them out in a row

* have the regfiles below them, placed optimally closest to the pipelines that need them

based on the widths of the pipelines and the widths of the regfiles it may even be practical to use an algorithm that works out the shortest paths, in 1D.
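
The fixed-height idea above can be sketched as follows. The block names and area numbers are purely hypothetical, and the width estimate (area divided by height, rounded up) is an assumption for illustration, not how Coriolis computes its estimate.

```python
import math


def width_for(area, height):
    """Smallest integer width that gives at least `area` at a fixed height."""
    return math.ceil(area / height)


def place_in_row(areas, height):
    """Lay blocks out left to right; return {name: (x_offset, width)} and total width."""
    x = 0
    placement = {}
    for name, area in areas.items():
        width = width_for(area, height)
        placement[name] = (x, width)
        x += width
    return placement, x


# Hypothetical pipeline areas (arbitrary units), all forced to the same height.
areas = {"alu": 4000, "logical": 2500, "mul": 20000}
placement, total_width = place_in_row(areas, height=100)

for name, (x, width) in placement.items():
    print(name, x, width)
print("row width:", total_width)
```

With a common height, adding a function to one pipeline only changes that pipeline's width and shifts its neighbours along the row; the heights, and so the row structure, stay the same.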

what's your thoughts, is this reasonable?

then also this would help identify the areas which are not routing, because it is less gates.  also it would speed up layout time.
Comment 16 Jean-Paul Chaput 2020-07-16 13:20:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)
> jean-paul, i see you're back from holiday in the rainy lovely beaches.

  Sadly, yes.

> i have pushed a couple of updates to test_issuer.il one of which added (then
> removed) the div unit.  i also, back in issuer.py, provided an option in the
> code to add pipeline types, so mul can be added etc. by changing one line.
> 
> you will need to git pull all soc repositories however *do not* update
> nmigen right now as there are issues outstanding with it.

  Maybe too late, I just did it a couple of days ago. But I can easily
  roll back if you give me a commit hash to stick to.

> i would if there is time very much like to do at least a top level
> hierarchical layout, regardless but also because there will be space unused.
> 
> the reason is that when it comes to showing people the layout, it is
> possible to point and say, "this is the Logical pipeline" and so on.

  I understand very well, it makes the layout much easier to comment on,
  but it is maybe not the most efficient.

> to help with that, i would like to be able to set the width but not height
> or height but not width when doing the area calculation.
> 
> what can then be done is:
> 
> * run all pipeline layouts with the exact same height (large height)
> 
> * get a series of varied widths back for each pipeline (some of them will be
> very thin, some like MUL will be fat)
> 
> * lay them out in a row
> 
> * have the regfiles below them, placed optimally closest to the pipelines
> that need them
> 
> based on the widths of the pipelines and the widths of the regfiles it may
> even be practical to use an algorithm that works out the shortest paths, in
> 1D.

  Making blocks with fixed height or width is easy. The problem lies in
  the top assembly. In ASIC terminology, it's the floorplan (placement of
  the top level blocks). Coriolis has no real support for that yet.

  To achieve that quickly we may try to create blocks that are directly
  connectable side by side. Meaning that the connectors are exactly at
  the same position & layer on each side of both blocks. This is ok for
  2-pin nets, but if there are more, we have to route a net through a
  block (it can be done with minimum fuss also).

  Normally we should use a block & channel routing (routing space between
  the blocks).

> what's your thoughts, is this reasonable?

  I will try a first flat run to get a feel about runtime and memory size.
  Then I will see if we must break it. Note that, the ASIC IBM benchmarks
  supplied for the ISPD contest are completely flat (no block whatsoever,
  up to 1 million gates).

> then also this would help identify the areas which are not routing, because
> it is less gates.  also it would speed up layout time.

  Breaking the design into smaller blocks would certainly reduce the P&R time
  and help solve problems one block at a time. As Staf put it some time ago,
  for big ASICs, the maximum run time should be "one night".
Comment 17 Luke Kenneth Casson Leighton 2020-07-16 16:44:54 BST
(In reply to Jean-Paul.Chaput from comment #16)
> (In reply to Luke Kenneth Casson Leighton from comment #15)

>   I understand very well, it makes the layout much easier to comment on,
>   but it is maybe not the most efficient.

not a problem, this is a test chip.



> > based on the widths of the pipelines and the widths of the regfiles it may
> > even be practical to use an algorithm that works out the shortest paths, in
> > 1D.
> 
>   Making blocks with fixed height or width is easy. 

> The problem lies in
>   the top assembly. In ASIC terminology, it's the floorplan (placement of
>   the top level blocks). Coriolis has no real support for that yet.

that's ok.  the alu16 example showed how to do it.

what i do not want, is, to have different width and height pipelines, which if we add even just one new function to one pipeline the entire layout must be redone.

the current system, you call a function and it tells you the width *and* height estimate needed to route that block, and they are squares (appx).

i would like the estimate system to be able to set a fixed height, and it to tell me the width.

then the placement of all pipelines can be lined up.

the inputs and outputs will all be on one side (SOUTH) to connect to register files.


>   To achieve that quickly we may try to create blocks that are directly
>   connectable side by side.

ok so the pipelines except for LDST have *ZERO* connectivity to anything other than the register file Buses.

this is a VERY deliberate hard rule that has been set.

there is NO interconnection between pipelines.

therefore the floorplan is:

* pipelines in a row at the top
* Register Buses and "Priority Pickers" in the middle
* Register Files at the bottom
* Decoder to the side, connected to the Fast Regfile (to get the Program Counter).

so it is very regularly organised.

therefore for most Regfiles, all the ports can be NORTH.

> Meaning that the connectors are exactly at
>   the same position & layer on each side of both blocks. This is ok for
>   2-pin nets, but if there are more, we have to route a net through a
>   block (it can be done with minimum fuss also).

the only one which might need pins on different sides is FAST Regs because it contains the Program Counter, the MSR (sets 64 bit mode, User mode, LE/BE etc).

oh, and LDST of course, the memory connection comes out NORTH but registers are SOUTH.

everything else is extremely regular.
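
The floorplan rules in this comment can be written down as a small data structure with a consistency check. This is a sketch only: the block names, the `Block` class and the `misplaced_ports` function are all invented for illustration and are not part of Coriolis or soclayout.

```python
from dataclasses import dataclass, field

NORTH, SOUTH = "NORTH", "SOUTH"


@dataclass
class Block:
    name: str
    row: str                                    # "pipelines" (top) or "regfiles" (bottom)
    ports: dict = field(default_factory=dict)   # connection name -> side of the block


# Hypothetical subset of blocks, following the rules stated above.
floorplan = [
    Block("alu", "pipelines", {"regbus": SOUTH}),
    Block("logical", "pipelines", {"regbus": SOUTH}),
    # LDST is the exception: memory comes out NORTH, registers stay SOUTH.
    Block("ldst", "pipelines", {"regbus": SOUTH, "memory": NORTH}),
    Block("int_regfile", "regfiles", {"regbus": NORTH}),
]


def misplaced_ports(blocks):
    """Return names of blocks whose register-bus ports do not face the bus row."""
    bad = []
    for block in blocks:
        expected = SOUTH if block.row == "pipelines" else NORTH
        if block.ports.get("regbus") != expected:
            bad.append(block.name)
    return bad


print(misplaced_ports(floorplan))
```

Because pipelines never connect to each other, the only thing such a check needs to know about is the side on which each block meets the Register Buses in the middle row.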

>   Normally we should use a block & channel routing (routing space between
>   the blocks).

nice.
 
> > what's your thoughts, is this reasonable?
> 
>   I will try a first flat run to get a feel about runtime and memory size.
>   Then I will see if we must break it. Note that, the ASIC IBM benchmarks
>   supplied for the ISPD contest are completely flat (no block whatsoever,
>   up to 1 million gates).

mad :)

well given that there are things to fix i would prefer that you are able to run and debug on a fast loop.
Comment 18 Jean-Paul Chaput 2020-07-21 12:50:56 BST
Hello Luke,

I'm starting to work on:
    soc/src/soc/simple/test/test_issuer.py

And, of course, I get problems. First, I updated all the libre-soc git
repositories and re-installed them. Then I got into various errors
with nMigen. The latest being:

Traceback (most recent call last):
  File "./test_issuer.py", line 21, in <module>
    from soc.simple.test.test_core import (setup_regs, check_regs,
  File ".../soc/src/soc/simple/test/test_core.py", line 28, in <module>
    from soc.fu.alu.test.test_pipe_caller import ALUTestCase
  File ".../soc/src/soc/fu/alu/test/test_pipe_caller.py", line 5, in <module>
    from nmigen.sim.cxxsim import Simulator
ModuleNotFoundError: No module named 'nmigen.sim.cxxsim'

Currently I use nMigen d714d78 (HEAD of 14/07/2020).
I did git update, but even with the latest one, I can't locate any
nmigen.sim.cxxsim module... What am I doing wrong? Could you point me to
the working commit of nMigen?

Best regards,
Comment 19 Luke Kenneth Casson Leighton 2020-07-21 13:50:32 BST
(In reply to Jean-Paul.Chaput from comment #18)
> Hello Luke,
> 
> I'm starting to work on:
>     soc/src/soc/simple/test/test_issuer.py
> 
> And, of course, I get problems. First, I updated all the libre-soc git
> repositories and re-installed them. Then I got into various errors
> with nMigen. The latest being:
> 
> Traceback (most recent call last):
>   File "./test_issuer.py", line 21, in <module>
>     from soc.simple.test.test_core import (setup_regs, check_regs,
>   File ".../soc/src/soc/simple/test/test_core.py", line 28, in <module>
>     from soc.fu.alu.test.test_pipe_caller import ALUTestCase
>   File ".../soc/src/soc/fu/alu/test/test_pipe_caller.py", line 5, in <module>
>     from nmigen.sim.cxxsim import Simulator
> ModuleNotFoundError: No module named 'nmigen.sim.cxxsim'

ah right.  yes.  that relies on the nmigen cxx_sim branch, apologies.
let me sort that with a try/except Import.
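
A try/except around an import of this kind might look like the sketch below. It is not the actual change made in the soc repository: the helper name `optional_import` is invented, and only the module name `nmigen.sim.cxxsim` comes from the traceback above.

```python
import importlib
import sys


def optional_import(name):
    """Import a module if it is available; otherwise warn and return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here
        print("warning: {} not available, skipping".format(name), file=sys.stderr)
        return None


# Only present on the nmigen cxx_sim branch, per the reply above.
cxxsim = optional_import("nmigen.sim.cxxsim")
if cxxsim is None:
    print("running without cxxsim")
```

Code that wants the faster simulator can then test `cxxsim is not None` instead of failing at import time.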
Comment 20 Luke Kenneth Casson Leighton 2020-07-21 13:54:13 BST
jean-paul: git pull on soc.

commit 3c425869fd36a73a040e07b58050069c6022b0de (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Jul 21 13:53:28 2020 +0100

    make cxxsim optional and print warning
Comment 21 Jean-Paul Chaput 2020-07-21 14:59:33 BST
Sorry to bother you... Same player shoot again.

Traceback (most recent call last):
  File "./test_issuer.py", line 36, in <module>
    from soc.simulator.test_sim import (GeneralTestCases, AttnTestCase)
  File ".../soc/src/soc/simulator/test_sim.py", line 3, in <module>
    from nmigen.test.utils import FHDLTestCase
ModuleNotFoundError: No module named 'nmigen.test'
Comment 22 Luke Kenneth Casson Leighton 2020-07-21 15:15:59 BST
(In reply to Jean-Paul.Chaput from comment #21)
> Sorry to bother you... Same player shoot again.
> 
> Traceback (most recent call last):
>   File "./test_issuer.py", line 36, in <module>
>     from soc.simulator.test_sim import (GeneralTestCases, AttnTestCase)
>   File ".../soc/src/soc/simulator/test_sim.py", line 3, in <module>
>     from nmigen.test.utils import FHDLTestCase
> ModuleNotFoundError: No module named 'nmigen.test'

sigh it's likely nmigen.test has been removed (or moved) in the past day
or so. sorted:

commit 4d8a7e65660df9e41a061997631763d51dbe2124 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Jul 21 15:14:00 2020 +0100

    spurious imports of FHDLTestCase, should be from nmutil


generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
without coordinating: it's a moving target.
Comment 23 Jean-Paul Chaput 2020-07-21 18:03:38 BST
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jean-Paul.Chaput from comment #21)
> > Sorry to bother you... Same player shoot again. 
> commit 4d8a7e65660df9e41a061997631763d51dbe2124 (HEAD -> master,
> origin/master)
> Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
> Date:   Tue Jul 21 15:14:00 2020 +0100
> 
>     spurious imports of FHDLTestCase, should be from nmutil

  Got it working.
 
> generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
> without coordinating: it's a moving target.

  I totally agree. My update policy is to stick to a version as long
  as it works. Then, when it does not, update to the newest possible.
  So I make leaps between "very old" and "very new". Maybe I missed
  it, but I think you should keep track of the latest "compatible"
  nMigen version, and maybe put it in a doc file at the root of the
  soc repository. This way people would quickly know which one
  to install.
Comment 24 Luke Kenneth Casson Leighton 2020-07-21 19:50:21 BST
(In reply to Jean-Paul.Chaput from comment #23)
> >     spurious imports of FHDLTestCase, should be from nmutil
> 
>   Got it working.

excellent

> > generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
> > without coordinating: it's a moving target.
> 
>   I totally agree. My update policy is to stick to a version as long
>   as it works. Then, when it does not, update to the newest possible.
>   So I make leaps between "very old" and "very new". Maybe I missed
>   it, but I think you should keep track of the latest "compatible"
>   nMigen version, 

i am... except... well it's complicated, i am helping whitequark debug
cxxsim and also working on the processor: cxxsim should offer up to a *100*
times increase in simulation performance so is worth pursuing.

>   and maybe put it in a doc file at the root of the
>   soc repository. So this way people would quickly know which one
>   to install.

well i think we're good, now.  we did have a point where gtkwave wasn't
working, that i believe is fixed now.  and the spurious import is ok...
probably in the clear, now.

btw do do a "git pull" on soclayout, i just updated non_generated/test_issuer.il

i have removed two read ports on the fast regfile which is so ridiculously
large (20% of the gate area) that it justified the effort.

these were reading the PC and the MSR (Machine Status Register) and i decided
to pass them as "immediates" to Branch and Trap, respectively, rather than
have the CompUnits read them a *second* time from *another* Fast Regfile port.
Comment 25 Luke Kenneth Casson Leighton 2020-07-22 11:00:05 BST
Jean-Paul: ta-daaaa :)
https://ftp.libre-soc.org/2020-07-22_10-55.png

that's from last night.

....
....

  o  Configuration of ToolEngine<Etesian> for Cell <test_issuer>
     - Cell Gauge ...................................................... <sxlib>
     - Place Effort .......................................................... 2
     - Update Conf ........................................................... 2
     - Spreading Conf ........................................................ 1
     - Routing driven .................................................... false
     - Space Margin ......................................................... 5%
     - Aspect Ratio ....................................................... 100%
     - Bloat model .................................................... disabled
  o  Erasing previous placement of <test_issuer>
  o  Creating abutment box (margin:5% aspect ratio:100% g-length:66242.3)
     - Bloat space margin: 0%.
     - <Box 0l 0l 13200l 13200l>
     - GCell grid: [264x264]
  o  Converting <test_issuer> into Coloquinte.
     - H-pitch .............................................................. 5l
     - V-pitch .............................................................. 5l
     - Converting 88436 instances
     - Building RoutingPads (transhierarchical) ...
     - Converting 88746 nets

....
....

     - Track Segment Completion Ratio ....................... 99.99% [685064+94]
     - Wire Length Completion Ratio ..................... 99.98% [43489516+6675]
     - Wire Length Expand Ratio ........................... 6.02% [min:41021805]
     - Unrouted horizontals ........................................ 79.79% [75]
     - Unrouted verticals .......................................... 20.21% [19]
     - Done in .............................................. 4m 29.35s, 726.4Mb
     - Raw measurements ............................... 269.35s, +743884Kb/2.5Gb
  o  Checking Katana Database coherency.
  o  Driving Hurricane data-base.
     - Active AutoSegments .............................................. 788860
     - Active AutoContacts .............................................. 922308
     - AutoSegments ..................................................... 791697
     - AutoContacts ..................................................... 927982
     - Same Layer doglegs ............................................... 791697
     - Done in .................................................. 2.56s, 0 bytes
     - Raw measurements ................................... 2.56157s, +0Kb/2.5Gb
  o  Deleting ToolEngine<Katana> from Cell <test_issuer>
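
The percentages in the log above appear to follow from the bracketed counts. The sketch below recomputes them under assumptions that are not confirmed anywhere in this thread: that `[a+b]` means a routed plus b unrouted, that the expand ratio compares the routed wire length with the `min:` length, and that the GCell size is the abutment box side divided by the grid dimension.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Numbers copied from the log above.
routed_segments, unrouted_segments = 685064, 94
routed_length, unrouted_length = 43489516, 6675
min_length = 41021805
unrouted_h, unrouted_v = 75, 19
box_side, grid_dim = 13200, 264

track_completion = percent(routed_segments, routed_segments + unrouted_segments)
length_completion = percent(routed_length, routed_length + unrouted_length)
expand_ratio = percent(routed_length - min_length, min_length)
horizontal_share = percent(unrouted_h, unrouted_h + unrouted_v)
vertical_share = percent(unrouted_v, unrouted_h + unrouted_v)
gcell_side = box_side // grid_dim

print("track completion  %.2f%%" % track_completion)
print("length completion %.2f%%" % length_completion)
print("expand ratio      %.2f%%" % expand_ratio)
print("unrouted H / V    %.2f%% / %.2f%%" % (horizontal_share, vertical_share))
print("GCell side        %dl" % gcell_side)
```

Note that the 75 horizontal and 19 vertical unrouted segments add up to the 94 shown in the track segment line.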
Comment 26 Jean-Paul Chaput 2020-07-22 13:47:08 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> Jean-Paul: ta-daaaa :)
> https://ftp.libre-soc.org/2020-07-22_10-55.png
> 
> that's from last night.


Very nice. Almost completed without any optimization, that's a good omen.
I'm working on the "hierarchical" option. We will see which is best...
But the flat way seems still working.

Can you give me the total P&R time (as returned by the time command) ?
Comment 27 Luke Kenneth Casson Leighton 2020-07-22 14:00:05 BST
(In reply to Jean-Paul.Chaput from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > Jean-Paul: ta-daaaa :)
> > https://ftp.libre-soc.org/2020-07-22_10-55.png
> > 
> > that's from last night.
> 
> 
> Very nice. Almost completed without any optimization, that's a good omen.

indeed.  and it's soo preeettyyyy :)

> I'm working on the "hierarchical" option. We will see which is best...
> But the flat way seems still working.

yes - i think there's still some "configs" not connected? (what string
am i looking for to double-check that?)
 
> Can you give me the total P&R time (as returned by the time command) ?

ha, i have to re-run it.  bear in mind i am clamping the processor speed
on this laptop to 1ghz, to avoid the fan running permanently, pulling in
massive amounts of dust.
Comment 28 Jean-Paul Chaput 2020-07-22 15:12:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jean-Paul.Chaput from comment #26)
> > (In reply to Luke Kenneth Casson Leighton from comment #25)
> > > Jean-Paul: ta-daaaa :)
> > > https://ftp.libre-soc.org/2020-07-22_10-55.png
> > > 
> > > that's from last night.
> > 
> > 
> > Very nice. Almost completed without any optimization, that's a good omen.
> 
> indeed.  and it's soo preeettyyyy :)
> 
> > I'm working on the "hierarchical" option. We will see which is best...
> > But the flat way seems still working.
> 
> yes - i think there's still some "configs" not connected? (what string
> am i looking for to double-check that?)

  To be used with profit, you need to understand the overall way the
  P & R algorithm and data works. This would need a not so short
  explanation. Besides I'm also testing a "routing driven" placement
  which is not committed yet. I will experiment, then send you back
  the right set of tuning parameters.
    And, yes, I really do need to write a documentation about how
  the P & R works and how it relates to the configuration parameters
  so people can play too ;-)


> > Can you give me the total P&R time (as returned by the time command) ?
> 
> ha, i have to re-run it.  bear in mind i am clamping the processor speed
> on this laptop to 1ghz, to avoid the fan running permanently, pulling in
> massive amounts of dust.

  Nice feature. What (Linux) software does that? I'm interested.
  Especially since my laptop seems to develop the habit of overheating
  in my backpack when put in "suspend to RAM". Now I'm using suspend
  to disk, praying there will be no memory corruption...

  Don't bother. But if you have the whole trace, I can get the
  numbers from there. Especially the placement time and the
  layer assignment step, which is *way* too slow, have to find
  out where I did put a quadratic thing inside...
Comment 29 Luke Kenneth Casson Leighton 2020-07-22 15:37:48 BST
(In reply to Jean-Paul.Chaput from comment #28)

> > yes - i think there's still some "configs" not connected? (what string
> > am i looking for to double-check that?)
> 
>   To be used with profit, you need to understand the overall way the
>   P & R algorithm and data works. This would need a not so short
>   explanation. Besides I'm also testing a "routing driven" placement
>   which is not committed yet. I will experiment, then send you back
>   the right set of tuning parameters.
>     And, yes, I really do need to write a documentation about how
>   the P & R works and how it relates to the configuration parameters
>   so people can play too ;-)

:)

ok.
 
> 
> > > Can you give me the total P&R time (as returned by the time command) ?
> > 
> > ha, i have to re-run it.  bear in mind i am clamping the processor speed
> > on this laptop to 1ghz, to avoid the fan running permanently, pulling in
> > massive amounts of dust.
> 
>   Nice feature. What (Linux) software does that?

"apt-get install cpufreqd cpufreq-utils" then edit /etc/cpufreqd.conf,
create (or use) a pre-existing config - i have created "Performance Low",
set it to maxfreq=20% then further down on [Rule] AC=on set it as
the required profile.

you need acpid also installed, for this to work... *i believe*.

but the only reason that works is because i do *not* run "standard
desktop window manager software".  i run fvwm2, which over two
decades has been customised and added to (a manual systemtray program
for example, placed using a line in ~/.xinitrc)

so if you do happen to use gnome or kde it *might* interfere with the
above, i.e. you *might* have to look in whatever-control-panel-blah-blah
is in use, and there i can't help you.


> I'm interested.
>   Especially since my laptop seems to develop the habit of overheating
>   in my backpack when put in "suspend to RAM". Now I'm using suspend
>   to disk, praying there will be no memory corruption...

s2disk tends to work very well, i have found.  only "unreliable laptops"
tend to crash during resume, where s2ram tends not to come back to life.

 
>   Don't bother. But if you have the whole trace, I can get the
>   numbers from there. Especially the placement time and the
>   layer assignment step, which is *way* too slow, have to find
>   out where I did put a quadratic thing inside...

it finished just now

real    89m53.343s
user    87m32.070s
sys     2m22.477s

holy s*** the nohup.out file is 872 mb.

it will be at https://ftp.libre-soc.org/nohup.out.bz2 shortly - do not download it immediately, because i am still rsync'ing it up and compressing it at the same time.

ok it's done.

please let me know when you have it (intact) because at 67mb that's far
too large to leave on the server.
Comment 30 Jean-Paul Chaput 2020-07-22 16:03:36 BST
> so if you do happen to use gnome or kde it *might* interfere with the
> above, i.e. you *might* have to look in whatever-control-panel-blah-blah
> is in use, and there i can't help you.

  I'm under Xfce... I'm also using Emacs in Vi mode 8-) (to get shouted at
  by both sides).
 
> s2disk tends to work very well, i have found.  only "unreliable laptops"
> tend to crash during resume, where s2ram tends not to come back to life.

  I will try it if the standard hibernate fails...
 
> it finished just now
> 
> real    89m53.343s
> user    87m32.070s
> sys     2m22.477s
> 
> holy s*** the nohup.out file is 872 mb.

  That's "very verbose" mode for you! In this mode, for each routing
  event processed, it writes one line. As it does not put a newline, you
  don't see it, except when redirected into a file. You get a very, very
  long line... You have about 700,000 track segments, so on average
  two events per segment, so at least 1.4M lines...
     This feature is useful to me when I get determinism problems,
  I can perform comparison and see exactly where and on which event
  the divergence occurs.
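
  The comparison described above can be sketched as follows. This is only an illustration, not the tool actually used: it assumes each run has already been split into a list of event strings (the separator used in the log is not stated here), and `first_divergence` is a hypothetical helper name.

```python
from itertools import zip_longest


def first_divergence(events_a, events_b):
    """Return (index, event_a, event_b) for the first position where the
    two event sequences differ, or None when they are identical.

    If one run is shorter than the other, the missing event is None.
    """
    for index, (a, b) in enumerate(zip_longest(events_a, events_b)):
        if a != b:
            return index, a, b
    return None


# Made-up event strings, only to show the call.
run1 = ["ev0", "ev1", "ev2"]
run2 = ["ev0", "evX", "ev2"]
print(first_divergence(run1, run2))
```

  With a deterministic router, two runs on the same input should give `None`.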

> it will be at https://ftp.libre-soc.org/nohup.out.bz2 shortly - do not
> download it immediately, because i am still rsync'ing it up and compressing
> it at the same time.
> 
> ok it's done.
> 
> please let me know when you have it (intact) because at 67mb that's far
> too large to leave on the server.

Seems I got kicked out ;-)

--2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
Resolving ftp.libre-soc.org (ftp.libre-soc.org)... 46.235.227.77, 2a00:1098:82:f::1
Connecting to ftp.libre-soc.org (ftp.libre-soc.org)|46.235.227.77|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-07-22 16:50:40 ERROR 403: Forbidden.
Comment 31 Luke Kenneth Casson Leighton 2020-07-22 17:05:07 BST
(In reply to Jean-Paul.Chaput from comment #30)
> > so if you do happen to use gnome or kde it *might* interfere with the
> > above, i.e. you *might* have to look in whatever-control-panel-blah-blah
> > is in use, and there i can't help you.
> 
>   I'm under Xfce... I'm also using Emacs in Vi mode 8-) (to get shouted at
>   by both sides).

niiice :)

xfce is still a "full desktop" that uses parts of gnome2 low-level infrastructure so it maaay still interfere, and/or there may be somewhere in xfce4 control panel.

only by using a 25-year-old window manager (fvwm2) and running it with
"startx" do i actually get full control over what i want, with no interference.


>      This feature is useful to me when I get determinism problems,
>   I can perform comparison and see exactly where and on which event
>   the divergence occurs.

yeh no makes sense

> Seems I got kicked out ;-)
> 
> --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2

# chmod ugo+r ./nohup.out.bz2

try again
Comment 32 Jean-Paul Chaput 2020-07-22 17:11:11 BST
> > Seems I got kicked out ;-)
> > 
> > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> 
> # chmod ugo+r ./nohup.out.bz2
> 
> try again

  Got it. You can remove...
Comment 33 Jean-Paul Chaput 2020-07-23 12:01:19 BST
> > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> 
> # chmod ugo+r ./nohup.out.bz2
> 
> try again

OK. When looking at the log file, I did see that you did make the P&R
twice... As it is deterministic, you get twice the same result, *but*
very strangely, the second run is much slower than the first.

Runs:
   Place       GlobR   BDetR   LAssign   DetR   Destroy   Total
1    394 + 2 +    58 +    34 +     685 +  270 +       3   1446 (24 minutes)
2   1496 + 8 +   226 +   137 +    2205 + 1010 +       8   5090 (84 minutes)

You can find those times by searching for 'Done in' in the log file.

There may be a flaw in the Makefile system. As the routing fails, a
"failed" status is returned to the calling rule, so it may start the
P&R again. Did you just do a "make lvx" or "make layout; make lvx"?
I'm also curious about why so different runtimes. Was your computer
much more loaded the second time? Or did you not throttle the CPU the
first time?

You can see that LAssign (Layer Assignment) takes more time than
the whole placement. This is not normal considering what it does.
So, if I find what's wrong we can win almost 10 minutes over 24...
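
A minimal sketch of how such a table could be rebuilt from the log. It assumes every step reports its duration in the seconds-only form seen earlier in this bug ("Done in ...... 2.56s, 0 bytes"); the regular expression and the helper name `done_in_seconds` are illustrative assumptions, not part of Coriolis. The two lists are the per-step values from the table above.

```python
import re

# Matches e.g. "- Done in ............ 2.56s, 0 bytes" (assumed format).
DONE_IN = re.compile(r"Done in\s*\.*\s*([0-9]+(?:\.[0-9]+)?)s")


def done_in_seconds(lines):
    """Return the step durations (in seconds) found in 'Done in' lines."""
    times = []
    for line in lines:
        match = DONE_IN.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


# Per-step values copied from the table: Place (two terms), GlobR, BDetR,
# LAssign, DetR, Destroy.
run1 = [394, 2, 58, 34, 685, 270, 3]
run2 = [1496, 8, 226, 137, 2205, 1010, 8]
for steps in (run1, run2):
    total = sum(steps)
    print(total, "s =", total // 60, "min", total % 60, "s")
```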
Comment 34 Luke Kenneth Casson Leighton 2020-07-23 12:20:21 BST
(In reply to Jean-Paul.Chaput from comment #33)
> > > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> > 
> > # chmod ugo+r ./nohup.out.bz2
> > 
> > try again
> 
> OK. When looking at the log file, I did see that you did make the P&R
> twice... 

i did?  i didn't!  however i did run "make cgt" in a separate window
in order to get the... no wait, that was before doing this run.

wasn't me, boss

> As it is deterministic, you get twice the same result, *but*
> very strangely, the second run is much slower than the first.
> 
> Runs:
>    Place       GlobR   BDetR   LAssign   DetR   Destroy   Total
> 1    394 + 2 +    58 +    34 +     685 +  270 +       3   1446 (24 minutes)
> 2   1496 + 8 +   226 +   137 +    2205 + 1010 +       8   5090 (84 minutes)
> 
> You can find those times by searching for 'Done in' in the log file.
> 
> There may be a flaw in the Makefile system. As the routing fails, a
> "failed" status is returned to the calling rule, so it may start the
> P&R again. Did you just do a "make lvx" or "make layout; make lvx"?

"make layout" and in a separate window i had run "make cgt" - *before*
starting this run.

> I'm also curious about why so different runtimes. Was your computer
> much more loaded the second time? 

not really.  it is 8-core dual hyper-threaded

> Or did you not throttle the CPU the
> first time?

once it's set up it's a pain to change, so no change.
 
> You can see that LAssign (Layer Assignment) takes more time than
> the whole placement. This is not normal considering what it does.
> So, if I find what's wrong we can win almost 10 minutes over 24...

and if i let it run at 5ghz that saves time, too.

btw one other reason i really want to do sub-cell layouts is to have
the possibility of parallel make.

l.
Comment 35 Luke Kenneth Casson Leighton 2020-07-25 13:18:50 BST
okaay jean-paul, about the floor-plan layout version:

i have renamed all of the operands so that they now have the following
prefix format:

* oper_i_alu_PIPENAME{N}_{field}
* oper_i_ldst_ldst{N}_{field}

where earlier i recommended to put all I/O on the *bottom* (SOUTH) of each pipeline (except LD/ST which would specially have the Memory interface on NORTH) i thought about this a bit more, and realised that the opcode expansion is going to be too many wires.

oper_i_alu_alu0 and oper_i_alu_logical0 for example, the 32-bit incoming instruction is expanded to *ONE HUNDRED AND THIRTY* wires because it contains, for example, the expansion of the immediate field out to its full 64-bit.

i will do something about this... but not right now.

so what i figured was: those operand wires could come in at the *side* (RIGHT)
down at the bottom (SOUTH) part of the RIGHT (left?) side.

LDST:

    Memory (PortInterface)
                 ^
                 |
             +---|---------+
             |             |
             |  +-----+    |
             |  |     |    |
             | pipe2 pipe3 |
             |  ^     |    |
             |  |     v    |
 oper_i_alu->--pipe1 pipe4 |
             |  |     |    |
             +--|-----|----+
                ^     v
              IN regs OUT


the oper_i_alu_* needs to "propagate" through the pipeline in synchronisation
with the IN regs data, therefore it is sensible to have oper_i_alu_* be close to IN data/regs, rather than be on the opposite side.

therefore, if IN is better placed on the *right* of OUT at the SOUTH side, then oper_i_alu_* should _also_ go on the RIGHT side, close to the bottom.

many (most) of these Pipelines i expect to be quite "thin".  ALU0 for example, or TRAP0, or Branch0.

however MUL will be extremely "fat" (take up literally 50% of the entire width of the layout).

the "thin" ALUs are what concerns me, hence the idea of bringing in the oper_i_* signals in at the "side".

what this in turn means is that there will be some clear delineation / separation between the pipelines, potentially needing those "channels" you mentioned, because those oper_i signals will have to be routed *between* the pipelines (which are laid out in a horizontal line, just like in the CDC 6600 diagram).

https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-multiple_function_units.png

one other alternative is to have, exactly as is shown *in* that diagram,
complete separation between input and output on pipelines: however this
means that one of the register WRITE and READ port Buses need routing *round*
(to the top).  i am not keen on that.

in the arrangement where pipeline data goes in and out of the same side
(and the pipeline doubles back on itself), both the WRITE and READ regfile
ports may be very close to the pipelines.


fascinatingly, the full auto-route version actually *mixes* the FAST regfile,
TRAP0 pipeline and BRANCH0 pipeline, interspersed in some regularly
patterned "blobs", side-by-side with each other.

this because TRAP0 and BRANCH0 both need significant access to the FAST
regfile.

anyway.

yes.

summary of idea: oper_i_alu_* field signals on EAST or WEST, and i've
named them conveniently so that they can be searched for, easily, with
pattern-matching that prefix.
Comment 36 Jean-Paul Chaput 2020-07-26 23:23:18 BST
Thanks for the hints.

Just to let you know, I'm working on the floorplan, so I'm studying closely
the structure of the issuer netlist, as created after going through Yosys.

I've noticed something "unusual" (in my experience in hierarchical ASICs),
some models have blocks and a few standard cells (relatively speaking).
When processed, flat, this is not a problem, but if we want to place
each block, then the top level, the placement of those stray cells
may be difficult (meaning: far from optimal).
Comment 37 Luke Kenneth Casson Leighton 2020-07-27 00:12:36 BST
(In reply to Jean-Paul.Chaput from comment #36)
> Thanks for the hints.
> 
> Just to let you know, I'm working on the floorplan, so I'm studying closely
> the structure of the issuer netlist, as created after going through Yosys.

"show (modulename)" helps there.


> I've noticed something "unusual" (in my experience in hierarchical ASICs),
> some models have blocks and a few standard cells (relatively speaking).
> When processed, flat, this is not a problem, but if we want to place
> each block, then the top level, the placement of those stray cells
> may be difficult (meaning: far from optimal).

ta-daaa, nooow you know why i suggested the "group placement" idea :)

just like in the alu16 example, it contains:

* a add16 cell
* a sub16 cell
* some gates that mux add16 and sub16

i.e. not:

* a add16 cell
* a sub16 cell
* another cell containing the mux
* wires-only between those 3 cells.

it is basically pretty normal to do this, particularly for pipeline layouts.

for example we have:

* a combinatorial stage1 cell
* a combinatorial stage2 cell etc etc
* some gates that include registers but NOT in their own cell(s)

i.e. the source code was basically given a list of combinatorial blocks as modules, told what the API is, and *in that parent* throws some register latches together in a for-loop, dropping both the sub-modules and the registers into its context.

in this way you regularly get very large blocks mixed in with "a few gates".

this is why i suggested the modification to allow a *list* of cells to be placed in a Box.

and also as you see in alu16 i created that function which tracks down the connections between cells by pattern matching on the *net* name rather than have to name the cells explicitly.

changing the design to put everything into submodules, this would significantly reduce code clarity.

however if it proves absolutely necessary it *MAY* be possible to automate that, at the python level.

it would require a change to the entire codebase which is hundreds of modules so would be a last resort.
Comment 38 Luke Kenneth Casson Leighton 2020-07-27 00:33:02 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/alu_hier.py;h=59bca26e358051b9579a9686833c8a9c93a1e393;hb=8c398d8c100be10b21cc2c193b39a112cc331dc1#l188

example.

see those 4 submodules, add, sub, mul, shift? really clear, nice modular code, right?

and further down, you can see ready/valid signalling to manage them, yes?

whilst the add, sub etc. end up as individual big cells, the ready/valid logic is *in this module as well* and that is where the dozens of "unmanaged" cells end up coming from which are not part of any "child" module.

to try to _create_ a submodule manually just to "contain" these, it is not a good idea.

firstly, it makes a mess of the code clarity.

secondly, some of those cells would be best placed in between say add and sub, whilst others are best placed between shift and mul.  others, near *some* inputs, others near *some* outputs.

or.. whatever.  you get the general idea.

one way to cheat is to simply put the child blocks in the parent in such a way that these "floating" cells simply have very little choice about where else they can be placed.  right against the left edge, for example, or make the child the exact same width as the parent.

the ideal thing however, the "base" to work from, is to allow etesian.place to take a *list* of cells to be placed.

a second improvement would be to be able to specify a *list* of abutment boxes to etesian place, which are unioned together to create a complex area beyond a rectangle.

however the list of boxes can be (partly) done by calling etesian.place multiple times with different batches of cells.
Comment 39 Jean-Paul Chaput 2020-07-27 10:30:23 BST
> * a add16 cell
> * a sub16 cell
> * some gates that mux add16 and sub16
> 
> i.e. not:
> 
> * a add16 cell
> * a sub16 cell
> * another cell containing the mux
> * wires-only between those 3 cells.
> 
> it is basically pretty normal to do this, particularly for pipeline layouts.
> 
> for example we have:
> 
> * a combinatorial stage1 cell
> * a combinatorial stage2 cell etc etc
> * some gates that include registers but NOT in their own cell(s)
> 
> i.e. the source code was basically given a list of combinatorial blocks as
> modules, told what the API is, and *in that parent* throws some register
> latches together in a for-loop, dropping both the sub-modules and the
> registers into its context.
> 
> in this way you regularly get very large blocks mixed in with "a few gates".

I understand very well your need for the clearest possible code.
I think it should not be needed that you isolate the few additional gates
in a special (and artificial) block.

My concern is that, if we take the ADD/SUB, the blocks will be
dwarfing the stray cells, and we get very sub-optimal placement.

Etesian is already capable of using ADD then SUB as big placed blocks
and then placing the remaining few cells around them, where we left
some free space, so inside a non-square area. That is, a square area,
minus the area of the already placed blocks. This is what is done in
an earlier layout experiment with doAlu16.

We may get away with it, *if* the stray cells are clearly a vector
and we can find back easily their matrix structure. But even then,
it has to be a very very regular structure or we will lose a big
amount of free space if the row or columns are "dented".
This is what you hinted in your signal naming scheme, I will see
what I can do.
Comment 40 Luke Kenneth Casson Leighton 2020-07-27 11:00:26 BST
(In reply to Jean-Paul.Chaput from comment #39)
> > * a add16 cell
> > * a sub16 cell
> > * some gates that mux add16 and sub16
> > 
> > i.e. not:
> > 
> > * a add16 cell
> > * a sub16 cell
> > * another cell containing the mux
> > * wires-only between those 3 cells.
> > 
> > it is basically pretty normal to do this, particularly for pipeline layouts.
> > 
> > for example we have:
> > 
> > * a combinatorial stage1 cell
> > * a combinatorial stage2 cell etc etc
> > * some gates that include registers but NOT in their own cell(s)
> > 
> > i.e. the source code was basically given a list of combinatorial blocks as
> > modules, told what the API is, and *in that parent* throws some register
> > latches together in a for-loop, dropping both the sub-modules and the
> > registers into its context.
> > 
> > in this way you regularly get very large blocks mixed in with "a few gates".
> 
> I understand very well your need for the clearest possible code.
> I think it should not be needed that you isolate the few additional gates
> in a special (and artificial) block.
> 
> My concern is that, if we take the ADD/SUB, the blocks will be
> dwarfing the stray cells, and we get very sub-optimal placement.

this is why the place-by-list.  if there is huge space, and the number
of stray cells relatively small, the priority is to get them roughly
in the right place, not to get them super-efficiently packed.

setting 10% or even 50% extra etesian space on something that contains
only 50 stray cells, when the alternative is that they are placed miles
away from the (large) sub-cells, this is way better, even though the
Etesian place was never originally designed for such tiny placement.

> We may get away with it, *if* the stray cells are clearly a vector
> and we can find back easily their matrix structure. 

ah.  right.  so this was where i began exploring an alternative to matrix
design, in alu16.py

written in python, i created a recursive "net-analyser" subroutine.

its parameters give a starting *net* - not a cell pattern-match - a *net*
pattern-match list - and it will loop on the following:

* find all cells connected to the net-pattern.
* find all nets connected to those cells
* if there are no new nets, stop.
* otherwise:
   A) add the cells found to the "accumulated result"
   B) continue recursively searching with the NEWLY FOUND nets
      to find MORE cells

it is a Graph walking algorithm, basically, identifying related cells given
net names.
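
the loop above can be sketched in plain python. this is *not* the alu16.py code and does not use the coriolis2 API: the netlist is a hypothetical dict mapping each cell name to the nets it touches, `find_related_cells` is an illustrative name, and the `ignore_nets` parameter is an added assumption (without it, a global net such as clk would pull in every clocked cell).

```python
from fnmatch import fnmatchcase


def find_related_cells(cell_nets, net_patterns, ignore_nets=()):
    """Walk cell -> net -> cell starting from nets matching net_patterns.

    cell_nets: dict {cell_name: iterable of net names} (hypothetical model).
    Note fnmatch treats [ ] as a character class, so a literal bus bit such
    as o[15] has to be escaped (glob.escape does this).
    """
    # Invert the mapping once: net name -> set of cells connected to it.
    net_cells = {}
    for cell, nets in cell_nets.items():
        for net in nets:
            net_cells.setdefault(net, set()).add(cell)

    ignored = set(ignore_nets)
    frontier = {net for net in net_cells
                if net not in ignored
                and any(fnmatchcase(net, pat) for pat in net_patterns)}
    seen_nets = set(frontier)
    found_cells = set()

    while frontier:                      # stop when no new nets are found
        cells = set()
        for net in frontier:             # all cells connected to these nets
            cells |= net_cells[net]
        found_cells |= cells             # A) accumulate the cells
        new_nets = set()
        for cell in cells:               # all nets connected to those cells
            new_nets.update(cell_nets[cell])
        frontier = new_nets - seen_nets - ignored   # B) only NEW nets
        seen_nets |= frontier
    return found_cells


# Made-up stray cells sitting behind an adder output.
example = {
    "mux0": ["add_o0", "sub_o0", "o0"],
    "buf0": ["o0", "out0"],
    "ff0": ["out0", "q0", "clk"],
    "inv9": ["x", "y", "clk"],
}
print(sorted(find_related_cells(example, ["add_o*"], ignore_nets=["clk"])))
```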

in this way you *do not* need to know *anything* about the names of the
cells that are connected *to* the nets.  they can change any time, and
you just don't care, we can continue developing the HDL and *not* worry about having to throw away massive amounts of coriolis2 hand-crafted layout code (because there isn't any).

what you care about is "what cells are connected to o[15]" and the
function will give you that answer.

*this* allows to pick up the "stray cells" because you obviously know the Inputs/Output NET names of the sub-block (ADD16, SUB16), and can run a for-loop on them, calling this function and accumulating everything associated with them.

what we do not want to then have to do is to do manual placement on those stray cell groups, having identified them all (or appropriate subsets), we just need to be able to do a small "rough" placement / grouping, in an area that will be mostly routing wires, in between one of the sub-blocks and another sub-block, or between any given sub-block and the I/O of the parent.
Comment 41 Luke Kenneth Casson Leighton 2020-07-29 14:22:51 BST
jean-paul i just checked something to be possible in yosys: to be able
to flatten individual modules rather than all of it (top).

this works fine.

so, to support this: if the YOSYS_FLATTEN can take, instead of a "yes/no"
(might need a new Makefile parameter, YOSYS_FLATTEN_LIST), the following:

          YOSYS_FLATTEN_LIST=`cat to_flatten.txt`

and "to_flatten.txt" to contain at least:

fast
cr
xer
slow
int
pdecode2
alu0
branch0
cr0
trap0
ldst0

and probably many more (basically the list of everything for which a top-level
block is to be written) this will get rid of many of the problems of
"dangling nets" without having to have a full flatten.

btw one other way is for that YOSYS_FLATTEN_LIST to be the output from
a python script that actually notices what's been declared as being
sub-cells (top level hierarchy) rather than have a separately-maintained
file that could get out of sync.
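
a minimal sketch of the file-based variant, assuming to_flatten.txt holds one module name per line. skipping blank lines and "#" comments is an added assumption, the function names are hypothetical, and how the Makefile would consume YOSYS_FLATTEN_LIST is still to be decided.

```python
def read_flatten_list(text):
    """Return module names from the contents of to_flatten.txt.

    One name per line; blank lines and '#' comments are skipped and
    duplicates are dropped while keeping the original order.
    """
    names = []
    for raw in text.splitlines():
        name = raw.split("#", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


def makefile_value(names):
    """Join the names into a single space-separated Makefile value."""
    return " ".join(names)


sample = "fast\ncr\nxer\nslow\nint\npdecode2\nalu0\n"
print("YOSYS_FLATTEN_LIST=" + makefile_value(read_flatten_list(sample)))
```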
Comment 42 Jean-Paul Chaput 2020-07-29 15:18:08 BST
(In reply to Luke Kenneth Casson Leighton from comment #41)
> jean-paul i just checked something to be possible in yosys: to be able
> to flatten individual modules rather than all of it (top).
> 
> this works fine.
> 
> so, to support this: if the YOSYS_FLATTEN can take, instead of a "yes/no"
> (might need a new Makefile parameter, YOSYS_FLATTEN_LIST), the following:
> 
>           YOSYS_FLATTEN_LIST=`cat to_flatten.txt`
> 
> and "to_flatten.txt" to contain at least:
> 
> fast
> cr
> xer
> slow
> int
> pdecode2
> alu0
> branch0
> cr0
> trap0
> ldst0
> 
> and probably many more (basically the list of everything for which a
> top-level
> block is to be written) this will get rid of many of the problems of
> "dangling nets" without having to have a full flatten.
> 
> btw one other way is for that YOSYS_FLATTEN_LIST to be the output from
> a python script that actually notices what's been declared as being
> sub-cells (top level hierarchy) rather than have a separately-maintained
> file that could get out of sync.

Good catch. I will integrate that, and try to implement a reliable way
of generating the list.

Still working through the placement of the issuer. The placement of the
IO pins of each block is a lengthy and tedious work... Have to solve
shortage of length along some sides. Hope to have something tomorrow.
Comment 43 Luke Kenneth Casson Leighton 2020-07-29 15:56:44 BST
(In reply to Jean-Paul.Chaput from comment #42)

> Good catch. I will integrate that, and try to implement a reliable way
> of generating the list.

cool.
 
> Still working through the placement of the issuer. The placement of the
> IO pins of each block is a lengthy and tedious work... Have to solve
> shortage of length along some sides.


in https://bugs.libre-soc.org/show_bug.cgi?id=199#c35 i described that i
put a prefix in front of names for the operand data, to make that easier.

it should _not_ be laborious, *at all*, to identify the signals.
it should be extremely easy, one line, one pattern.

also: yes as in https://bugs.libre-soc.org/show_bug.cgi?id=199#c35 if you
try to put everything on SOUTH it is pretty much guaranteed not to be
enough space, hence i suggested putting oper_i_* on EAST (or WEST).

for alu0 (and others), the pattern you probably discovered already is:

* oper_i_alu_**** (these are the ones i suggested to bring in on E/W)
* issue_i
* busy_o / done_o
* shadown_i
* rdmaskn
* rd_go / rd_rel
* wr_go / wr_rel
* srcNN_i
* destNN_o and their matching *_ok

please remember: *anything* that is not "convenient" (like this big list
which cannot be easily identified with a couple of wildcard matches), this
means "i got something wrong", ok?


> Hope to have something tomorrow.

feel free to commit regularly even unfinished work for review.  that would
give me a chance to "fix" things that are clearly taking time.
Comment 44 Luke Kenneth Casson Leighton 2020-07-29 16:25:10 BST
commit 2c2a9a9ddf07f77ebfb06abb1e7691d462549c19 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Jul 29 16:19:08 2020 +0100

    bit of a big change: add prefixes "cu_" to all CompUnit management signals
    also change go/rel to go_i and rel_o at the same time

i've just sorted that so now e.g. alu0.vst should be more like this:

entity alu0 is
  port ( clk                    : in bit
       ; cu_go_die_i            : in bit
       ; cu_issue_i             : in bit
       ; oper_i_imm_data_imm_ok : in bit
       ; oper_i_invert_a        : in bit
....
....
       ; rst                    : in bit
       ; cu_shadown_i           : in bit
       ; src3_i                 : in bit
....
       ; cu_busy_o              : out bit
....
       ; vdd                    : linkage bit
       ; vss                    : linkage bit
       );
end alu0;

so now you should need only these search patterns:

- cu_*
- oper_i_*
- src*_i
- dest*_o
- *_ok

and that should cover everything.  save time! :)
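
a quick way to check the coverage, as a sketch only: it uses python's fnmatch on the port names visible in the alu0.vst excerpt above (the elided "...." ports are not included), and `group_ports` is a hypothetical helper, not coriolis2 code. a port matching two patterns (oper_i_imm_data_imm_ok matches both oper_i_* and *_ok) goes to the first pattern listed.

```python
from fnmatch import fnmatchcase

PATTERNS = ["cu_*", "oper_i_*", "src*_i", "dest*_o", "*_ok"]


def group_ports(port_names, patterns=PATTERNS):
    """Assign each port to the first pattern it matches.

    Returns (groups, unmatched): groups maps pattern -> list of ports,
    unmatched lists the ports that no pattern covers.
    """
    groups = {pattern: [] for pattern in patterns}
    unmatched = []
    for port in port_names:
        for pattern in patterns:
            if fnmatchcase(port, pattern):
                groups[pattern].append(port)
                break
        else:
            unmatched.append(port)
    return groups, unmatched


# Port names copied from the entity excerpt above.
ports = ["clk", "cu_go_die_i", "cu_issue_i", "oper_i_imm_data_imm_ok",
         "oper_i_invert_a", "rst", "cu_shadown_i", "src3_i", "cu_busy_o",
         "vdd", "vss"]
groups, unmatched = group_ports(ports)
print(groups)
print("not covered:", unmatched)
```

on this excerpt only clk, rst, vdd and vss fall outside the five patterns.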



i pushed a new test_issuer.il to experiment9/non_generated

commit 2747314b7ba8cb04f8ca50b1ed016a66831a18b0
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Jul 29 15:20:08 2020 +0000

    updated test_issuer.il to include new names
Comment 45 Luke Kenneth Casson Leighton 2020-07-29 22:14:20 BST
http://www.aholme.co.uk/6502/Main.htm

apparently there is an algorithm called "SubGemini" which solves the
recursive netlist walking issue: finding sub-circuit instances in a
larger circuit.

1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini: Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.
Comment 46 Jean-Paul Chaput 2020-07-30 16:51:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #45)
> http://www.aholme.co.uk/6502/Main.htm
> 
> apparently there is an algorithm called "SubGemini" which solves the
> recursive netlist walking issue: finding sub-circuit instances in a
> larger circuit.
> 
> 1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini:
> Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In
> Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.

  I will look into it if needs be. Thanks for the tips.

  I'm almost done for the P&R of the sub-blocks of test_issuer, and now
  it would be very helpful if you could provide me with a rough floorplan
  of the blocks:

  * fus (maybe an ordering of the FUs, but not mandatory).
  * int
  * fast
  * pdecode2
  * l0

  Other blocks at core level, like the priority pickers are too small
  to be taken into account as "blocks to place separately".

  And maybe some hint about the big busses...

  I'm finishing this because I'm stubborn, but it is already clear at 99%
  that it will give results *much* worse than the "flat" approach:

  As we place each block independently, we create huge contention points
  at the border of most blocks due to the amount of large buses. Then we
  have to route those buses *between* the blocks, forcing us to push
  them farther apart, not even talking about the capacitance/drive problem.
  Moreover, the box of a block cannot stray too far from a square factor if
  we want the placer to work (that is an AR between 0.5 and 2.0). There are
  exceptions, but that's the general idea. It would be a problem for the
  clock tree as its depth may vary between blocks of different sizes.
  And lastly, to reduce the size of the channels, we would need a careful
  analysis of where to place the buses (and "combing" the bits to avoid
  "flipping" a whole bus), which is a lengthy task.
   So, if we compare a "flat" block with maybe up to 20% of margin space
  and the sum of blocks at 5% to 20% of free space plus channels, the
  winner is clear. Staf wins again.
   The "good" block level is the core, I think.
Comment 47 Luke Kenneth Casson Leighton 2020-07-30 17:37:11 BST
On Thu, Jul 30, 2020 at 4:51 PM bugzilla-daemon--- via libre-soc-bugs <libre-soc-bugs@lists.libre-riscv.org> wrote:
>
> https://bugs.libre-soc.org/show_bug.cgi?id=199
>
> --- Comment #46 from Jean-Paul.Chaput@lip6.fr ---
> (In reply to Luke Kenneth Casson Leighton from comment #45)
> > http://www.aholme.co.uk/6502/Main.htm
> >
> > apparently there is an algorithm called "SubGemini" which solves the
> > recursive netlist walking issue: finding sub-circuit instances in a
> > larger circuit.
> >
> > 1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini:
> > Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In
> > Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.
>
>   I will look into it if needs be. Thanks for the tips.

i mention it because it is likely an actually *designed* and properly
researched version of that recursive netlist-cell-netlist-cell algorithm
i described and implemented.


>   I'm almost done for the P&R of the sub-blocks of test_issuer, and now
>   it would be very helpful if you could provide me with a rough floorplan
>   of the blocks:
>
>   * fus (maybe an ordering of the FUs, but not mandatory).

all in a line, left-right, all the same "height".

the ordering will need to be worked out, based on how close they are to their respective register files.  at some point this could be determined by a 1-Dimensional algorithm which optimises them however right now that's not a high priority.  once you have committed something i can take a look and experiment.

each of the FUs should definitely be flattened though: alu0, logical0, etc.

* all registers (srcN_i, destN_i, *_ok) should be on SOUTH.
* oper_i_* should be on WEST (or EAST, your choice)

one addition: for ldst0 the port interface (pi) to data bus should be on NORTH.

>   * int
>   * fast

each of these flattened (and spr, and xer, and cr as well, they are all regfiles) - i expect all of their inputs and outputs to be on the NORTH side.

>   * pdecode2

flattened.  raw_opcode_in would be on one side (from imem): LOTS of signals go out, and this i know is a problem that needs to be solved - but iteratively.


>   * l0

again flattened: Port Interface (pi) to go on SOUTH (so that LDST can attach to it) and the Wishbone D-Bus on NORTH which will go out of the whole block.

>   Other blocks at core level, like the priority pickers are too small
>   to be taken into account as "blocks to place separately".
>
>   And maybe some hint about the big busses...

a clear space in between the FUs and L0 (top half), and the regfiles below them (bottom half).  priority pickers _should_ end up placed arbitrarily in that same middle space.

pdecode should probably be right in the middle, either at the top or pretty much dead centre, and the i-bus come in at the top middle as well (aka imem).

l0 should definitely be at the top, somewhere along the top edge, with the d-bus coming into it, and its SOUTH port connected directly to the NORTH of the ldst0.


>   I'm finishing this because I'm stubborn, but it is already clear at 99%
>   that it will give results *much* worse than the "flat" approach:

it does however show clearly the places where routing does not "work"?


>   As we place each block independently, we create huge contention points
>   at the border of most blocks due to the amount of large buses. Then we
>   have to route those buses *between* the blocks, forcing us to push
>   them farther apart, 

yes.  192 wires in some cases.  i have a plan to reduce that to only 32 but it requires quite a bit of work: each FU will have its *own* decoder and receive *only* the 32-bit instruction.


> not even talking about the capacitance/drive problem.

hmmm.


>   Moreover, the box of a block cannot stray too far from a square shape if
>   we want the placer to work (that is, an AR between 0.5 and 2.0).

ah.  this i was expecting - idealistically - to "work" i.e. not be a problem.  the "long" ones (alu0 for example, or spr0), i expected it to be possible to auto-Place them efficiently even if they were long-ratio rectangles.

if this becomes a problem then potentially we can look at merging some of them together, if they have similar enough register profiles.


> There are
>   exceptions, but that's the general idea. It would be a problem for the
>   clock tree as its depth may vary between blocks of different sizes.
>   And lastly, to reduce the size of the channels, we would need a careful
>   analysis of where to place the buses (and "combing" the bits to avoid
>   having to "flip" a whole bus), which is a lengthy task.
>    So, if we compare a "flat" block with maybe up to 20% of margin space
>   and the sum of blocks at 5% to 20% of free space plus channels, the
>   winner is clear. Staf wins again.

:)

it is more to be able to point, clearly, "here is the regfile, here is the logical pipeline" etc.

but....

when we add the GPU version of the DIV/RSQRT/SQRT, and add the *MULTIPLE* IEEE754 FPUs, this will make the layout ***TEN*** times larger than it currently is.

at that point any bus space inefficiencies will be absolutely dwarfed by the size of *TWO* partitioned FP64 multiplier blocks and so on.

*one* of the 64-bit DIV/RSQRT/SQRT pipelines takes the size up from 75,000 to *200,000* cells, all on its own!  when you were on holiday i experimented and i managed to get it down to "only" 130,000 cells.


then, when we go to multi-issue it becomes even *more* interesting.  remember for the GPU version (single-core) we are expecting a size of around 300,000 to 400,000 cells.

at that point any hope of iterative development in a reasonable timeframe is out the window.  hence this exploration.
Comment 48 Luke Kenneth Casson Leighton 2020-07-30 17:40:09 BST
btw i must apologise, i had brought out too many signals for the test_issuer.il.  i have uploaded a new version, and you should find that cu_shadown_i and cu_go_die_i are all now "dropped" i.e. are internally set (within alu0 etc.)

some other signals that should not have been brought out are also dropped.
Comment 49 Luke Kenneth Casson Leighton 2020-08-02 21:44:19 BST
argh.

overnight i just realised something, jean-paul.

trying to put the oper_i on the side of each pipeline is completely pointless *unless* decoder2 is in the middle and the pipelines are staggered like stairs:

#### ## ## ## ####
#### ## ## ## ####
#### ##    ## ####
####    dec   ####

so the *really* small pipelines are at the apex, the medium sized ones either side, and the really big ones hard left or hard right (mul will be one of those)

the staggered approach basically gives oper_i_* a chance to come in horizontally into the corner...

... *WITHOUT* needing to do a right angle turn to get there.

if oper_i* has to turn from vertical to horizontal to get into the side of each pipeline then the vertical channel between pipelines has to be as wide as if you had made the pipeline itself that wide.

which is pointless.

a "staggered" step-up step-down "A Frame" layout with decode2 in the middle will do it.

we need some sort of ASCII art diagram, really, don't we, which lays this out.

can you commit what you have so far and i will take a look and add a quick diagram?
Comment 50 Luke Kenneth Casson Leighton 2020-08-02 22:50:00 BST
Created attachment 94 [details]
staggered pipeline quick drawing
Comment 51 Jean-Paul Chaput 2020-08-03 23:55:47 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)
> Created attachment 94 [details]
> staggered pipeline quick drawing

  I've just committed a basic demonstrator for the recursive block
  management. It is far from perfect and surely will exhibit bugs...
  But I think you can start playing with it. The example is a very
  bad placement (all the FUs in line).

  I did write a very quick GUIDELINES.rst to explain some of the
  most important points in floorplanning.

  You can extract basic gates statistics with the "stats" plugin.
  ( Misc ==> Alpha ==> Statistics ) It will help you see which
  blocks are big and where the gates come from.

  I will now focus on the flat approach and to add the missing
  features that we absolutely need for the TSMC run.

  It may be best to rebuild Coriolis from scratch, as I moved
  the Python cumulus modules around a little.
Comment 52 Luke Kenneth Casson Leighton 2020-08-03 23:59:01 BST
(In reply to Jean-Paul.Chaput from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > Created attachment 94 [details]
> > staggered pipeline quick drawing
> 
>   I've just committed a basic demonstrator for the recursive block
>   management. It is far from perfect and surely will exhibit bugs...
>   But I think you can start playing with it. The example is a very
>   bad placement (all the FUs in line).

that's what i thought would work.

>   I did write a very quick GUIDELINES.rst to explain some of the
>   most important points in floorplanning.

appreciated.

thank you. will try tomorrow.

>   You can extract basic gates statistics with the "stats" plugin.
>   ( Misc ==> Alpha ==> Statistics ) It will help you see which
>   blocks are big and where the gates come from.

ok.
 
>   I will now focus on the flat approach and to add the missing
>   features that we absolutely need for the TSMC run.
> 
>   It may be best to rebuild Coriolis from scratch, as I moved
>   the Python cumulus modules around a little.

yes i saw this:

d6b90a70
by Jean-Paul Chaput at 2020-08-02T18:15:15+02:00
Enable Etesian to compute AB of fixed width.
Comment 53 Luke Kenneth Casson Leighton 2020-08-05 15:01:20 BST
jean-paul i'm running the doDesign.py and adding div0.  it says
"div0 has 12 badly placed pins" but doesn't say what they are.


Python stack trace:
#0 in                  __init__() at .../install/lib/python2.7/dist-packages/crlcore/helpers/io.py:167
#1 in               placeIoPins() at .../dist-packages/cumulus/plugins/alpha/block/block.py:411
#2 in                     build() at .../dist-packages/cumulus/plugins/alpha/block/block.py:501
#3 in                     build() at .../dist-packages/cumulus/plugins/alpha/block/block.py:496
#4 in                scriptMain() at /home/lkcl/soclayout/experiments9/doDesign.py:666
#5 in                  <module>() at .../coriolis-2.x/Linux.x86_64/Release.Shared/install/bin/cgt:201
Comment 54 Luke Kenneth Casson Leighton 2020-08-05 15:02:11 BST
ok yep it's in the debug output.

[ERROR] Side.place(IoPin): Pin "src1_i(58).0" is outside north or south abutment box side.
        (x:2910l, xAB: [0l:2900l])
                          IoPin.place() N/S @<Point 2910l 0l> "src1_i(58).0" of "<id:67204 Net "src1_i(58)" e-- LOGICAL i--- (IN)>".

[ERROR] Side.place(IoPin): Pin "src1_i(59).0" is outside north or south abutment box side.
        (x:2960l, xAB: [0l:2900l])
                          IoPin.place() N/S @<Point 2960l 0l> "src1_i(59).0" of "<id:67203 Net "src1_i(59)" e-- LOGICAL i--- (IN)>".
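
the two errors above are consistent with a bus whose pins are laid out at a constant pitch along a side that is too short. a rough sketch of that check follows. the pitch (50) and offset (10) are only inferred from the two pin positions in the log, the 64-pin bus width is an assumption, and the helper name is hypothetical (this is not the Coriolis code):

```python
def pins_outside_side(positions, ab_min, ab_max):
    """Return the names of pins whose coordinate is outside [ab_min, ab_max]."""
    return [name for name, x in positions if not ab_min <= x <= ab_max]


# Assumption: 64 pins spread at a constant pitch along the side.
# pitch and offset are inferred from src1_i(58) at 2910 and src1_i(59) at 2960.
pitch, offset = 50, 10
positions = [("src1_i(%d)" % i, offset + pitch * i) for i in range(64)]

# Abutment box extent along x, as reported in the log: [0:2900].
bad = pins_outside_side(positions, ab_min=0, ab_max=2900)
print(len(bad), "pins outside:", ", ".join(bad))
```

under these assumptions the last pin would sit at 10 + 50 * 63 = 3160, so the side would have to be at least that long (or the pitch smaller) for the whole bus to fit.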
Comment 55 Luke Kenneth Casson Leighton 2020-08-05 15:19:20 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23
Comment 56 Luke Kenneth Casson Leighton 2020-08-05 15:24:15 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/24
Comment 57 Jean-Paul Chaput 2020-08-07 12:43:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #55)
> https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23

I've just pushed various commits that *may* solve the problem.
I'm not totally sure, as there are still off-side pins that prevent
me from going all the way.

I took the liberty to beautify a little doDesign.py, hopefully respecting
your Python style. I know that for IO pin specification, the list is not
PEP8 compliant, but I think that a "spreadsheet" presentation with clearly
aligned columns is better to immediately spot missing parameters or errors.

I also removed the utils module, as now Coriolis should supply equivalent
features.
Comment 58 Luke Kenneth Casson Leighton 2020-08-08 11:24:21 BST
(In reply to Jean-Paul.Chaput from comment #57)
> (In reply to Luke Kenneth Casson Leighton from comment #55)
> > https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23
> 
> I've just pushed various commits that *may* solve the problem.
> I'm not totally sure, as there are still off-side pins that prevent
> me from going all the way.
> 
> I took the liberty to beautify a little doDesign.py, hopefully respecting
> your Python style. I know that for IO pin specification, the list is not
> PEP8 compliant, but I think that a "spreadsheet" presentation with clearly
> aligned columns is better to immediately spot missing parameters or errors.

what i would like, there, is wildcard matching e.g. starts with "oper_i" or ends with "_ok" and the pincount to be obtained from the object.

this will reduce 30-40 lines per block down to *five* and at the same time greatly increase clarity.
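
something along these lines is what i have in mind, as a sketch only. it uses python's standard fnmatch module; the rule table, the function name and the example net names are all hypothetical, and nothing here is an existing cumulus API:

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (wildcard pattern, side). The first match wins.
RULES = [
    ("oper_i*", "WEST"),
    ("*_ok",    "SOUTH"),
    ("src*",    "SOUTH"),
    ("dest*",   "SOUTH"),
]


def assign_sides(net_names, rules=RULES, default=None):
    """Map each net name to the side of the first rule whose pattern matches."""
    sides = {}
    for name in net_names:
        sides[name] = default
        for pattern, side in rules:
            if fnmatchcase(name, pattern):
                sides[name] = side
                break
    return sides


# Example net names, made up for illustration.
nets = ["oper_i_imm(0)", "src1_i(0)", "src1_i(1)", "dest1_o(0)", "cr_a_ok", "clk"]
sides = assign_sides(nets)
for name in nets:
    print(name, sides[name])
```

the pin count per side then falls out of how many names each rule matched, instead of being typed in by hand.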


> 
> I also removed the utils module, as now Coriolis should supply equivalent
> features.

ah excellent, glad you liked it.

Config is neat, ehn? :)


the number of oper_* pins is far too large.  this is the "expansion" of the instruction for convenience.  an example is the 64 bit immediate.

basically i am going to have to do "subset instruction decoders" that are *inside* mul0, *inside* alu0 and so on.

i have to find time to do that.

until then i think we stop with the floorplan version and go with "flat".
Comment 59 Luke Kenneth Casson Leighton 2020-08-11 15:52:54 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/25

"Katana BUG" after updating to a different setup for INT and FAST
regfiles.  the read/write Bus is now done using a MUX followed by
OR tree.
Comment 60 Jean-Paul Chaput 2020-08-11 23:02:40 BST
(In reply to Luke Kenneth Casson Leighton from comment #59)
> https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/25
> 
> "Katana BUG" after updating to a different setup for INT and FAST
> regfiles.  the read/write Bus is now done using a MUX followed by
> OR tree.

  The problem was due to an incorrect use of the CfgCache object,
  so the specific parameters were not taken into account.
  I pushed the correction in commit ee3bd54fdf0d788c8227380daa6afd8f787e7074
  
  Basically, the priority level was not set (default is not high
  enough to override) and anyway they were not applied.
  To apply the parameters, either explicitly call cfg.apply() or
  use the "with" construction.
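
  The difference between the two forms can be illustrated with a generic
  Python sketch. DemoCfg below is a hypothetical stand-in written for
  this example, it is *not* the Coriolis CfgCache API, and the parameter
  names are made up.

```python
class DemoCfg(object):
    """Hypothetical stand-in for a parameter cache (not the Coriolis API)."""

    def __init__(self, store):
        self._store = store    # dict playing the role of the live configuration
        self._pending = {}     # values recorded but not yet applied

    def set(self, key, value):
        self._pending[key] = value

    def apply(self):
        # Copy every pending value into the live configuration.
        self._store.update(self._pending)
        self._pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Apply automatically when the block exits without an exception.
        if exc_type is None:
            self.apply()
        return False


settings = {}

# Explicit form: nothing reaches `settings` until apply() is called.
cfg = DemoCfg(settings)
cfg.set("demo.margin", 0.2)
before_apply = dict(settings)   # snapshot taken before apply()
cfg.apply()

# "with" form: apply() runs on leaving the block.
with DemoCfg(settings) as cfg2:
    cfg2.set("demo.effort", 2)

print(before_apply, settings)
```

  Forgetting the apply step in the explicit form leaves the live
  configuration untouched, which is the kind of silent failure described
  above.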

  So, you did get a layout *without* the placer trying to evenly
  spread the free space, so it mostly ended up in the top right
  corner.

  As for the "looping" bug, it's a misnomer (my bad). This is not
  a bug in the normal sense. It means that the router is repeatedly
  ripping up one segment so that the algorithm has reached a dead end.
  And I have to analyse the trace to see how to avoid getting in
  that state. 

  Now it's almost ok (fewer than 20 failed segments).
Comment 61 Jean-Paul Chaput 2020-08-11 23:25:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #58)
> (In reply to Jean-Paul.Chaput from comment #57)
> > (In reply to Luke Kenneth Casson Leighton from comment #55)

> what i would like, there, is wildcard matching e.g. starts with "oper_i" or
> ends with "_ok" and the pincount to be obtained from the object.
> 
> this will reduce 30-40 lines per block down to *five* and at the same time
> greatly increase clarity.

  I will keep that in mind and try an implementation when I find time.


> > I also removed the utils module, as now Coriolis should supply equivalent
> > features.
> 
> ah excellent, glad you liked it.
> 
> Config is neat, ehn? :)

  Yes. I did upgrade my Python knowledge here... Always the dilemma
  of whether taking time to properly learn or develop using what I
  already master. I suppose I'm still missing lots of features.


> the number of oper_* pins is far too large.  this is the "expansion" of the
> instruction for convenience.  an example is the 64 bit immediate.
> 
> basically i am going to have to do "subset instruction decoders" that are
> *inside* mul0, *inside* alu0 and so on.
> 
> i have to find time to do that.

  For a floorplanned approach, reducing the number of wires between
  blocks is very important. But it is almost an art...
Comment 62 Luke Kenneth Casson Leighton 2020-08-24 15:16:02 BST
hi jean-paul, we've still got a few routes not linked up.  there's actually
very few, it's quite amazing.  can you take a look?  i'm just re-running "make layout" (we're dropping the hierarchical effort for now), if i run into difficulties i'll let you know straight away

btw just so you're aware: the current interfaces are *wishbone* interfaces, they're in no way intended for public external exposure via pins.  not unless we absolutely have to bring them out, that is.

we still have to fit a set of GPIO and i am currently sorting out Litex SIM to be able to at least have something even if it is utterly basic.
Comment 63 Luke Kenneth Casson Leighton 2020-08-24 16:18:09 BST
only 6 routes not connected!  not half bad.

  o  Routing did not complete, unrouted segments:
      1| &<id:2093507 Horizontal subckt_1724_dec2.subckt_233_dec.subckt_368_dec31.abc_417294_new_n1323 METAL4 [6720l 13385l] [6780l 13385l] 2l rpD:3 -----CG-----------tb- [6717.5l:6782.5l] 65l 0-----t>
      2| &<id:2871318 Horizontal subckt_1722_core.subckt_3904_fus.subckt_7_shiftrot0.subckt_1061_alu_shift_rot0.subckt_0_pipe1_107.subckt_470_main_111.subckt_12_rotator.left_mask_mask(19) METAL2 [4455l 3100l] [4520l 3100l] 2l rpD:4 -----CG-----------tt- [4454l:4521l] 67l 0-----t>
      3| &<id:4453698 Horizontal subckt_1724_dec2.subckt_233_dec.subckt_368_dec31.dec_sub24_upd(1) METAL2 [6325l 12845l] [6385l 12845l] 2l rpD:2 -----CG-----w-----tt- [6324l:6386l] 62l 1----st>
      4| &<id:4489710 Horizontal subckt_1724_dec2.subckt_233_dec.subckt_368_dec31.dec_sub18_sgl_pipe METAL2 [6350l 13505l] [6445l 13505l] 2l rpD:6 -----CG------A----tt- [6349l:6446l] 97l 1----s->
      5| &<id:4514902 Horizontal subckt_1722_core.subckt_3904_fus.subckt_7_shiftrot0.subckt_1061_alu_shift_rot0.subckt_0_pipe1_107.subckt_470_main_111.subckt_12_rotator.left_mask_shift(3) METAL2 [4415l 3145l] [4490l 3145l] 2l rpD:6 -----CG------A----tt- [4414l:4491l] 77l 0----s->
      6| &<id:4529534 Horizontal subckt_1722_core.subckt_3904_fus.subckt_7_shiftrot0.subckt_1061_alu_shift_rot0.subckt_0_pipe1_107.subckt_470_main_111.subckt_12_rotator.left_mask_mask(31) METAL2 [4410l 3100l] [4515l 3100l] 2l rpD:4 -----CG------A----tt- [4409l:4516l] 107l 1----s->
Comment 64 Jean-Paul Chaput 2020-08-24 21:29:09 BST
(In reply to Luke Kenneth Casson Leighton from comment #62)
> hi jean-paul, we've still got a few routes not linked up.  there's actually
> very few, it's quite amazing.  can you take a look?  i'm just re-running
> "make layout" (we're dropping the hierarchical effort for now), if i run
> into difficulties i'll let you know straight away

  I'm rebuilding it on my side as I write this... I will look for
  the fine tuning.

> btw just so you're aware: the current interfaces are *wishbone* interfaces,
> they're in no way intended for public external exposure via pins.  not
> unless we absolutely have to bring them out, that is.

  OK. Just to be clear, you mean that:
  
  1. They are extraneous and we can avoid putting them out.

  *or*

  2. It's the interface *for now* and it will change in the future to
     some normalized standard.

  If it's the second case, to have a better approximation, we must keep
  the wishbone interface exported for now.


> we still have to fit a set of GPIO and i am currently sorting out Litex SIM
> to be able to at least have something even if it is utterly basic.

  OK.

  Some update from my side, I'm implementing a high fanout net synthesis
  algorithm which is roughly placement driven. As I'm experimenting
  various ways as I go, I do not commit because of very ephemeral
  stages.
    I'm also working with Staf on the TSMC 180nm port to Coriolis.
    And, I'm (again) on vacation until Sunday, 6 September, so the
  progress of the work will depend on the (bad) weather...
    By the way, the port 922 is filtered by the ISP of my current
  accommodation so I get them through my home box (after transforming
  the commits into patches). Writing this makes me think I should
  have just made an ssh tunnel. Stupid of me.
Comment 65 Luke Kenneth Casson Leighton 2020-08-24 22:18:47 BST
(In reply to Jean-Paul.Chaput from comment #64)

> 
>   2. It's the interface *for now* and it will change in the future to
>      some normalized standard.

the wishbone interfaces will be connected internally to a bus.  on that bus will be peripherals.  the peripherals will be connected to IO pads.

the wishbone bus itself will *not* be exposed via IO pads.

>   If it's the second case, to have a better approximation, we must keep
>   the wishbone interface exported for now.
> 
> 
> > we still have to fit a set of GPIO and i am currently sorting out Litex SIM
> > to be able to at least have something even if it is utterly basic.
> 
>   OK.
> 
>   Some update from my side, I'm implementing a high fanout net synthesis
>   algorithm which is roughly placement driven.

nice.

> As I'm experimenting
>   various ways as I go, I do not commit because of very ephemeral
>   stages.
>     I'm also working with Staf on the TSMC 180nm port to Coriolis.

ah good.

>     And, I'm (again) on vacation until Sunday, 6 September, so the
>   progress of the work will depend on the (bad) weather...

not enough sun last time? :)

>     By the way, the port 922 is filtered by the ISP of my current
>   accommodation so I get them through my home box (after transforming
>   the commits into patches). Writing this makes me think I should
>   have just made an ssh tunnel. Stupid of me.

doh.

do you have a friend or colleague who can make one?

or, you know, there is only one file changed, test_issuer.il it can be got by the git.libre-soc.org website.
Comment 66 Jean-Paul Chaput 2020-09-15 22:42:34 BST
I did implement it in alliance-check-toolkit commit 76c4f45 and modified
experiment9 accordingly in commit c362610. I stick to the list approach.

It seems complicated to me to guess that list automatically. I think it
should be done at nMigen level, as only the designer knows where to stop.
For example, currently we do not flatten the same level of hierarchy all
across the design.

(In reply to Jean-Paul.Chaput from comment #42)
> (In reply to Luke Kenneth Casson Leighton from comment #41)
> > jean-paul i just checked something to be possible in yosys: to be able
> > to flatten individual modules rather than all of it (top).
> > 
> > this works fine.
> > 
> > so, to support this: if the YOSYS_FLATTEN can take, instead of a "yes/no"
> > (might need a new Makefile parameter, YOSYS_FLATTEN_LIST), the following:
> > 
> >           YOSYS_FLATTEN_LIST=`cat to_flatten.txt`
> > 
> > and "to_flatten.txt" to contain at least:
> > 
> > fast
> > cr
> > xer
> > slow
> > int
> > pdecode2
> > alu0
> > branch0
> > cr0
> > trap0
> > ldst0
> > 
> > and probably many more (basically the list of everything for which a
> > top-level
> > block is to be written) this will get rid of many of the problems of
> > "dangling nets" without having to have a full flatten.
> > 
> > btw one other way is for that YOSYS_FLATTEN_LIST to be the output from
> > a python script that actually notices what's been declared as being
> > sub-cells (top level hierarchy) rather than have a separately-maintained
> > file that could get out of sync.
> 
> Good catch. I will integrate that, and try to implement a reliable way
> of generating the list.
> 
> Still working through the placement of the issuer. The placement of the
> IO pins of each block is a lengthy and tedious work... Have to solve
> shortage of length along some sides. Hope to have something tomorrow.
Comment 67 Luke Kenneth Casson Leighton 2020-09-15 22:52:33 BST
(In reply to Jean-Paul.Chaput from comment #66)
> I did implement it in alliance-check-toolkit commit 76c4f45 and modified
> experiment9 accordingly in commit c362610. I stick to the list approach.

star.  i will recompile and see how it goes.

> 
> It seems complicated to me to guess that list automatically. I think it
> should be done at nMigen level, as only the designer knows where to stop.

ah sadly, nmigen itself is quite "dumb".  its job is "take AST, turn it into
yosys ilang file".

that's *literally* it (!).

now... what i _could_ do is, from the original python class hierarchy, get it to auto-generate a yosys script, that would work, hmmm.

then after the ilang file is generated, run a command to create the yosys
script then post-process it.
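
a rough sketch of the "generate the list from the hierarchy" idea, in plain python. the nested dict stands in for whatever the class hierarchy walker would produce; its shape, the function name and the stop set are all assumptions for illustration (no nmigen or yosys API is used here):

```python
def collect_flatten_names(tree, stop_at):
    """Walk a nested {name: subtree} dict, depth first.

    Names found in stop_at are reported and not descended into, because
    everything below them is meant to be flattened into that block.
    """
    found = []
    for name, subtree in tree.items():
        if name in stop_at:
            found.append(name)
        else:
            found.extend(collect_flatten_names(subtree, stop_at))
    return found


# Hypothetical, heavily cut-down hierarchy.
hierarchy = {
    "core": {
        "fus": {"alu0": {"pipe1": {}}, "branch0": {}, "ldst0": {}},
        "int": {},
        "fast": {},
        "pdecode2": {"dec": {}},
    },
}
stop_at = {"alu0", "branch0", "ldst0", "int", "fast", "pdecode2"}

names = collect_flatten_names(hierarchy, stop_at)
print("\n".join(names))   # one name per line, same layout as to_flatten.txt
```

the designer still decides where to stop (the stop_at set); the walker only makes sure that the names in the file actually exist in the design, so the list cannot drift out of sync.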
Comment 68 Jean-Paul Chaput 2020-09-15 23:30:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #67)
> (In reply to Jean-Paul.Chaput from comment #66)
> > I did implement it in alliance-check-toolkit commit 76c4f45 and modified
> > experiment9 accordingly in commit c362610. I stick to the list approach.
> 
> star.  i will recompile and see how it goes.
> 
> > 
> > It seems complicated to me to guess that list automatically. I think it
> > should be done at nMigen level, as only the designer knows where to stop.
> 
> ah sadly, nmigen itself is quite "dumb".  its job is "take AST, turn it into
> yosys ilang file".
> 
> that's *literally* it (!).
> 
> now... what i _could_ do is, from the original python class hierarchy, get
> it to auto-generate a yosys script, that would work, hmmm.

  I'm not familiar with nMigen, but can't you put print statements
  when instantiating a model/class just to fill the flatten file?

  By the way, the loop is still there (will investigate after the
  buffering is ok). You may notice that a *lot* of buffers are used.
  It is a general trend as wires become longer and longer in SoCs.
  And I still need to also buffer long wires. So the result is
  that more segments end up unrouted (about 50).
Comment 69 Luke Kenneth Casson Leighton 2020-09-16 00:31:04 BST
(In reply to Jean-Paul.Chaput from comment #68)
> (In reply to Luke Kenneth Casson Leighton from comment #67)
> > (In reply to Jean-Paul.Chaput from comment #66)
> > > I did implement it in alliance-check-toolkit commit 76c4f45 and modified
> > > experiment9 accordingly in commit c362610. I stick to the list approach.
> > 
> > star.  i will recompile and see how it goes.
> > 
> > > 
> > > It seems complicated to me to guess that list automatically. I think it
> > > should be done at nMigen level, as only the designer knows where to stop.
> > 
> > ah sadly, nmigen itself is quite "dumb".  its job is "take AST, turn it into
> > yosys ilang file".
> > 
> > that's *literally* it (!).
> > 
> > now... what i _could_ do is, from the original python class hierarchy, get
> > it to auto-generate a yosys script, that would work, hmmm.
> 
>   I'm not familiar with nMigen, but can't you put print statements
>   when instantiating a model/class just to fill the flatten file?

not at all!  it is literally, "call this function, it outputs a yosys ilang file"

the order in which the modules are added is particularly involved.

i may need to create an Abstract Syntax Tree walker to get the names.

it should not be too hard.



> 
>   By the way, the loop is still there (will investigate after the
>   buffering is ok). You may notice that a *lot* of buffers are used.
>   It is a general trend as wires become longer and longer in SoCs.
>   And I still need to also buffer long wires. So the result is
>   that more segments end up unrouted (about 50).

ahh i wondered about buffering.

i apologise: because this design is intended to be multi issue out-of-order, there are as you have seen *10* different execution engines, and *17* integer register file fan-out for reading!

also when we have the Dependency Matrices, although only 1 bit will be driven, there could be up to 40 DFFs (or SR NAND latches) on a single row.

only one of those will change HI/LO on any one clock cycle but it is still a lot of things to drive from one source.

just so you know in advance and are not so freaked out :)
Comment 70 Luke Kenneth Casson Leighton 2020-09-19 19:17:50 BST
i added an experimental option to yosys to disable "memory_map" and it
resulted in this:

  File "/home/lkcl/alliance-check-toolkit/bin/blif2vst.py", line 67, in <module>
    cell = CRL.Blif.load( options.cellName )
hurricane.HurricaneError: [ERROR] No .model or cell named <$mem> has been found.

which i am sure is a "good thing" really.  of course... now that model
for <$mem> is actually needed, how can it be created? what do they look like?
Comment 71 Luke Kenneth Casson Leighton 2020-09-19 23:45:12 BST
     - <Box 0l 0l 18300l 18300l>
     - GCell grid: [366x366]
  o  Converting <ls180> into Coloquinte.

finally corrected enough errors to have the layout - including
litex peripherals - start compiling.

i cut out the litex BIOS as a ROM and made the SRAM only 512 bytes.
this is enough to not trigger yosys to go "mental" with the whole
SRAM-is-actually-DFFs thing.
Comment 72 Luke Kenneth Casson Leighton 2020-09-21 11:08:52 BST
Created attachment 106 [details]
zoom-in on dense areas of routing

jean-paul the routing density of yellow (and green, and..) is spectacularly
high in some areas.  is this of concern?  this is with the new "routing"
algorithm.
Comment 73 Jean-Paul Chaput 2020-09-22 20:55:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #72)
> Created attachment 106 [details]
> zoom-in on dense areas of routing
> 
> jean-paul the routing density of yellow (and green, and..) is spectacularly
> high in some areas.  is this of concern?  this is with the new "routing"
> algorithm.

  If by the new routing algorithm, you mean the buffer insertion,
  in fact, it should slightly decrease the density (the inserted
  buffers increase the area).
    As long as the routing completes, this kind of density is not
  a problem. It reflects the highly connected nature of the netlist
  and the effectiveness of the routing.
Comment 74 Luke Kenneth Casson Leighton 2020-09-22 21:05:09 BST
(In reply to Jean-Paul.Chaput from comment #73)

>   If by the new routing algorithm, you mean the buffer insertion,

yes, what you added recently

>   in fact, it should slightly decrease the density (the inserted
>   buffers increase the area).

interesting.

>     As long as the routing completes, this kind of density is not
>   a problem.

ahh ok.

> It reflects the highly connected nature of the netlist
>   and the effectiveness of the routing.

well, we are still around 17 incomplete routes (out of 100,000+)
err no i'm out by an order of magnitude on that

     - Track Segment Completion Ratio ........................ 100% [1020210+18]
     - Wire Length Completion Ratio ....................... 100% [61786063+1360]
     - Wire Length Expand Ratio ........................... 4.07% [min:59367975]
     - Unrouted horizontals ........................................ 88.89% [16]
     - Unrouted verticals ........................................... 11.11% [2]
     - Done in ............................................... 7m 4.86s, 846.8Mb

one *million* tracks!
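
a quick cross-check of the summary figures, as a sketch. it assumes the bracketed pairs mean [routed+unrouted] and that the expand ratio compares routed wire length against the minimum length; both readings are my guess at the report format, not something taken from the router source:

```python
# Numbers copied from the router summary above.
routed_segments, unrouted_segments = 1020210, 18
routed_length, unrouted_length = 61786063, 1360
min_length = 59367975
unrouted_h, unrouted_v = 16, 2


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


segment_completion = percent(routed_segments, routed_segments + unrouted_segments)
length_completion = percent(routed_length, routed_length + unrouted_length)
expand_ratio = percent(routed_length - min_length, min_length)
share_h = percent(unrouted_h, unrouted_h + unrouted_v)
share_v = percent(unrouted_v, unrouted_h + unrouted_v)

print("segment completion: %.4f%%" % segment_completion)
print("length completion:  %.4f%%" % length_completion)
print("expand ratio:       %.2f%%" % expand_ratio)
print("unrouted H/V share: %.2f%% / %.2f%%" % (share_h, share_v))
```

so the "100%" completion lines are rounded: 18 segments out of just over a million are still missing, and the 88.89% / 11.11% lines are simply 16 and 2 out of those 18.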
Comment 75 Jean-Paul Chaput 2020-09-22 21:56:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #74)
> (In reply to Jean-Paul.Chaput from comment #73)
> >     As long as the routing completes, this kind of density is not
> >   a problem.
> 
> ahh ok.

  The idea is to achieve the highest density possible *without*
  overflowing. This is a tricky objective...
 
> well, we are still around 17 incomplete routes (out of 100,000+)
> err no i'm out by an order of magnitude on that

  To be accurate, it is not "routes" but "segments" :-)
  After I finish the buffering algorithm, I will go back to analyse that.
  I mostly need to refine the computation of the provisional density
  of a GCell to correctly guide the global routing.
  I got confirmation that in recent designs (28nm and below)
  you have between 50% and 80% of buffers!

>      - Track Segment Completion Ratio ........................ 100% [1020210+18]
>      - Wire Length Completion Ratio ....................... 100% [61786063+1360]
>      - Wire Length Expand Ratio ........................... 4.07% [min:59367975]
>      - Unrouted horizontals ........................................ 88.89% [16]
>      - Unrouted verticals ........................................... 11.11% [2]
>      - Done in ............................................... 7m 4.86s, 846.8Mb
> 
> one *million* tracks!

  Yes, that's starting to be a good test bench for the P&R !
  (remember, segments, not tracks ;-)

  If you want a hint about the inner workings of the detailed
  router I can send you a chapter I wrote for one of my students.
Comment 76 Luke Kenneth Casson Leighton 2020-09-26 00:52:06 BST
jean-paul, staf may have been in touch with you already: i have the
IO connections done, now, with signals in the form:

gpio_i[16]
gpio_o[16]
gpio_oe[16]

which is the "standard" way of declaring IO connections where the
direction of the IOpad needs to be controlled by the ASIC.

i took a look at pxlib and these look like they match with piot_px.
but... there's nothing in ioring.py declaration about the direction,
despite e.g. the AM2901 example clearly having bi-directional pads.

how does that work?  i've located the AM2901 example which i can see
does have "inout" ports, however it's not obvious to me how the
"q3_from_pads" and "q3_to_pads" get turned into... you with me?


ah!

cumulus plugin IoPadConf.

entries in the chip dictionary, "pads.instances".

so anything in that entry will be explicitly declared (bi-directional)
and we can set the pin name ("GPIOA0") and set the in, out and oe.

however if we _don't_ set an entry in pads.instances, cumulus plugin
will "auto-detect" the pad type based on the name and so on.


Staf if you make something like pxlib, we can set the cell library
name using a config option in ioring.py

we have one example here - line 6 - where the pad cell library
is set to "pxlib":

https://git.libre-soc.org/?p=soclayout.git;a=blob;f=experiments4/coriolis2/ioring.py;hb=HEAD


but, Jean-Paul: Staf would like to be able to allow people to set the
drive strength.  this would be something that should go into additional
options in IoPadConf in the cumulus plugin.
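
As a rough illustration of the grouping described above, the sketch below collects `gpio_i[n]`, `gpio_o[n]` and `gpio_oe[n]` style signals into one bi-directional pad entry per bit. The `group_bidir_signals` helper, the `GPIOA<n>` pad naming and the dictionary layout are all hypothetical: this is not the format that the cumulus IoPadConf plugin expects in "pads.instances", only a way to picture the in/out/oe triplets.

```python
import re
from collections import defaultdict

# Matches names such as "gpio_oe[3]": prefix, direction suffix, bit index.
SIGNAL_RE = re.compile(r"^(?P<prefix>\w+?)_(?P<role>i|o|oe)\[(?P<index>\d+)\]$")

def group_bidir_signals(names, pad_prefix="GPIOA"):
    """Group *_i/_o/_oe signals into one (hypothetical) pad entry per bit."""
    pads = defaultdict(dict)
    for name in names:
        match = SIGNAL_RE.match(name)
        if match is None:
            continue  # not a bi-directional pad signal, skip it here
        index = int(match.group("index"))
        pads[index][match.group("role")] = name
    entries = []
    for index in sorted(pads):
        roles = pads[index]
        if set(roles) != {"i", "o", "oe"}:
            raise ValueError("incomplete pad signal set for bit %d" % index)
        entries.append({"pad": "%s%d" % (pad_prefix, index),
                        "in": roles["i"],
                        "out": roles["o"],
                        "oe": roles["oe"]})
    return entries

signals = ["gpio_i[0]", "gpio_o[0]", "gpio_oe[0]",
           "gpio_i[1]", "gpio_o[1]", "gpio_oe[1]", "sys_clk"]
entries = group_bidir_signals(signals)
print(entries)
```

Signals that do not match the pattern (such as `sys_clk` here) are simply skipped, mirroring the idea that anything not explicitly declared is left to auto-detection.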
Comment 77 Jean-Paul Chaput 2020-09-30 14:52:22 BST
Hello Luke,

I'm finally starting to review your work, but I seem to be lacking
the pinparse.py module (I just made a git pull). Did I miss something
or is it really missing?
Comment 78 Luke Kenneth Casson Leighton 2020-09-30 15:15:46 BST
(In reply to Jean-Paul.Chaput from comment #77)
> Hello Luke,
> 
> I'm finally starting to review your work, but I seems to be lacking
> the pinparse.py module (I just made a git pull). Did I miss something
> or is it really missing?

"make pinmux" after "git submodule --init --remote".
Comment 79 Jean-Paul Chaput 2020-09-30 15:36:42 BST
(In reply to Luke Kenneth Casson Leighton from comment #78)
> (In reply to Jean-Paul.Chaput from comment #77)
> > Hello Luke,
> > 
> > I'm finally starting to review your work, but I seems to be lacking
> > the pinparse.py module (I just made a git pull). Did I miss something
> > or is it really missing?
> 
> "make pinmux" after "git submodule --init --remote".

Just to troll a little bit, the command was:

    git submodule update --init --remote

And the URL needed to be changed into:

    ssh://gitolite3@libre-riscv.org:922/pinmux.git

As port 22 is closed to me ;-) . I pushed the commit.
Comment 80 Luke Kenneth Casson Leighton 2020-09-30 16:38:04 BST
(In reply to Jean-Paul.Chaput from comment #79)

> Just to troll a little bit, the command was:
> 
>     git submodule update --init --remote

ah thank you.  i added a build.sh so as not to lose info

> And the URL needed to be changed into:
> 
>     ssh://gitolite3@libre-riscv.org:922/pinmux.git
> 
> As port 22 is closed to me ;-) . I pushed the commit.

sorry, i have an entry in .ssh/config to cover that :)
Comment 81 Jean-Paul Chaput 2020-09-30 16:42:04 BST
I did take a quick look at your work. And, of course, I have a lot of comments...

* First, as I'm currently re-implementing the "block" plugin for HFNS
  support and all other features, it does not support (yet) the full
  chip (with chip I/O cells).

* So, to check the full chip, you have to use the old plugin.

* That old plugin (and the future one for that matter) requires a specific
  top hierarchy:

     CHIP
       |
       +----> I/O Pad
       |
       +----> I/O Pad
       |
       +----> I/O Pad
       |
       \----> CORONA
                |
                \----> CORE (aka test_issuer + jtag).

  The chip level contains the I/O pads and the CORONA.
  The CORONA contains *one* instance of the CORE.
  The CORE contains the flat design.

  I did notice that there are cells added obviously coming from the JTAG,
  they must be put below the CORE level.

* Why that intermediary CORONA level? Its purpose is to isolate the
  CORE, which may be in symbolic layout, from the I/O pads, which are
  usually supplied by the foundry and are real layout.

* I have a plugin that automatically generates the CHIP+CORONA levels
  from the CORE, as long as the I/O pads can be deduced from the
  nets.
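
The required hierarchy can be written down as a small tree and checked mechanically. The `Cell` class and `check_top_hierarchy` function below are hypothetical illustrations, not part of the Coriolis API; they only encode the three rules stated above (pads and one corona at chip level, one core inside the corona, everything else below the core).

```python
class Cell:
    """Minimal stand-in for a netlist cell: a kind, a name and children."""
    def __init__(self, kind, name, children=()):
        self.kind = kind          # "chip", "pad", "corona", "core" or other
        self.name = name
        self.children = list(children)

def check_top_hierarchy(chip):
    """Return a list of violations of the CHIP -> CORONA -> CORE structure."""
    if chip.kind != "chip":
        return ["top cell must be the chip"]
    errors = []
    coronas = [c for c in chip.children if c.kind == "corona"]
    others = [c for c in chip.children if c.kind not in ("corona", "pad")]
    if len(coronas) != 1:
        errors.append("chip must contain exactly one corona")
    for cell in others:
        # Anything that is neither a pad nor the corona belongs in the core.
        errors.append("%s must be moved below the core" % cell.name)
    for corona in coronas:
        cores = [c for c in corona.children if c.kind == "core"]
        if len(cores) != 1:
            errors.append("corona must contain exactly one core instance")
    return errors

core = Cell("core", "core")
corona = Cell("corona", "corona", [core])
pads = [Cell("pad", "pad_0"), Cell("pad", "pad_1")]
good_chip = Cell("chip", "chip", pads + [corona])
bad_chip = Cell("chip", "chip", pads + [corona, Cell("gate", "jtag_mux")])

print(check_top_hierarchy(good_chip))
print(check_top_hierarchy(bad_chip))
```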
Comment 82 Luke Kenneth Casson Leighton 2020-09-30 17:07:39 BST
(In reply to Jean-Paul.Chaput from comment #81)
> I did take a quick look at your work. And, of course, I have lot of
> comments...

hooray!

>   I did notice that there are cells added obviously coming from the JTAG,
>   they must be put below the CORE level.

you mean like this?

      CHIP
        |
        +----> I/O Pad
        |
        +----> I/O Pad
        |
        +----> I/O Pad
        |
        \----> CORONA
                 |
                 +----> CORE (aka test_issuer).
                 |      ^
                 |      |
                 |      |
                 \----> jtag
 
actually it is more like... ahhh... i will have to draw it, it is
to do with how Staf has arranged the IO Pad JTAG testing.

give me a few minutes.

> * Why that intermediary CORONA level ? It's purpose is to isolate the
>   CORE which may be in symbolic layout from the I/O pad which usually
>   are supplied by the foundry and are real layout.
> 
> * I have a plugin that automatically generate the CHIP+CORONA level
>   from the CORE. As long as the I/O pads can be deduced from the
>   nets.

yes, i believe i have semi-worked-out how that works.  copied from
the AM2901 example, which i initially copied for experiments4
Comment 83 Jean-Paul Chaput 2020-09-30 17:35:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #82)
> (In reply to Jean-Paul.Chaput from comment #81)
> > I did take a quick look at your work. And, of course, I have lot of
> > comments... 
> you mean like this?
> 
>       CHIP
>         |
>         +----> I/O Pad
>         |
>         +----> I/O Pad
>         |
>         +----> I/O Pad
>         |
>         \----> CORONA
>                  |
>                  +----> CORE (aka test_issuer).
>                  |      ^
>                  |      |
>                  |      |
>                  \----> jtag
>  
> actually it is more like... ahhh... i will have to draw it, it is
> to do with how Staf has arranged the IO Pad JTAG testing.

  I may have been too fast in my reading. You want to use ls180
  as the core. In that case, it is correct. Sorry.
Comment 84 Luke Kenneth Casson Leighton 2020-09-30 17:43:15 BST
Created attachment 108
how jtag is organised

ok so here is how it is organised.

* the top level "core" is now called "ls180" where "test_issuer" contains
  the PowerISA.

* ls180 contains:

  - test_issuer
  - JTAG
  - Litex "stuff" mostly wishbone bus management and arbiters
  - peripherals

* JTAG has three "things" it controls:

  - IOPad testing
  - DMI
  - Wishbone
  - (we may add PLL "counter" here although it will probably be part of Core)

* Litex provides peripherals however the IO for UART and GPIO need to
  be routed *through JTAG*

* Therefore, there are some wires for UART distinguished by prefix
  "_pads_" and "_core_" such as "uart_pads_tx" and "uart_core_rx" which
  go **IN** to ls180 then come **OUT** of ls180

* these double-routed wires go SPECIFICALLY into the JTAG "IOPad testing"
  side.

  this is so that the JTAG core can "re-route" the IO Pads, taking them
  AWAY from ls180 and SPECIFICALLY connecting them DIRECTLY to the
  (real) actual IOPads through the Corona.

  this is done with MUXes, and they are, strictly speaking, not part
  of the "core".

  although, if they are not, it would be a royal nuisance to have to
  split them out into a separate file.
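
The re-routing described above amounts to a 2-to-1 multiplexer per pad signal. The sketch below is a behavioural model only: the function names, the `test_mode` flag and the signal names are hypothetical and do not reflect how the JTAG code is actually written.

```python
def pad_mux(core_value, jtag_value, test_mode):
    """Select what drives the pad: the core normally, JTAG in test mode."""
    return jtag_value if test_mode else core_value

def drive_pads(core_outputs, jtag_outputs, test_mode):
    """Apply the same mux selection to every named output signal."""
    return {name: pad_mux(core_outputs[name], jtag_outputs[name], test_mode)
            for name in core_outputs}

# Hypothetical values driven by the core and by the JTAG boundary register.
core_outputs = {"uart_tx": 1, "gpio_o[0]": 0}
jtag_outputs = {"uart_tx": 0, "gpio_o[0]": 1}

normal = drive_pads(core_outputs, jtag_outputs, test_mode=False)
testing = drive_pads(core_outputs, jtag_outputs, test_mode=True)
print(normal, testing)
```

In normal mode the pads see exactly what the core drives; in test mode the core side is disconnected and the JTAG side takes over, which is why the signals have to enter and leave the block twice.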


i *think*.... the main thing is that the JTAG clock should not be
associated with (connected to or controlled by) the pxlib "ck" lines
on the IO pads:


  # this is good

  p_sys_clk_0 : pck_px
  port map ( pad  => sys_clk
           , ck   => cki
           , vdde => vdde
           , vddi => vddi
           , vsse => vsse
           , vssi => vssi
           );


  # this is bad

  p_jtag_tck : pi_px       <- nooo! must be of type pck_px!
  port map ( ck   => cki       <- baaaad, noooo!
           , pad  => jtag_tck 
           , t    => jtag_tck_core
           , vdde => vdde
           , vddi => vddi
           , vsse => vsse
           , vssi => vssi
           );


therefore if similar to this:

   env.setCLOCK("sys_clk")

we could have:

   env.setOUTSIDEOFCORONACLOCK("jtag_tck")

and it to be treated completely differently, then that would i think work.

we also would need some of the PLL clock wires similarly.  we cannot
have them under the control of sys_clk.

is this unreasonably complex, to work out which clock is associated
with what, without splitting them into separate files?
Comment 85 Jacob Lifshay 2020-09-30 17:45:29 BST
(In reply to Luke Kenneth Casson Leighton from comment #80)
> (In reply to Jean-Paul.Chaput from comment #79)
> 
> > Just to troll a little bit, the command was:
> > 
> >     git submodule update --init --remote
> 
> ah thank you.  i added a build.sh so as not to lose info
> 
> > And the URL needed to be changed into:
> > 
> >     ssh://gitolite3@libre-riscv.org:922/pinmux.git
> > 
> > As port 22 is closed to me ;-) . I pushed the commit.
> 
> sorry, i have an entry in .ssh/config to cover that :)

It should be https://git.libre-soc.org/git/pinmux.git for the submodule, since the url *should* be useable by the general public, not just by those who have ssh access.
Comment 86 Luke Kenneth Casson Leighton 2020-09-30 17:47:27 BST
(In reply to Jean-Paul.Chaput from comment #83)

> >                  +----> CORE (aka test_issuer).

>   I may have been too fast in my reading. You want to use ls180
>   as the core. In that case, it is correct. Sorry.

yes, sorry, you have been focussing, i left it until you are out of
"big development mode".  ls180 is the name of the cell which contains
Litex peripherals, JTAG, and also test_issuer.

(btw latest commits on coriolis2, track completion ratio 100%!  w00t!
now i will try putting the full core back in, see what happens.
will have to expand the ioring to match)
Comment 87 Luke Kenneth Casson Leighton 2020-09-30 17:49:27 BST
(In reply to Jacob Lifshay from comment #85)

> It should be https://git.libre-soc.org/git/pinmux.git for the submodule,
> since the url *should* be useable by the general public, not just by those
> who have ssh access.

wark-wark, good catch jacob, sorted.
Comment 88 Staf Verhaegen 2020-09-30 18:20:39 BST
> * Litex provides peripherals however the IO for UART and GPIO need to
>   be routed *through JTAG*

Actually all IO should be routed through JTAG, thus also SDRAM, PWM, SPI, ...
This allows standardized PCB testing without the tester needing to write power programs etc.
Comment 89 Luke Kenneth Casson Leighton 2020-09-30 19:07:09 BST
(In reply to Staf Verhaegen from comment #88)
> > * Litex provides peripherals however the IO for UART and GPIO need to
> >   be routed *through JTAG*
> 
> Actually all IO should be routed through JTAG, thus also SDRAM, PWM, SPI, ...
> This allows standardized PCB testing without tester need to write power
> programs etc.

ngggghargh ok :)  i was hoping to get away with just GPIO and UART, to
at least "prove the concept" of the IOpad cells.  i picked GPIO and UART
because UART has one In-only and one Out-only IOpad, and GPIO is In-and-Out.

i set the precedent (parameterised functions) which make this possible,
so i should be able to add the others in 1-2 days.
Comment 90 Luke Kenneth Casson Leighton 2020-09-30 19:28:49 BST
https://libre-soc.org/180nm_Oct2020/2020-09-30_19-13.png

hmmm, jean-paul: some of the pins are coming in from almost 100% the
opposite side.  it seems that there is no... "weighted influence" on
where the cells associated with the I/O should be placed.

could this be solved algorithmically (with a "this I/O pad pin please
give it 5% weighting to put its cells closer to the pin" style algorithm)

or

should i simply go, "ok so we defined the IO pads, great, let's create
a Cell for ls180 with the I/O defined to be in specific places".

maybe even create a barrier ring just like in experiments7?
Comment 91 Luke Kenneth Casson Leighton 2020-09-30 19:34:10 BST
the P&R completed for the whole chip, including the PowerISA core this time.
it is 4x bigger @ 26,000 x 26,000 lambda

nohup.out is here: https://ftp.libre-soc.org/nohup.out.bz2

3 horiz not routed, 2 vertical, but 7,000 unrouted segments.

could the introduction of a clock tree (USE_CLOCKTREE=yes in Makefile)
have an effect?
Comment 92 Jean-Paul Chaput 2020-09-30 20:37:53 BST
(In reply to Luke Kenneth Casson Leighton from comment #90)
> https://libre-soc.org/180nm_Oct2020/2020-09-30_19-13.png
> 
> hmmm, jean-paul: some of the pins are coming in from almost 100% the
> opposite side.  it seems that there is no... "weighted influence" on
> where the cells associated with the I/O should be placed.
> 
> could this be solved algorithmically (with a "this I/O pad pin please
> give it 5% weighting to put its cells closer to the pin" style algorithm)

  I just completed the whole chip P&R, using the bba238c commit.
  Everything seems to have gone fine. I/O pads are more or less well
  placed (no more than the length of the side). This is already done:
  I/O pads should attract the cells they are connected to, but it is
  a weak influence compared to their connections inside the chip.
  However, I only get 35K gates for the core, is this normal?
Comment 93 Luke Kenneth Casson Leighton 2020-09-30 20:48:46 BST
(In reply to Jean-Paul.Chaput from comment #92)
> (In reply to Luke Kenneth Casson Leighton from comment #90)
> > https://libre-soc.org/180nm_Oct2020/2020-09-30_19-13.png
> > 
> > hmmm, jean-paul: some of the pins are coming in from almost 100% the
> > opposite side.  it seems that there is no... "weighted influence" on
> > where the cells associated with the I/O should be placed.
> > 
> > could this be solved algorithmically (with a "this I/O pad pin please
> > give it 5% weighting to put its cells closer to the pin" style algorithm)
> 
>   I just completed the whole chip P&R, I used the bba238c commit.
>   Everything seems to have gone fine. 

this is the "test" which takes a lot shorter time, core module is cut out.

> I/O pads are more or less well
>   placed (no more than the length of the side). It is already done,
>   I/O pad should attract the cells they are connected to. But it is
>   a weak influence compared to their connexion inside the chip.
>   But I only get 35K gates for the core, is this normal ?

no, need to copy non_generated/full_core_ls180.il to ls180.il and also edit ioring.py to change to larger size (see comments)

this is what has 7000 unrouted segments.

i have a run going increased chip.core by 1000 lambda
Comment 94 Jean-Paul Chaput 2020-09-30 21:05:14 BST
(In reply to Luke Kenneth Casson Leighton from comment #91)
> the P&R completed for the whole chip, including the PowerISA core this time.
> it is 4x bigger @ 26,000 x 26,000 lambda
> 
> nohup.out is here: https://ftp.libre-soc.org/nohup.out.bz2
> 
> 3 horiz not routed, 2 vertical, but 7,000 unrouted segments.

  No no. 7 unrouted segments (4 Verticals, 3 Horizontals),
  7075 is the cumulated length of those segments, that is
  7075 lambdas.

> could the introduction of a clock tree (USE_CLOCKTREE=yes in Makefile)
> have an effect?

  Yes... The H-Tree consumes more routing resources than a "default"
  routing. Nevertheless, it is not that saturated, as shown by the
  global routing, which converges in only 2 iterations.
    That will be fixed after the HFNS, because the HFNS will also
  increase the problem.
Comment 95 Luke Kenneth Casson Leighton 2020-09-30 21:44:07 BST
(In reply to Jean-Paul.Chaput from comment #94)

>   No no. 7 unrouted segments (4 Verticals, 3 Horizontals),
>   7075 is the cumulated length of those segments, that is
>   7075 lambdas.

ahh :)

> > could the introduction of a clock tree (USE_CLOCKTREE=yes in Makefile)
> > have an effect?
> 
>   Yes... The H-Tree consume more routing resources than a "default"
>   routing. Nevertheless, it is not that much saturated as show the
>   global routing which converge in 2 iterations only.

ok.

>     That will be fixed after the HFNS, because the HFNS will also
>   increase the problem.

understood.

in the meantime increased chip.core to 27000x27000 there is only 1 segment unrouted.  trying 27500x27500 now.
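
For a sense of scale, the core size changes above are small in area terms. The arithmetic below uses only the side lengths quoted in this thread and assumes a square core; the helper name is hypothetical.

```python
def area_growth_percent(old_side, new_side):
    """Percentage increase in area when a square's side grows."""
    old_area = old_side ** 2
    new_area = new_side ** 2
    return round(100.0 * (new_area - old_area) / old_area, 2)

# Sides in lambda, as quoted in the comments above.
growth_27000 = area_growth_percent(26000, 27000)
growth_27500 = area_growth_percent(26000, 27500)
print(growth_27000, growth_27500)
```

So going from 26000 to 27000 lambda per side costs a little under 8% more area, and 27500 costs roughly 12%.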
Comment 96 Jean-Paul Chaput 2020-10-01 10:38:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #84)
> Created attachment 108 [details]
> how jtag is organised
> 
> ok so here is how it is organised.
> 
> * the top level "core" is now called "ls180" where "test_issuer" contains
>   the PowerISA.
> 
> * ls180 contains:
> 
>   - test_issuer
>   - JTAG
>   - Litex "stuff" mostly wishbone bus management and arbiters
>   - peripherals
> 
> * JTAG has three "things" it controls:
> 
>   - IOPad testing
>   - DMI
>   - Wishbone
>   - (we may add PLL "counter" here although it will probably be part of Core)
> 
> * Litex provides peripherals however the IO for UART and GPIO need to
>   be routed *through JTAG*
> 
> * Therefore, there are some wires for UART distinguished by prefix
>   "_pads_" and "_core_" such as "uart_pads_tx" and "uart_core_rx" which
>   go **IN** to ls180 then come **OUT** of ls180
> 
> * these double-routed wires go SPECIFICALLY into the JTAG "IOPad testing"
>   side.
> 
>   this is so that the JTAG core can "re-route" the IO Pads, taking them
>   AWAY from ls180 and SPECIFICALLY connecting them DIRECTLY to the
>   (real) actual IOPads through the Corona.
> 
>   this is done with MUXes, and they are, strictly speaking, not part
>   of the "core".
> 
>   although, if they are not, it would be a royal nuisance to have to
>   split them out into a separate file.

    Not sure I correctly see how the wiring is organized (I'm a visual
    guy, I understand more quickly schematics). Nevertheless, the
    strategy here is:

    * Put everything that contains gates under ls180. It will be flat
      placed & routed. So *below* ls180, you can have whatever
      hierarchy you like.

    * Complex wiring between pads and core (ls180) will take place
      inside the "corona" level (like the 'oe' signal for all pads
      part of the same bus).

    * Connections between "corona" and I/O pads (that is, at "chip"
      level) go *almost* straight (in our case, they go straight
      because both sides are symbolic layout).

    * And finally, a dedicated "router" connects the I/O pad ring
      connectors together, including corners.

  NOTE: This partitioning also has the advantage of *not* including
        the I/O pad area inside the general routing area, which
        allows me to remove all special cases related to I/O pads
        from Katana (and they are hellish).

> i *think*.... the main thing is that the JTAG clock should not be
> associated with (connected to or controlled by) the pxlib "ck" lines
> on the IO pads:
> 
> 
>   # this is good
> 
>   p_sys_clk_0 : pck_px
>   port map ( pad  => sys_clk
>            , ck   => cki
>            , vdde => vdde
>            , vddi => vddi
>            , vsse => vsse
>            , vssi => vssi
>            );
> 
> 
>   # this is bad
> 
>   p_jtag_tck : pi_px       <- nooo! must be of type pck_px!
>   port map ( ck   => cki       <- baaaad, noooo!
>            , pad  => jtag_tck 
>            , t    => jtag_tck_core
>            , vdde => vdde
>            , vddi => vddi
>            , vsse => vsse
>            , vssi => vssi
>            );
> 
> 
> therefore if similar to this:
> 
>    env.setCLOCK("sys_clk")
> 
> we could have:
> 
>    env.setOUTSIDEOFCORONACLOCK("jtag_tck")
> 
> and it to be treated completely differently, then that would i think work.
> 
> we also would need some of the PLL clock wires similarly.  we cannot
> have them under the control of sys_clk.
> 
> is this unreasonably complex, to work out which clock is associated
> with what, without splitting them into separate files?

  I understand. To avoid putting extra specifications, the fact that
  there are different clocks could be directly guessed from the I/O pad
  netlist (at "chip" level). I think it is not very difficult to
  create disjoint parts in the I/O pad ring.
    The big question is, will each part be made of an assembly of pxlib
  pads *or* will Staf add some other kind with different wiring
  strategies? Before I upgrade the I/O pad "router" I would need a
  clear understanding of the I/O pads' physical interface.
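
A minimal sketch of the "guess it from the netlist" idea: group the pad instances by the net attached to their `ck` terminal, so that each group becomes one disjoint part of the ring. The tuple layout, the instance names, the `jtag_cki` net name and the helper are all hypothetical; a real implementation would read the chip-level netlist through Coriolis rather than from a list literal.

```python
from collections import OrderedDict

def group_pads_by_clock(pads):
    """Map each pad-ring clock net to the pad instances connected to it."""
    groups = OrderedDict()
    for instance, model, ck_net in pads:
        groups.setdefault(ck_net, []).append(instance)
    return groups

# (instance name, pad model, net on the "ck" terminal); illustrative only.
pads = [
    ("p_sys_clk_0", "pck_px", "cki"),
    ("p_gpio_0", "piot_px", "cki"),
    ("p_jtag_tck", "pck_px", "jtag_cki"),
    ("p_jtag_tdi", "pi_px", "jtag_cki"),
]
ring_parts = group_pads_by_clock(pads)

for ck_net, instances in ring_parts.items():
    print(ck_net, instances)
```

Each key would then correspond to one segment of the pad ring with its own clock wire, instead of a single `cki` running all the way around.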
Comment 97 Jean-Paul Chaput 2020-10-01 10:41:47 BST
Me again...

I recall you asking about multiple VDD/VSS but can't find the
comment. Could you post it again?

And, yes, this bug is really too long, we can move to 307...
Comment 98 Staf Verhaegen 2020-10-01 14:34:06 BST
(In reply to Jean-Paul.Chaput from comment #96)

>   I understand. To avoid putting extra specifications, the fact that
>   there is different clocks could be directly guessed from the I/O pad
>   netlist (at "chip" level). I think it is not very difficult to
>   create disjoined parts in the I/O pad ring.
>     The big question is, will each part be made of an assembly of pxlib
>   pads *or* will Staf add some other kind with different wiring
>   strategies? Before I upgrade the I/O pad "router" I would need a
>   clear understanding of the I/O pads physical interface.

I suppose in the end my IO library will need to be used or does pxlib handle ESD protection and IO voltage level shifting ?
What different kind of wiring strategies are you thinking about ?

Also does this need urgent feedback as I would like to first finish the standard cell layout.
Comment 99 Luke Kenneth Casson Leighton 2020-10-01 15:16:29 BST
(In reply to Staf Verhaegen from comment #98)

> I suppose in the end my IO library will need to be used or does pxlib handle
> ESD protection and IO voltage level shifting ?

yes, here i ask if it can handle level-shifting.  based on the VST
(public API) it does appear to have external VSS/VDD and internal VSS/VDD,
what we do not know is: what is the range(s) on each.
https://bugs.libre-soc.org/show_bug.cgi?id=506#c5

> What different kind of wiring strategies are you thinking about ?
> 
> Also does this need urgent feedback as I would like to first finish the
> standard cell layout.

jean-paul is in a meeting right now, free in the afternoon. i suspect
that this is coriolis-specific enhancements and that the full pxlib
API will not - in any way - need modification, so you are fine, Staf.

the only thing which would be nice "in the future" (not now) which needs
a modification is some options to set:

* output drive current
* enable/disable pullup/pulldown
* set "mode" (CMOS, TTL i.e. float or "drive-hi + drive-lo")
* enable/disable Schottky / Schmitt trigger (and its speed) for de-bounce

these are the kinds of sophisticated things that an STM32F or ATSAM has
which make a huge difference to the marketability of an Embedded Controller.

but... definitely not now.  pxlib's API - unmodified - is perfect for this
test ASIC.
Comment 100 Jean-Paul Chaput 2020-10-01 23:17:25 BST
> I suppose in the end my IO library will need to be used or does pxlib handle
> ESD protection and IO voltage level shifting ?
> What different kind of wiring strategies are you thinking about ?
> 
> Also does this need urgent feedback as I would like to first finish the
> standard cell layout.

  the pxlib has two power voltages:

  * vdde / vsse ([e]xternal) for the I/O pads (3.3v in our case)
  * vddi / vssi ([i]nternal) for the core.
  
  By the way, that means that the "outside" must provide both power
  voltages, 3.3v and 1.8v. Is this the usual way?

  The input pads have ESD protection.

  But, what we must be sure of, is the interface. If I divert from
  HFNS to chip/corona creation, what's inside the I/O pads is not
  important. From what you (Staf) said, that seems ok, but it would
  be better if you can confirm.
Comment 101 Staf Verhaegen 2020-10-02 09:34:41 BST
> But, what we must be sure of, is the interface.

Still am not 100% sure what exactly you mean with interface here and want to avoid any possible misinterpretation.

> If I divert from HFNS to chip/corona creation, what's inside the I/O pads
> is not important. From what you (Staf) said, that seems ok, but it would
> be better if you can confirm.

My IO cells will have pins that will be connected directly to pins of the CORONA.
The bonding pad is included in the IO cell.
Comment 102 Jean-Paul Chaput 2020-10-02 10:00:04 BST
(In reply to Staf Verhaegen from comment #101)
> > But, what we must be sure of, is the interface.
> 
> Still am not 100% sure what exactly you mean with interface here and want to
> avoid any possible misinterpretation.
> 
> > If I divert from HFNS to chip/corona creation, what's inside the I/O pads
> > is not important. From what you (Staf) said, that seems ok, but it would
> > be better if you can confirm.
> 
> My IO cells will have pins that will be connected directly to pins of the
> CORONA.
> The bonding pad is included in the IO cell.

  OK. Same structure as pxlib. Luke said your pads could be used as
  a direct replacement for pxlib. Is that so? pxlib has an unusual
  way of generating core clock(s).

  * pck_px     : external_ck (pad) ==> pad_ring_ck (ck)
  * pvddeck_px : pad_ring_ck (ck)  ==> ck_core (cko)
  * pvsseck_px : idem
  * pvddick_px : idem
  * pvssick_px : idem

  Names in parentheses are the real pad names, vs. their meaning.

  I can write plugins to manage different clock arrangements, like
  I do for AMS 350nm for example.
Comment 103 Staf Verhaegen 2020-10-02 11:23:08 BST
(In reply to Jean-Paul.Chaput from comment #102)
> (In reply to Staf Verhaegen from comment #101)
> > > But, what we must be sure of, is the interface.
> > 
> > Still am not 100% sure what exactly you mean with interface here and want to
> > avoid any possible misinterpretation.
> > 
> > > If I divert from HFNS to chip/corona creation, what's inside the I/O pads
> > > is not important. From what you (Staf) said, that seems ok, but it would
> > > be better if you can confirm.
> > 
> > My IO cells will have pins that will be connected directly to pins of the
> > CORONA.
> > The bonding pad is included in the IO cell.
> 
>   OK. Same structure as pxlib. Luke said your pads could be used as
>   a direct replacement of pxlib. Is that so? pxlib have an unusual
>   way of generating core clock(s).

Never wanted to imply my IO would be a full drop-in replacement for pxlib. What I wanted to say is that pxlib could be used to set up the HDL for the top block and it should not be much work to convert it later to use my IO library.

>   * pck_px     : external_ck (pad) ==> pad_ring_ck (ck)
>   * pvddeck_px : pad_ring_ck (ck)  ==> ck_core (cko)
>   * pvsseck_px : idem
>   * pvddick_px : idem
>   * pvssick_px : idem

In my library no special clock cell is present. It is a digital input cell that needs to be connected to the clock pins in the CORONA as any other IO.
I don't have latching functionality inside my IO cells so I don't need to distribute the clock over the IO ring. I assume all this functionality is implemented inside CORONA.
Comment 104 Luke Kenneth Casson Leighton 2020-10-02 11:45:25 BST
(In reply to Staf Verhaegen from comment #103)

> Never wanted to imply my IO would be a full drop-in replacement for pxlib.

ahh ok.  then the coriolis2 plugins would need to be customised to match
and understand it.
Comment 105 Luke Kenneth Casson Leighton 2020-10-02 11:50:44 BST
(In reply to Jean-Paul.Chaput from comment #100)
> > I suppose in the end my IO library will need to be used or does pxlib handle
> > ESD protection and IO voltage level shifting ?
> > What different kind of wiring strategies are you thinking about ?
> > 
> > Also does this need urgent feedback as I would like to first finish the
> > standard cell layout.
> 
>   the pxlib has two power voltages:
> 
>   * vdde / vsse ([e]xternal) for the I/O pads (3.3v in our case)

out of interest can it go down to 1.8v?


>   * vddi / vssi ([i]nternal) for the core.

so it _does_ have level-shifting (because otherwise it would not cope with
the voltage difference between 1.8 and 3.3)


>   By the way, that means that the "outside" must provide both power
>   voltage of 3.3v and 1.8v. Is this the usual way?

yes it is pretty standard.

it is also standard not just to have different voltages (separate IO from
core) but also to:

* have completely different LDO / DCDC PMIC supplies
* have *multiple* IO Voltage domains

this is because load and frequency fluctuations can have an adverse impact
on other IO pads running at completely different frequencies.
Comment 106 Staf Verhaegen 2020-10-02 12:57:48 BST
> >   * vdde / vsse ([e]xternal) for the I/O pads (3.3v in our case)
> 
> out of interest can it go down to 1.8v?

I should be able to go down to 1.8V but I don't currently want to commit to it for the prototype tape-out.