Bug 199 - Layout using coriolis2 main core, 180nm
Summary: Layout using coriolis2 main core, 180nm
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Hardware Layout
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jean-Paul.Chaput
URL:
Depends on: 200
Blocks: 138 204
Reported: 2020-03-02 16:57 GMT by Luke Kenneth Casson Leighton
Modified: 2020-08-11 23:25 BST (History)

See Also:
NLnet milestone: NLNet.2019.Coriolis2
total budget (EUR) for completion of task and all subtasks: 9000
budget (EUR) for completion of task (excludes budget allocated to subtasks): 0
parent task for budget allocation: 138
child tasks for budget allocation:


Attachments
Patch to create experiments9 (140.89 KB, application/x-bzip)
2020-06-30 09:22 BST, Jean-Paul.Chaput
Makefile for nmutil with install dir (298 bytes, text/plain)
2020-06-30 12:35 BST, Jean-Paul.Chaput
staggered pipeline quick drawing (153.28 KB, image/jpeg)
2020-08-02 22:50 BST, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2020-03-02 16:57:18 GMT
do layout for single-core 180nm ASIC including 1st level cache.
also peripherals: the minimum set is 32 bit SDRAM, 16550 UART, JTAG and SPI. secondary priorities are 64 bit SDRAM, GPIO, PWM, EINT, QSPI, SDMMC, RGBTTL, I2C and the pinmux.
package is to be QFP, with a maximum of around 200 pins, including power and ground.

https://libre-soc.org/3d_gpu/layouts/coriolis2_180nm/
Comment 2 Luke Kenneth Casson Leighton 2020-06-25 16:28:25 BST
here's the diagram and page containing notes:
https://libre-soc.org/3d_gpu/layouts/coriolis2_180nm/

the first thing to note: there are quite a lot of Register Files
and there are quite a lot of Function Units.  therefore, as
there are quite a lot of unique Register File Ports, there are
also quite a lot of PriorityPickers (exactly one PP for *each* port).

i would recommend that every Function Unit's inputs and outputs
be on the same "side", because those inputs and outputs ultimately
have to go to the Regfiles, which is a single location.

the only exception to this is the LDSTCompUnit, which has the
additional connectivity to L0CacheBuffer, through which access
to Memory is attained.

LDSTCompUnit can have the memory access on the opposite side of
the registers.
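
The "one PriorityPicker per Register File port" idea can be sketched in plain Python. This is only an illustration of the arbitration behaviour, not the nmutil implementation: the function names and the port names (int_ra, int_rb, int_rt) are hypothetical, and lowest-bit-wins is an assumed priority order.

```python
def priority_pick(requests: int) -> int:
    """Return a one-hot grant for the lowest-numbered requesting Function Unit.

    `requests` is a bitmask: bit i set means Function Unit i wants the port.
    Returns 0 when nobody is requesting.
    """
    # x & -x isolates the least-significant set bit of a non-negative int
    return requests & -requests


def arbitrate_ports(requests_per_port: dict) -> dict:
    """Run one independent picker per regfile port (port names are made up)."""
    return {port: priority_pick(req) for port, req in requests_per_port.items()}


grants = arbitrate_ports({
    "int_ra": 0b10110,  # units 1, 2 and 4 all want the RA read port
    "int_rb": 0b01000,  # only unit 3 wants RB
    "int_rt": 0b00000,  # nobody is writing RT this cycle
})
for port, grant in grants.items():
    print(f"{port}: {grant:05b}")
```

Because each port has its own picker, contention on one port never blocks a grant on another.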
Comment 3 Luke Kenneth Casson Leighton 2020-06-25 17:32:57 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)

> i would recommend that every Function Unit's inputs and outputs
> be on the same "side", because those inputs and outputs ultimately
> have to go to the Regfiles, which is a single location.

 
so, the data which goes through the pipeline will "loop back".

when this is expanded and there are 15 to 28 Function Units, i would recommend having very "narrow" Function Units that have half of their pipeline stages going one direction, turn round, then half of them come back again.

when we include DIV and MUL this will almost certainly need to be done.  these will be very big Function Units as large as all other Function Units combined.

at that point a kind of "tree" will be needed that fans out the "thin" Function Units like branches, all of them leading back to the point where the PriorityPickers are.

the Regfile Broadcast Buses will be quite large (a lot of wires back and forth).  i do not yet have a handle on the relative sizes, here.

however from the internals, we have INTRegs which is 3R2W and whilst the data is 64 bit the "address" remember is in *unary* so is 32 bits, one bit for each register.

therefore there are 96 x 5 wires (64 data plus 32 address bits, for each of the 5 ports) going in and out of the INT regfile.

* for the RA Read Port the 64 bit data fans out to i think 5 Function Units.
* likewise for RB
* for the RT Write Port i think it is 3 fan-in

basically these relationships between Regfiles and Function Units are multiple fan-in and multiple fan-out each being Broadcast Buses, each Bus being managed by a PriorityPicker.
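
The wire count above follows directly from the port arithmetic. A minimal sketch, assuming each port carries the full data width plus one unary address bit per register (the function name is hypothetical):

```python
def regfile_port_wires(data_width, num_regs, read_ports, write_ports):
    """Count wires for a regfile whose address is unary (one bit per register)."""
    per_port = data_width + num_regs          # data bits + unary address bits
    total = per_port * (read_ports + write_ports)
    return per_port, total


# INT regfile as described in the thread: 64 bit data, 32 registers, 3R2W
per_port, total = regfile_port_wires(64, 32, read_ports=3, write_ports=2)
print(per_port, total)
```

A binary address would need only 5 bits per port instead of 32, so the unary encoding is what makes the Broadcast Buses this wide.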
Comment 4 Jean-Paul.Chaput 2020-06-26 16:41:56 BST
Coriolis commit b48f9b4 fixes:

* soclayout/experiments6, we can generate the fpmul64 example without
  using Yosys flatten, the vst should now be correct. The synthesis
  gives ~21K gates and the P&R takes a little above 3 minutes.
  So perfectly manageable.

* soclayout/experiment9, invalid syntax in port map (no right hand
  signal...).

Additional commit f3dd4bc fixes:

* Incomplete hierarchical save (Cumulus rsave plugin).

About the size of test_issuer:

It appears that most of the cells are in the "mem" module, that is
93% of them (for a total of 844321). It seems wrong to me.
IMHO, two possibilities here:

1. The real complexity of the "mem" module was drastically
   underestimated.

2. The way the nMigen code of "mem" is written tricks Yosys into
   doing very unoptimized things.

Note that with such an imbalance in the size of the modules / FU,
a realistic placement makes little sense.

Anyway, I strongly suggest a review of that module to, at least,
understand and justify such a size.
Comment 5 Luke Kenneth Casson Leighton 2020-06-26 16:49:18 BST
(In reply to Jean-Paul.Chaput from comment #4)
> Coriolis commit b48f9b4 fixes:
> 
> * soclayout/experiments6, we can generate the fpmul64 example without
>   using Yosys flatten, the vst should now be correct. The synthesis
>   gives ~21K gates and the P&R takes a little above 3 minutes.
>   So perfectly manageable.

fantastic.

> 
> * soclayout/experiment9, invalid syntax in port map (no right hand
>   signal...).

hmmm... will take a look later, am in the middle of sorting out other
memory-bus stuff.
 
> Additional commit f3dd4bc fixes:
> 
> * Incomplete hierarchical save (Cumulus rsave plugin).
> 
> About the size of test_issuer:
> 
> It appears that most of the cells are in the "mem" module, that is
> 93% of them (for a total of 844321). It seems wrong to me.

yep, it is initialised with 262144 bits.  that means that somewhere the "mem" instance is being passed an address range of (1<<18) where it should only be around 1<<6 for these purposes.

i'll take a look now.

l.
Comment 6 Luke Kenneth Casson Leighton 2020-06-26 17:21:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)

> yep, it is initialised with 262144 bits.  that means that somewhere the
> "mem" instance is being passed an address range of (1<<18) where it should
> only be around 1<<6 for these purposes.
> 
> i'll take a look now.

ok that's down to a more sane (hard-coded) 32 entries so the initialisation
is still large (32*64 bits) however it's not 2^18 bits.
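
For scale, the two sizes mentioned can be compared with a few lines of arithmetic. This only restates the numbers quoted in the thread (2^18 bits before, 32 entries of 64 bits after); the helper name is hypothetical.

```python
WORD_BITS = 64  # data width quoted in the thread


def mem_bits(entries, word_bits=WORD_BITS):
    """Number of storage bits for a memory of `entries` words."""
    return entries * word_bits


oversized_bits = 1 << 18                          # size reported before the fix
reduced_bits = mem_bits(32)                       # 32 hard-coded entries
shrink_factor = oversized_bits // reduced_bits    # how much smaller it became
print(oversized_bits, reduced_bits, shrink_factor)
```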

btw you'll need to git pull on nmutil as well as soc, there's some modifications
to the RecordObject class (which will give different names to some signals,
however those names now include the parent object name)
Comment 7 Jean-Paul.Chaput 2020-06-30 09:22:04 BST
Created attachment 71 [details]
Patch to create experiments9

As port 922 is filtered by my ISP while I am on vacation, I am providing
the commit directly as a patch here.

This is very preliminary work. The router does not complete yet, it reaches
only 99.9%. The coriolis2/settings.py contains some parameter variations
to tweak the router and compare the different outcomes. I will use it to
find geometric cases where the router makes bad decisions and correct them.

A note about nMigen and a potentially annoying bug at installation. nMigen
is listed as a dependency, so setuptools will try to install it.
But if it is not installed *prior* to the Libre-SOC repositories, the
m-lab version will be pulled from the Python archives, which is not
what we need. And if you don't install in system directories, like I
do, you even get two versions... One good and one bad...
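
One way to see which copy of a package Python would actually import is to ask importlib where it lives. This is a minimal sketch using only the standard library; the helper names are hypothetical and the package list is just an example.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    return spec.origin


def report(names=("nmigen", "nmutil", "soc")):
    """Map each package name to the path it resolves to (None if missing)."""
    return {name: locate_module(name) for name in names}


for name, path in report().items():
    print(f"{name:8s} -> {path}")
```

If the path printed for nmigen points somewhere other than the intended checkout, the wrong version has been picked up first on sys.path.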
Comment 8 Luke Kenneth Casson Leighton 2020-06-30 10:09:54 BST
(In reply to Jean-Paul.Chaput from comment #7)
> Created attachment 71 [details]
> Patch to create experiments9

got it.

> As port 922 is filtered by my ISP while I am on vacation, I am providing
> the commit directly as a patch here.

applied and pushed, thank you jean-paul
 
> This is very preliminary work. The router does not complete yet, it reaches
> only 99.9%. The coriolis2/settings.py contains some parameter variations
> to tweak the router and compare the different outcomes. I will use it to
> find geometric cases where the router makes bad decisions and correct them.

i will run it later and see how it looks.  i am fascinated to see how far
it gets.

> 
> A note about nMigen and a potential annoying bug at installation. nMigen
> is listed as a dependency so setup tools will try to install it.
> But if it is not installed *prior* to the Libre-SOC repositories, the
> m-lab version will be pulled from the Python archives.

yes.  this is normal because of the reliance on pip3 (via setuptools)
one solution: we remove all dependencies and expect people to install them 
manually (by way of a script / Makefile).
Comment 9 Jean-Paul.Chaput 2020-06-30 11:56:53 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)

> i will run it later and see how it looks.  i am fascinated to see how far
> it gets.

  Far. It's almost OK. I did it with only 5% of space margin and
  without using METAL6. And, if you run in graphic mode, it's
  fascinating how the placer seems to find the blocks again...
  I am adding a picture as an attachment.


> > A note about nMigen and a potential annoying bug at installation. nMigen
> > is listed as a dependency so setup tools will try to install it.
> > But if it is not installed *prior* to the Libre-SOC repositories, the
> > m-lab version will be pulled from the Python archives.
> 
> yes.  this is normal because of the reliance on pip3 (via setuptools)
> one solution: we remove all dependencies and expect people to install them 
> manually (by way of a script / Makefile).

  I just wanted to hint at this potential problem so people don't
  lose too much time the next time they encounter it. I slightly modified
  the top Makefile so I can install in a non-system directory,
  that is, the Coriolis install tree. This way I keep a system-like
  directory tree while requiring only user permissions.
    As a system administrator, I'm very, very reluctant to directly
  install things in the system tree as root, because after some
  time you lose track of what has been installed or not.
  So only packaged things (rpm, deb) should go there, as the packager
  keeps track for you.
Comment 10 Luke Kenneth Casson Leighton 2020-06-30 12:01:05 BST
(In reply to Jean-Paul.Chaput from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> 
> > i will run it later and see how it looks.  i am fascinated to see how far
> > it gets.
> 
>   Far. It's almost OK. I did it with only 5% of space margin and
>   without using METAL6. And, if you run in graphic mode, it's
>   fascinating how the placer seems to find the blocks again...
>   I am adding a picture as an attachment.

missed :)
 
> > > A note about nMigen and a potential annoying bug at installation. nMigen
> > > is listed as a dependency so setup tools will try to install it.
> > > But if it is not installed *prior* to the Libre-SOC repositories, the
> > > m-lab version will be pulled from the Python archives.
> > 
> > yes.  this is normal because of the reliance on pip3 (via setuptools)
> > one solution: we remove all dependencies and expect people to install them 
> > manually (by way of a script / Makefile).
> 
>   I just wanted to hint at this potential problem so people don't
>   lose too much time the next time they encounter it. I slightly modified
>   the top Makefile so I can install in a non-system directory,

can you send me that so i can take a look?

btw i tried fpmul64 - experiment6 - and yosys is locked up solid 100% CPU
indefinitely in "clean".  which is particularly odd given that it worked
perfectly well last time i tried it.  which admittedly was with "flatten".
Comment 11 Jean-Paul.Chaput 2020-06-30 12:31:49 BST
(In reply to Jean-Paul.Chaput from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> 
> > i will run it later and see how it looks.  i am fascinated to see how far
> > it gets.
> 
>   Far. It's almost OK. I did it with only 5% of space margin and
>   without using METAL6. And, if you run in graphic mode, it's
>   fascinating how the placer seems to find the blocks again...
>   I am adding a picture as an attachment.

    Zut! pdf file is too big (1.1M).
Comment 12 Jean-Paul.Chaput 2020-06-30 12:35:58 BST
Created attachment 72 [details]
Makefile for nmutil with install dir

Very basic patch to set where to install. The install dir may be guessed in a much smarter way...
Comment 13 Luke Kenneth Casson Leighton 2020-06-30 13:55:55 BST
(In reply to Jean-Paul.Chaput from comment #12)
> Created attachment 72 [details]
> Makefile for nmutil with install dir
> 
> Very basic patch to set where to install. The install dir may be guessed in
> a much smarter way...

oh ok i get it.  just python3 setup.py develop --install-dir={somewhere}.
ok that makes sense.

i _believe_ this may be what virtualenv does in a transparent way
(except not everyone loves virtualenv)
Comment 14 Jean-Paul.Chaput 2020-06-30 14:04:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> i _believe_ this may be what virtualenv does in a transparent way
> (except not everyone loves virtualenv)

  The fewer layers of install tools, the better (so I can patch them
  more easily) (IMHO).
Comment 15 Luke Kenneth Casson Leighton 2020-07-16 01:25:37 BST
jean-paul, i see you're back from holiday in the rainy lovely beaches.

i have pushed a couple of updates to test_issuer.il one of which added (then removed) the div unit.  i also, back in issuer.py, provided an option in the code to add pipeline types, so mul can be added etc. by changing one line.

you will need to git pull all soc repositories however *do not* update nmigen right now as there are issues outstanding with it.

i would if there is time very much like to do at least a top level hierarchical layout, regardless but also because there will be space unused.

the reason is that when it comes to showing people the layout, it is possible to point and say, "this is the Logical pipeline" and so on.

to help with that, i would like to be able to set the width but not height or height but not width when doing the area calculation.

what can then be done is:

* run all pipeline layouts with the exact same height (large height)

* get a series of varied widths back for each pipeline (some of them will be very thin, some like MUL will be fat)

* lay them out in a row

* have the regfiles below them, placed optimally closest to the pipelines that need them

based on the widths of the pipelines and the widths of the regfiles it may even be practical to use an algorithm that works out the shortest paths, in 1D.

what's your thoughts, is this reasonable?

then also this would help identify the areas which are not routing, because it is less gates.  also it would speed up layout time.
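
The fixed-height idea can be sketched as follows. This is not the Coriolis estimate function: the helper names, the area figures and the height are all made-up values, and the model simply assumes width = area / height rounded up.

```python
import math


def width_for_height(area, height):
    """Smallest integer width such that width * height covers the area."""
    return math.ceil(area / height)


def place_in_row(areas, height, gap=0):
    """Place blocks left to right; return {name: (x, width)} and total width."""
    placement = {}
    x = 0
    for name, area in areas.items():
        width = width_for_height(area, height)
        placement[name] = (x, width)
        x += width + gap
    total = x - gap if placement else 0
    return placement, total


# hypothetical area estimates, in arbitrary layout units squared
areas = {"alu": 4000, "logical": 2500, "mul": 20000}
placement, total_width = place_in_row(areas, height=100)
for name, (x, width) in placement.items():
    print(f"{name:8s} x={x:4d} width={width:4d}")
print("row width:", total_width)
```

With a shared height, adding logic to one pipeline only changes that pipeline's width and shifts its neighbours along the row, rather than forcing the whole arrangement to be redone.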
Comment 16 Jean-Paul.Chaput 2020-07-16 13:20:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)
> jean-paul, i see you're back from holiday in the rainy lovely beaches.

  Sadly, yes.

> i have pushed a couple of updates to test_issuer.il one of which added (then
> removed) the div unit.  i also, back in issuer.py, provided an option in the
> code to add pipeline types, so mul can be added etc. by changing one line.
> 
> you will need to git pull all soc repositories however *do not* update
> nmigen right now as there are issues outstanding with it.

  Maybe too late, I just did it a couple of days ago. But I can easily
  roll back if you give me a commit hash to stick to.

> i would if there is time very much like to do at least a top level
> hierarchical layout, regardless but also because there will be space unused.
> 
> the reason is that when it comes to showing people the layout, it is
> possible to point and say, "this is the Logical pipeline" and so on.

  I understand very well, it makes the layout much easier to comment on,
  but it is maybe not the most efficient.

> to help with that, i would like to be able to set the width but not height
> or height but not width when doing the area calculation.
> 
> what can then be done is:
> 
> * run all pipeline layouts with the exact same height (large height)
> 
> * get a series of varied widths back for each pipeline (some of them will be
> very thin, some like MUL will be fat)
> 
> * lay them out in a row
> 
> * have the regfiles below them, placed optimally closest to the pipelines
> that need them
> 
> based on the widths of the pipelines and the widths of the regfiles it may
> even be practical to use an algorithm that works out the shortest paths, in
> 1D.

  Making blocks with fixed height or width is easy. The problem lies in
  the top assembly. In ASIC terminology, it's the floorplan (placement of
  the top level blocks). Coriolis has no real support for that yet.

  To achieve that quickly we may try to create blocks that are directly
  connectable side by side, meaning that the connectors are at exactly
  the same position & layer on each side of both blocks. This is ok for
  2-pin nets, but if there are more, we have to route a net through a
  block (it can also be done with minimum fuss).

  Normally we should use a block & channel routing (routing space between
  the blocks).

> what's your thoughts, is this reasonable?

  I will try a first flat run to get a feel about runtime and memory size.
  Then I will see if we must break it. Note that, the ASIC IBM benchmarks
  supplied for the ISPD contest are completely flat (no block whatsoever,
  up to 1 million gates).

> then also this would help identify the areas which are not routing, because
> it is less gates.  also it would speed up layout time.

  Breaking the design into smaller blocks would certainly reduce the P&R time
  and help solve problems one block at a time. As Staf put it some time ago,
  for big ASICs, the maximum run time should be "one night".
Comment 17 Luke Kenneth Casson Leighton 2020-07-16 16:44:54 BST
(In reply to Jean-Paul.Chaput from comment #16)
> (In reply to Luke Kenneth Casson Leighton from comment #15)

>   I understand very well, it makes the layout much easier to comment on,
>   but it is maybe not the most efficient.

not a problem, this is a test chip.



> > based on the widths of the pipelines and the widths of the regfiles it may
> > even be practical to use an algorithm that works out the shortest paths, in
> > 1D.
> 
>   Making blocks with fixed height or width is easy. 

> The problem lies in
>   the top assembly. In ASIC terminology, it's the floorplan (placement of
>   the top level blocks). Coriolis has no real support for that yet.

that's ok.  the alu16 example showed how to do it.

what i do not want is to have pipelines of different widths and heights, where adding even just one new function to one pipeline means the entire layout must be redone.

with the current system, you call a function and it tells you the width *and* height estimate needed to route that block, and they are squares (appx).

i would like the estimate system to be able to set a fixed height, and for it to tell me the width.

then the placement of all pipelines can be lined up.

the inputs and outputs will all be on one side (SOUTH) to connect to register files.


>   To achieve that quickly we may try to create blocks that are directly
>   connectable side by side.

ok so the pipelines except for LDST have *ZERO* connectivity to anything other than the register file Buses.

this is a VERY deliberate hard rule that has been set.

there is NO interconnection between pipelines.

therefore the floorplan is:

* pipelines in a row at the top
* Register Buses and "Priority Pickers" in the middle
* Register Files at the bottom
* Decoder to the side, connected to the Fast Regfile (to get the Program Counter).

so it is very regularly organised.

therefore for most Regfiles, all the ports can be NORTH.
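
With pipelines in a row at the top and Register Files at the bottom, choosing where to put a regfile along the row becomes a 1D problem. A minimal sketch, assuming each connection costs the horizontal distance between the regfile centre and a pipeline centre: the position that minimises the summed distance is the median of the pipeline centres. The coordinates below are made up, and block widths and overlaps between regfiles are ignored.

```python
import statistics


def total_wire(centre, pipeline_centres):
    """Summed horizontal distance from a regfile centre to each pipeline."""
    return sum(abs(centre - c) for c in pipeline_centres)


def best_centre(pipeline_centres):
    """The median minimises the sum of absolute distances in 1D."""
    return statistics.median(pipeline_centres)


# hypothetical x-centres of the pipelines that use one particular regfile
centres = [20, 52, 165]
best = best_centre(centres)
mean = sum(centres) / len(centres)
print("median:", best, "cost:", total_wire(best, centres))
print("mean:  ", mean, "cost:", total_wire(mean, centres))
```

A wide outlier such as MUL pulls the mean towards it, while the median stays under the cluster of thin pipelines, which is what keeps the total bus length down.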

> Meaning that the connectors are at exactly
>   the same position & layer on each side of both blocks. This is ok for
>   2-pin nets, but if there are more, we have to route a net through a
>   block (it can also be done with minimum fuss).

the only one which might need pins on different sides is FAST Regs because it contains the Program Counter, the MSR (sets 64 bit mode, User mode, LE/BE etc).

oh, and LDST of course, the memory connection comes out NORTH but registers are SOUTH.

everything else is extremely regular.

>   Normally we should use a block & channel routing (routing space between
>   the blocks).

nice.
 
> > what's your thoughts, is this reasonable?
> 
>   I will try a first flat run to get a feel about runtime and memory size.
>   Then I will see if we must break it. Note that, the ASIC IBM benchmarks
>   supplied for the ISPD contest are completely flat (no block whatsoever,
>   up to 1 million gates).

mad :)

well given that there are things to fix i would prefer that you are able to run and debug on a fast loop.
Comment 18 Jean-Paul.Chaput 2020-07-21 12:50:56 BST
Hello Luke,

I'm starting to work on:
    soc/src/soc/simple/test/test_issuer.py

And, of course, I get problems. First, I updated all the libre-soc git
repositories and re-installed them. Then I got into various errors
with nMigen. The latest being:

Traceback (most recent call last):
  File "./test_issuer.py", line 21, in <module>
    from soc.simple.test.test_core import (setup_regs, check_regs,
  File ".../soc/src/soc/simple/test/test_core.py", line 28, in <module>
    from soc.fu.alu.test.test_pipe_caller import ALUTestCase
  File ".../soc/src/soc/fu/alu/test/test_pipe_caller.py", line 5, in <module>
    from nmigen.sim.cxxsim import Simulator
ModuleNotFoundError: No module named 'nmigen.sim.cxxsim'

Currently I use nMigen d714d78 (HEAD of 14/07/2020).
I did a git update, but even with the latest one, I can't locate any
nmigen.sim.cxxsim module... What am I doing wrong? Could you point me to
the working commit of nMigen?

Best regards,
Comment 19 Luke Kenneth Casson Leighton 2020-07-21 13:50:32 BST
(In reply to Jean-Paul.Chaput from comment #18)
> Hello Luke,
> 
> I'm starting to work on:
>     soc/src/soc/simple/test/test_issuer.py
> 
> And, of course, I get problems. First, I updated all the libre-soc git
> repositories and re-installed them. Then I got into various errors
> with nMigen. The latest being:
> 
> Traceback (most recent call last):
>   File "./test_issuer.py", line 21, in <module>
>     from soc.simple.test.test_core import (setup_regs, check_regs,
>   File ".../soc/src/soc/simple/test/test_core.py", line 28, in <module>
>     from soc.fu.alu.test.test_pipe_caller import ALUTestCase
>   File ".../soc/src/soc/fu/alu/test/test_pipe_caller.py", line 5, in <module>
>     from nmigen.sim.cxxsim import Simulator
> ModuleNotFoundError: No module named 'nmigen.sim.cxxsim'

ah right.  yes.  that relies on the nmigen cxx_sim branch, apologies.
let me sort that with a try/except Import.
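
The try/except approach to an optional import might look roughly like this. It is a sketch, not the code from the soc repository: the helper name optional_import is hypothetical, and only the module path nmigen.sim.cxxsim and the class name Simulator come from the traceback above.

```python
import importlib
import sys


def optional_import(module_name, attr):
    """Return module_name.attr, or None (with a warning) if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this also
        # covers the case where the parent package itself is not installed
        print(f"warning: {module_name} not available, continuing without it",
              file=sys.stderr)
        return None
    return getattr(module, attr, None)


# only present on the nmigen cxx_sim branch, per the thread
CxxSimulator = optional_import("nmigen.sim.cxxsim", "Simulator")
cxxsim_available = CxxSimulator is not None
print("cxxsim available:", cxxsim_available)
```

Tests that need cxxsim can then check cxxsim_available and skip themselves, rather than failing at import time.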
Comment 20 Luke Kenneth Casson Leighton 2020-07-21 13:54:13 BST
jean-paul: git pull on soc.

commit 3c425869fd36a73a040e07b58050069c6022b0de (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Jul 21 13:53:28 2020 +0100

    make cxxsim optional and print warning
Comment 21 Jean-Paul.Chaput 2020-07-21 14:59:33 BST
Sorry to bother you... Same player shoots again.

Traceback (most recent call last):
  File "./test_issuer.py", line 36, in <module>
    from soc.simulator.test_sim import (GeneralTestCases, AttnTestCase)
  File ".../soc/src/soc/simulator/test_sim.py", line 3, in <module>
    from nmigen.test.utils import FHDLTestCase
ModuleNotFoundError: No module named 'nmigen.test'
Comment 22 Luke Kenneth Casson Leighton 2020-07-21 15:15:59 BST
(In reply to Jean-Paul.Chaput from comment #21)
> Sorry to bother you... Same player shoots again.
> 
> Traceback (most recent call last):
>   File "./test_issuer.py", line 36, in <module>
>     from soc.simulator.test_sim import (GeneralTestCases, AttnTestCase)
>   File ".../soc/src/soc/simulator/test_sim.py", line 3, in <module>
>     from nmigen.test.utils import FHDLTestCase
> ModuleNotFoundError: No module named 'nmigen.test'

sigh it's likely nmigen.test has been removed (or moved) in the past day
or so. sorted:

commit 4d8a7e65660df9e41a061997631763d51dbe2124 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Jul 21 15:14:00 2020 +0100

    spurious imports of FHDLTestCase, should be from nmutil


generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
without coordinating: it's a moving target.
Comment 23 Jean-Paul.Chaput 2020-07-21 18:03:38 BST
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jean-Paul.Chaput from comment #21)
> > Sorry to bother you... Same player shoots again.
> commit 4d8a7e65660df9e41a061997631763d51dbe2124 (HEAD -> master,
> origin/master)
> Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
> Date:   Tue Jul 21 15:14:00 2020 +0100
> 
>     spurious imports of FHDLTestCase, should be from nmutil

  Got it working.
 
> generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
> without coordinating: it's a moving target.

  I totally agree. My update policy is to stick to a version as long
  as it works. Then, when it does not, update to the newest possible.
  So I make leaps between "very old" and "very new". Maybe I missed
  it, but I think you should keep track of the latest "compatible"
  nMigen version, and maybe put it in a doc file at the root of the
  soc repository. So this way people would quickly know which one
  to install.
Comment 24 Luke Kenneth Casson Leighton 2020-07-21 19:50:21 BST
(In reply to Jean-Paul.Chaput from comment #23)
> >     spurious imports of FHDLTestCase, should be from nmutil
> 
>   Got it working.

excellent

> > generally, keeping "up-to-date" with absolute latest nmigen is inadvisable
> > without coordinating: it's a moving target.
> 
>   I totally agree. My update policy is to stick to a version as long
>   as it works. Then, when it does not, update to the newest possible.
>   So I make leaps between "very old" and "very new". Maybe I missed
>   it, but I think you should keep track of the latest "compatible"
>   nMigen version, 

i am... except... well it's complicated, i am helping whitequark debug
cxxsim and also working on the processor: cxxsim should offer up to a *100*
times increase in simulation performance so is worth pursuing.

>   and maybe put it in a doc file at the root of the
>   soc repository. So this way people would quickly know which one
>   to install.

well i think we're good, now.  we did have a point where gtkwave wasn't
working, that i believe is fixed now.  and the spurious import is ok...
probably in the clear, now.

btw do do a "git pull" on soclayout, i just updated non_generated/test_issuer.il

i have removed two read ports on the fast regfile which is so ridiculously
large (20% of the gate area) that it justified the effort.

these were reading the PC and the MSR (Machine Status Register) and i decided
to pass them as "immediates" to Branch and Trap, respectively, rather than
have the CompUnits read them a *second* time from *another* Fast Regfile port.
Comment 25 Luke Kenneth Casson Leighton 2020-07-22 11:00:05 BST
Jean-Paul: ta-daaaa :)
https://ftp.libre-soc.org/2020-07-22_10-55.png

that's from last night.

....
....

  o  Configuration of ToolEngine<Etesian> for Cell <test_issuer>
     - Cell Gauge ...................................................... <sxlib>
     - Place Effort .......................................................... 2
     - Update Conf ........................................................... 2
     - Spreading Conf ........................................................ 1
     - Routing driven .................................................... false
     - Space Margin ......................................................... 5%
     - Aspect Ratio ....................................................... 100%
     - Bloat model .................................................... disabled
  o  Erasing previous placement of <test_issuer>
  o  Creating abutment box (margin:5% aspect ratio:100% g-length:66242.3)
     - Bloat space margin: 0%.
     - <Box 0l 0l 13200l 13200l>
     - GCell grid: [264x264]
  o  Converting <test_issuer> into Coloquinte.
     - H-pitch .............................................................. 5l
     - V-pitch .............................................................. 5l
     - Converting 88436 instances
     - Building RoutingPads (transhierarchical) ...
     - Converting 88746 nets

....
....

     - Track Segment Completion Ratio ....................... 99.99% [685064+94]
     - Wire Length Completion Ratio ..................... 99.98% [43489516+6675]
     - Wire Length Expand Ratio ........................... 6.02% [min:41021805]
     - Unrouted horizontals ........................................ 79.79% [75]
     - Unrouted verticals .......................................... 20.21% [19]
     - Done in .............................................. 4m 29.35s, 726.4Mb
     - Raw measurements ............................... 269.35s, +743884Kb/2.5Gb
  o  Checking Katana Database coherency.
  o  Driving Hurricane data-base.
     - Active AutoSegments .............................................. 788860
     - Active AutoContacts .............................................. 922308
     - AutoSegments ..................................................... 791697
     - AutoContacts ..................................................... 927982
     - Same Layer doglegs ............................................... 791697
     - Done in .................................................. 2.56s, 0 bytes
     - Raw measurements ................................... 2.56157s, +0Kb/2.5Gb
  o  Deleting ToolEngine<Katana> from Cell <test_issuer>
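
The percentages in the log can be reproduced from the bracketed counts. This is a sketch under an assumed reading of the brackets, namely [routed+unrouted] for the completion ratios and [min:...] as the minimum wire length; the function names are hypothetical. The 75 + 19 = 94 unrouted segments matching the "+94" is at least consistent with that reading.

```python
def completion_ratio(done, missing):
    """Percentage completed, given completed and missing amounts."""
    return 100.0 * done / (done + missing)


def expand_ratio(actual, minimum):
    """Percentage by which the actual wire length exceeds the minimum."""
    return 100.0 * (actual - minimum) / minimum


# figures copied from the log above
track = completion_ratio(685064, 94)
wire = completion_ratio(43489516, 6675)
expand = expand_ratio(43489516, 41021805)
unrouted_h = 100.0 * 75 / (75 + 19)
unrouted_v = 100.0 * 19 / (75 + 19)

for label, value in [("track segments", track), ("wire length", wire),
                     ("expand", expand), ("unrouted H", unrouted_h),
                     ("unrouted V", unrouted_v)]:
    print(f"{label:15s} {value:.2f}%")
```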
Comment 26 Jean-Paul.Chaput 2020-07-22 13:47:08 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> Jean-Paul: ta-daaaa :)
> https://ftp.libre-soc.org/2020-07-22_10-55.png
> 
> that's from last night.


Very nice. Almost completed without any optimization, that's a good omen.
I'm working on the "hierarchical" option. We will see which is best...
But the flat way still seems to be working.

Can you give me the total P&R time (as returned by the time command) ?
Comment 27 Luke Kenneth Casson Leighton 2020-07-22 14:00:05 BST
(In reply to Jean-Paul.Chaput from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > Jean-Paul: ta-daaaa :)
> > https://ftp.libre-soc.org/2020-07-22_10-55.png
> > 
> > that's from last night.
> 
> 
> Very nice. Almost completed without any optimization, that's a good omen.

indeed.  and it's soo preeettyyyy :)

> I'm working on the "hierarchical" option. We will see which is best...
> But the flat way still seems to be working.

yes - i think there's still some "configs" not connected? (what string
am i looking for to double-check that?)
 
> Can you give me the total P&R time (as returned by the time command) ?

ha, i have to re-run it.  bear in mind i am clamping the processor speed
on this laptop to 1ghz, to avoid the fan running permanently, pulling in
massive amounts of dust.
Comment 28 Jean-Paul.Chaput 2020-07-22 15:12:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jean-Paul.Chaput from comment #26)
> > (In reply to Luke Kenneth Casson Leighton from comment #25)
> > > Jean-Paul: ta-daaaa :)
> > > https://ftp.libre-soc.org/2020-07-22_10-55.png
> > > 
> > > that's from last night.
> > 
> > 
> > Very nice. Almost completed without any optimization, that's a good omen.
> 
> indeed.  and it's soo preeettyyyy :)
> 
> > I'm working on the "hierarchical" option. We will see which is best...
> > But the flat way seems still working.
> 
> yes - i think there's still some "configs" not connected? (what string
> am i looking for to double-check that?)

  To be used with profit, you need to understand the overall way the
  P & R algorithm and data works. This would need a not so short
  explanation. Besides I'm also testing a "routing driven" placement
  which is not committed yet. I will experiment, then send you back
  the right set of tuning parameters.
    And, yes, I really do need to write documentation about how
  the P & R works and how it relates to the configuration parameters
  so people can play too ;-)


> > Can you give me the total P&R time (as returned by the time command) ?
> 
> ha, i have to re-run it.  bear in mind i am clamping the processor speed
> on this laptop to 1ghz, to avoid the fan running permanently, pulling in
> massive amounts of dust.

  Nice feature. What (Linux) software does that? I'm interested.
  Especially since my laptop seems to develop the habit of overheating
  in my backpack when put in "suspend to RAM". Now I'm using suspend
  to disk, praying there will be no memory corruption...

  Don't bother. But if you have the whole trace, I can get the
  numbers from there. Especially the placement time and the
  layer assignment step, which is *way* too slow, have to find
  out where I did put a quadratic thing inside...
Comment 29 Luke Kenneth Casson Leighton 2020-07-22 15:37:48 BST
(In reply to Jean-Paul.Chaput from comment #28)

> > yes - i think there's still some "configs" not connected? (what string
> > am i looking for to double-check that?)
> 
>   To be used with profit, you need to understand the overall way the
>   P & R algorithm and data works. This would need a not so short
>   explanation. Besides I'm also testing a "routing driven" placement
>   which is not committed yet. I will experiment, then send you back
>   the right set of tuning parameters.
>     And, yes, I really do need to write documentation about how
>   the P & R works and how it relates to the configuration parameters
>   so people can play too ;-)

:)

ok.
 
> 
> > > Can you give me the total P&R time (as returned by the time command) ?
> > 
> > ha, i have to re-run it.  bear in mind i am clamping the processor speed
> > on this laptop to 1ghz, to avoid the fan running permanently, pulling in
> > massive amounts of dust.
> 
>   Nice feature. What (Linux) software does that?

"apt-get install cpufreqd cpufreq-utils" then edit /etc/cpufreqd.conf,
create (or use) a pre-existing config - i have created "Performance Low",
set it to maxfreq=20% then further down on [Rule] AC=on set it as
the required profile.

you need acpid also installed, for this to work... *i believe*.

but the only reason that works is because i do *not* run "standard
desktop window manager software".  i run fvwm2, which over two
decades has been customised and added to (a manual systemtray program
for example, placed using a line in ~/.xinitrc)

so if you do happen to use gnome or kde it *might* interfere with the
above, i.e. you *might* have to look in whatever-control-panel-blah-blah
is in use, and there i can't help you.


> I'm interested.
>   Especially since my laptop seems to develop the habit of overheating
>   in my backpack when put in "suspend to RAM". Now I'm using suspend
>   to disk, praying there will be no memory corruption...

s2disk tends to work very well, i have found.  only "unreliable laptops"
tend to crash during resume, where s2ram tends not to come back to life.

 
>   Don't bother. But if you have the whole trace, I can get the
>   numbers from there. Especially the placement time and the
>   layer assignment step, which is *way* too slow, have to find
>   out where I did put a quadratic thing inside...

it finished just now

real    89m53.343s
user    87m32.070s
sys     2m22.477s

holy s*** the nohup.out file is 872 mb.

it will be at https://ftp.libre-soc.org/nohup.out.bz2 shortly - do not download it immediately, because i am still rsync'ing it up and compressing it at the same time.

ok it's done.

please let me know when you have it (intact) because at 67mb that's far
too large to leave on the server.
Comment 30 Jean-Paul.Chaput 2020-07-22 16:03:36 BST
> so if you do happen to use gnome or kde it *might* interfere with the
> above, i.e. you *might* have to look in whatever-control-panel-blah-blah
> is in use, and there i can't help you.

  I'm under Xfce... I'm also using Emacs in Vi mode 8-) (to get shouted at
  by both sides).
 
> s2disk tends to work very well, i have found.  only "unreliable laptops"
> tend to crash during resume, where s2ram tends not to come back to life.

  I will try it if the standard hibernate fails...
 
> it finished just now
> 
> real    89m53.343s
> user    87m32.070s
> sys     2m22.477s
> 
> holy s*** the nohup.out file is 872 mb.

  That's "very verbose" mode for you! In this mode, for each routing
  event processed, it writes one line. As it does not put a newline, you
  don't see it, except when redirected into a file: you get a very, very
  long line... You have about 700,000 track segments, so on average
  two events per segment, so at least 1.4M lines...
     This feature is useful to me when I get determinism problems,
  I can perform a comparison and see exactly where and on which event
  the divergence occurs.

> it will be at https://ftp.libre-soc.org/nohup.out.bz2 shortly - do not
> download it immediately, because i am still rsync'ing it up and compressing
> it at the same time.
> 
> ok it's done.
> 
> please let me know when you have it (intact) because at 67mb that's far
> too large to leave on the server.

Seems I got kicked out ;-)

--2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
Resolving ftp.libre-soc.org (ftp.libre-soc.org)... 46.235.227.77, 2a00:1098:82:f::1
Connecting to ftp.libre-soc.org (ftp.libre-soc.org)|46.235.227.77|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-07-22 16:50:40 ERROR 403: Forbidden.
Comment 31 Luke Kenneth Casson Leighton 2020-07-22 17:05:07 BST
(In reply to Jean-Paul.Chaput from comment #30)
> > so if you do happen to use gnome or kde it *might* interfere with the
> > above, i.e. you *might* have to look in whatever-control-panel-blah-blah
> > is in use, and there i can't help you.
> 
>   I'm under Xfce... I'm also using Emacs in Vi mode 8-) (to get shouted at
>   by both sides).

niiice :)

xfce is still a "full desktop" that uses parts of gnome2 low-level infrastructure so it maaay still interfere, and/or there may be somewhere in xfce4 control panel.

only by using a 25-year-old window manager (fvwm2) and running it with
"startx" do i actually get full control over what i want, with no interference.


>      This feature is useful to me when I get determinism problems,
>   I can perform a comparison and see exactly where and on which event
>   the divergence occurs.

yeh no makes sense

> Seems I got kicked out ;-)
> 
> --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2

# chmod ugo+r ./nohup.out.bz2

try again
Comment 32 Jean-Paul.Chaput 2020-07-22 17:11:11 BST
> > Seems I got kicked out ;-)
> > 
> > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> 
> # chmod ugo+r ./nohup.out.bz2
> 
> try again

  Got it. You can remove...
Comment 33 Jean-Paul.Chaput 2020-07-23 12:01:19 BST
> > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> 
> # chmod ugo+r ./nohup.out.bz2
> 
> try again

OK. When looking at the log file, I did see that you did make the P&R
twice... As it is deterministic, you get twice the same result, *but*
very strangely, the second run is much slower than the first.

Runs:
   Place       GlobR   BDetR   LAssign   DetR   Destroy   Total
1    394 + 2 +    58 +    34 +     685 +  270 +       3   1446 (24 minutes)
2   1496 + 8 +   226 +   137 +    2205 + 1010 +       8   5090 (84 minutes)

You can find those times by searching for 'Done in' in the log file.

There may be a flaw in the Makefile system. As the routing fails, a
"failed" status is returned to the calling rule, so it may start the
P&R again. Did you just run "make lvx" or "make layout; make lvx"?
I'm also curious about why so different runtimes. Was your computer
much more loaded the second time? Or did you not throttle the CPU the
first time?

You can see that LAssign (Layer Assignment) takes more time than
the whole placement. This is not normal considering what it does.
So, if I find what's wrong we can win almost 10 minutes over 24...
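
As a cross-check of the table above, here is a minimal Python sketch that sums the per-stage times and works out how much of each run is spent in LAssign. The numbers are copied from the table; the second column has no header in the original, so the name "unlabelled" is only a placeholder, and `summarise` is a hypothetical helper.

```python
# Per-stage times in seconds, copied from the table in comment 33.
# The second column has no header in the original; "unlabelled" is a placeholder.
STAGES = ["Place", "unlabelled", "GlobR", "BDetR", "LAssign", "DetR", "Destroy"]
RUNS = {
    1: [394, 2, 58, 34, 685, 270, 3],
    2: [1496, 8, 226, 137, 2205, 1010, 8],
}


def summarise(times):
    """Return (total seconds, whole minutes, LAssign share of the total)."""
    total = sum(times)
    lassign = times[STAGES.index("LAssign")]
    return total, total // 60, lassign / total


for run, times in RUNS.items():
    total, minutes, share = summarise(times)
    print(f"run {run}: {total}s ({minutes} min), LAssign {share:.0%} of total")

# Per-stage slowdown of run 2 relative to run 1.
for stage, first, second in zip(STAGES, RUNS[1], RUNS[2]):
    print(f"{stage:>10}: x{second / first:.1f}")
```

The totals reproduce the 1446 and 5090 seconds of the table, and LAssign accounts for a bit under half of each run.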
Comment 34 Luke Kenneth Casson Leighton 2020-07-23 12:20:21 BST
(In reply to Jean-Paul.Chaput from comment #33)
> > > --2020-07-22 16:50:30--  https://ftp.libre-soc.org/nohup.out.bz2
> > 
> > # chmod ugo+r ./nohup.out.bz2
> > 
> > try again
> 
> OK. When looking at the log file, I did see that you did make the P&R
> twice... 

i did?  i didn't!  however i did run "make cgt" in a separate window
in order to get the... no wait, that was before doing this run.

wasn't me, boss

> As it is deterministic, you get twice the same result, *but*
> very strangely, the second run is much slower than the first.
> 
> Runs:
>    Place       GlobR   BDetR   LAssign   DetR   Destroy   Total
> 1    394 + 2 +    58 +    34 +     685 +  270 +       3   1446 (24 minutes)
> 2   1496 + 8 +   226 +   137 +    2205 + 1010 +       8   5090 (84 minutes)
> 
> You can find those times by searching for 'Done in' in the log file.
> 
> There may be a flaw in the Makefile system. As the routing fails, a
> "failed" status is returned to the calling rule, so it may start the
> P&R again. Did you just run "make lvx" or "make layout; make lvx"?

"make layout" and in a separate window i had run "make cgt" - *before*
starting this run.

> I'm also curious about why so different runtimes. Was your computer
> much more loaded the second time? 

not really.  it is 8-core dual hyper-threaded

> Or did you not throttle the CPU the
> first time?

once it's set up it's a pain to change, so no change.
 
> You can see that LAssign (Layer Assignment) takes more time than
> the whole placement. This is not normal considering what it does.
> So, if I find what's wrong we can win almost 10 minutes over 24...

and if i let it run at 5ghz that saves time, too.

btw one other reason i really want to do sub-cell layouts is to have
the possibility of parallel make.

l.
Comment 35 Luke Kenneth Casson Leighton 2020-07-25 13:18:50 BST
okaay jean-paul, about the floor-plan layout version:

i have renamed all of the operands so that they now have the following
prefix format:

* oper_i_alu_PIPENAME{N}_{field}
* oper_i_ldst_ldst{N}_{field}

where earlier i recommended to put all I/O on the *bottom* (SOUTH) of each pipeline (except LD/ST which would specially have the Memory interface on NORTH) i thought about this a bit more, and realised that the opcode expansion is going to be too many wires.

oper_i_alu_alu0 and oper_i_alu_logical0 for example, the 32-bit incoming instruction is expanded to *ONE HUNDRED AND THIRTY* wires because it contains, for example, the expansion of the immediate field out to its full 64-bit.

i will do something about this... but not right now.

so what i figured was: those operand wires could come in at the *side* (RIGHT)
down at the bottom (SOUTH) part of the RIGHT (left?) side.

LDST:

    Memory (PortInterface)
                 ^
                 |
             +---|---------+
             |             |
             |  +-----+    |
             |  |     |    |
             | pipe2 pipe3 |
             |  ^     |    |
             |  |     v    |
 oper_i_alu->--pipe1 pipe4 |
             |  |     |    |
             +--|-----|----+
                ^     v
              IN regs OUT


the oper_i_alu_* needs to "propagate" through the pipeline in synchronisation
with the IN regs data, therefore it is sensible to have oper_i_alu_* be close to IN data/regs, rather than be on the opposite side.

therefore, if IN is better placed on the *right* of OUT at the SOUTH side, then oper_i_alu_* should _also_ go on the RIGHT side, close to the bottom.

many (most) of these Pipelines i expect to be quite "thin".  ALU0 for example, or TRAP0, or Branch0.

however MUL will be extremely "fat" (take up literally 50% of the entire width of the layout).

the "thin" ALUs are what concerns me, hence the idea of bringing in the oper_i_* signals in at the "side".

what this in turn means is that there will be some clear delineation / separation between the pipelines, potentially needing those "channels" you mentioned, because those oper_i signals will have to be routed *between* the pipelines (which are laid out in a horizontal line, just like in the CDC 6600 diagram).

https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-multiple_function_units.png

one other alternative is to have, exactly as is shown *in* that diagram,
complete separation between input and output on pipelines: however this
means that one of the register WRITE and READ port Buses need routing *round*
(to the top).  i am not keen on that.

in the arrangement where pipeline data goes in and out of the same side
(and the pipeline doubles back on itself), both the WRITE and READ regfile
ports may be very close to the pipelines.


fascinatingly, the full auto-route version actually *mixes* the FAST regfile,
TRAP0 pipeline and BRANCH0 pipeline, interspersed in some regularly
patterned "blobs", side-by-side with each other.

this is because TRAP0 and BRANCH0 both need significant access to the FAST
regfile.

anyway.

yes.

summary of idea: oper_i_alu_* field signals on EAST or WEST, and i've
named them conveniently so that they can be searched for, easily, with
pattern-matching that prefix.
Comment 36 Jean-Paul.Chaput 2020-07-26 23:23:18 BST
Thanks for the hints.

Just to let you know, I'm working on the floorplan, so I'm studying closely
the structure of the issuer netlist, as created after going through Yosys.

I've noticed something "unusual" (in my experience in hierarchical ASICs),
some models have blocks and a few standard cells (relatively speaking).
When processed, flat, this is not a problem, but if we want to place
each block, then the top level, the placement of those stray cells
may be difficult (meaning: far from optimal).
Comment 37 Luke Kenneth Casson Leighton 2020-07-27 00:12:36 BST
(In reply to Jean-Paul.Chaput from comment #36)
> Thanks for the hints.
> 
> Just to let you know, I'm working on the floorplan, so I'm studying closely
> the structure of the issuer netlist, as created after going through Yosys.

"show (modulename)" helps there.


> I've noticed something "unusual" (in my experience in hierarchical ASICs),
> some models have blocks and a few standard cells (relatively speaking).
> When processed, flat, this is not a problem, but if we want to place
> each block, then the top level, the placement of those stray cells
> may be difficult (meaning: far from optimal).

ta-daaa, nooow you know why i suggested the "group placement" idea :)

just like in the alu16 example, it contains:

* a add16 cell
* a sub16 cell
* some gates that mux add16 and sub16

i.e. not:

* a add16 cell
* a sub16 cell
* another cell containing the mux
* wires-only between those 3 cells.

it is basically pretty normal to do this, particularly for pipeline layouts.

for example we have:

* a combinatorial stage1 cell
* a combinatorial stage2 cell etc etc
* some gates that include registers but NOT in their own cell(s)

i.e. the source code was basically given a list of combinatorial blocks as modules, told what the API is, and *in that parent* throws some register latches together in a for-loop, dropping both the sub-modules and the registers into its context.

in this way you regularly get very large blocks mixed in with "a few gates".

this is why i suggested the modification to allow a *list* of cells to be placed in a Box.

and also as you see in alu16 i created that function which tracks down the connections between cells by pattern matching on the *net* name rather than have to name the cells explicitly.

changing the design to put everything into submodules, this would significantly reduce code clarity.

however if it proves absolutely necessary it *MAY* be possible to automate that, at the python level.

it would require a change to the entire codebase which is hundreds of modules so would be a last resort.
Comment 38 Luke Kenneth Casson Leighton 2020-07-27 00:33:02 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/alu_hier.py;h=59bca26e358051b9579a9686833c8a9c93a1e393;hb=8c398d8c100be10b21cc2c193b39a112cc331dc1#l188

example.

see those 4 submodules, add, sub, mul, shift? really clear, nice modular code, right?

and further down, you can see ready/valid signalling to manage them, yes?

whilst the add, sub etc. end up as individual big cells, the ready/valid logic is *in this module as well* and that is where the dozens of "unmanaged" cells end up coming from which are not part of any "child" module.

to try to _create_ a submodule manually just to "contain" these, it is not a good idea.

firstly, it makes a mess of the code clarity.

secondly, some of those cells would be best placed in between say add and sub, whilst others are best placed between shift and mul.  others, near *some* inputs, others near *some* outputs.

or.. whatever.  you get the general idea.

one way to cheat is to simply put the child blocks in the parent in such a way that these "floating" cells simply have very little choice about where else they can be placed.  right against the left edge, for example, or make the child the exact same width as the parent.

the ideal thing however, the "base" to work from, is to allow etesian.place to take a *list* of cells to be placed.

a second improvement would be to be able to specify a *list* of abutment boxes to etesian place, which are unioned together to create a complex area beyond a rectangle.

however the list of boxes can be (partly) done by calling etesian.place multiple times with different batches of cells.
Comment 39 Jean-Paul.Chaput 2020-07-27 10:30:23 BST
> * a add16 cell
> * a sub16 cell
> * some gates that mux add16 and sub16
> 
> i.e. not:
> 
> * a add16 cell
> * a sub16 cell
> * another cell containing the mux
> * wires-only between those 3 cells.
> 
> it is basically pretty normal to do this, particularly for pipeline layouts.
> 
> for example we have:
> 
> * a combinatorial stage1 cell
> * a combinatorial stage2 cell etc etc
> * some gates that include registers but NOT in their own cell(s)
> 
> i.e. the source code was basically given a list of combinatorial blocks as
> modules, told what the API is, and *in that parent* throws some register
> latches together in a for-loop, dropping both the sub-modules and the
> registers into its context.
> 
> in this way you regularly get very large blocks mixed in with "a few gates".

I understand very well your need for the clearest possible code.
I think it should not be needed that you isolate the few additional gates
in a special (and artificial) block.

My concern is that, if we take the ADD/SUB, the blocks will be
dwarfing the stray cells, and we get very sub-optimal placement.

Etesian is already capable of using ADD then SUB as big placed blocks
and then placing the remaining few cells around them, where we left
some free space, so inside a non-square area. That is, a square area,
minus the area of the already placed blocks. This is what is done in
an earlier layout experiment with doAlu16.

We may get away with it, *if* the stray cells are clearly a vector
and we can find back easily their matrix structure. But even then,
it has to be a very very regular structure or we will lose a big
amount of free space if the row or columns are "dented".
This is what you hinted in your signal naming scheme, I will see
what I can do.
Comment 40 Luke Kenneth Casson Leighton 2020-07-27 11:00:26 BST
(In reply to Jean-Paul.Chaput from comment #39)
> > * a add16 cell
> > * a sub16 cell
> > * some gates that mux add16 and sub16
> > 
> > i.e. not:
> > 
> > * a add16 cell
> > * a sub16 cell
> > * another cell containing the mux
> > * wires-only between those 3 cells.
> > 
> > it is basically pretty normal to do this, particularly for pipeline layouts.
> > 
> > for example we have:
> > 
> > * a combinatorial stage1 cell
> > * a combinatorial stage2 cell etc etc
> > * some gates that include registers but NOT in their own cell(s)
> > 
> > i.e. the source code was basically given a list of combinatorial blocks as
> > modules, told what the API is, and *in that parent* throws some register
> > latches together in a for-loop, dropping both the sub-modules and the
> > registers into its context.
> > 
> > in this way you regularly get very large blocks mixed in with "a few gates".
> 
> I understand very well your need for the clearest possible code.
> I think it should not be needed that you isolate the few additional gates
> in a special (and artificial) block.
> 
> My concern is that, if we take the ADD/SUB, the blocks will be
> dwarfing the stray cells, and we get very sub-optimal placement.

this is why the place-by-list.  if there is huge space, and the number
of stray cells relatively small, the priority is to get them roughly
in the right place, not to get them super-efficiently packed.

setting 10% or even 50% extra etesian space on something that contains
only 50 stray cells, when the alternative is that they are placed miles
away from the (large) sub-cells, this is way better, even though the
Etesian place was never originally designed for such tiny placement.

> We may get away with it, *if* the stray cells are clearly a vector
> and we can find back easily their matrix structure. 

ah.  right.  so this was where i began exploring an alternative to matrix
design, in alu16.py

written in python, i created a recursive "net-analyser" subroutine.

its parameters give a starting *net* - not a cell pattern-match - a *net*
pattern-match list - and it will loop on the following:

* find all cells connected to the net-pattern.
* find all nets connected to those cells
* if there are no new nets, stop.
* otherwise:
   A) add the cells found to the "accumulated result"
   B) continue recursively searching with the NEWLY FOUND nets
      to find MORE cells

it is a Graph walking algorithm, basically, identifying related cells given
net names.

in this way you *do not* need to know *anything* about the names of the
cells that are connected *to* the nets.  they can change any time, and
you just don't care, we can continue developing the HDL and *not* worry about having to throw away massive amounts of coriolis2 hand-crafted layout code (because there isn't any).

what you care about is "what cells are connected to o[15]" and the
function will give you that answer.

*this* allows to pick up the "stray cells" because you obviously know the Inputs/Output NET names of the sub-block (ADD16, SUB16), and can run a for-loop on them, calling this function and accumulating everything associated with them.

what we do not want to then have to do is to do manual placement on those stray cell groups, having identified them all (or appropriate subsets), we just need to be able to do a small "rough" placement / grouping, in an area that will be mostly routing wires, in between one of the sub-blocks and another sub-block, or between any given sub-block and the I/O of the parent.
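
The net-walking loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in alu16.py: the netlist is a plain dict rather than Coriolis objects, the recursion is written as an iterative worklist, and `walk_nets`, `boundary_cells` and all the net and cell names in the example are hypothetical. Net patterns are regular expressions so that parentheses or brackets in net names can be escaped explicitly.

```python
import re
from collections import defaultdict


def invert(net_to_cells):
    """Build the cell -> nets map from the net -> cells map."""
    cell_to_nets = defaultdict(set)
    for net, cells in net_to_cells.items():
        for cell in cells:
            cell_to_nets[cell].add(net)
    return cell_to_nets


def walk_nets(net_to_cells, start_pattern, boundary_cells=frozenset()):
    """Collect every cell reachable from the nets matching start_pattern.

    Cells listed in boundary_cells (an assumption: the big sub-blocks such
    as add16/sub16) are neither collected nor walked through, so the search
    stays among the stray cells around them.
    """
    cell_to_nets = invert(net_to_cells)
    start = re.compile(start_pattern)
    pending = {net for net in net_to_cells if start.fullmatch(net)}
    seen_nets, found = set(), set()
    while pending:                      # no new nets: stop
        seen_nets |= pending
        cells = set()
        for net in pending:             # cells connected to the current nets
            cells |= net_to_cells[net]
        cells -= boundary_cells
        found |= cells                  # accumulate the result
        new_nets = set()
        for cell in cells:              # nets connected to those cells
            new_nets |= cell_to_nets[cell]
        pending = new_nets - seen_nets  # continue with newly found nets only
    return found


# Hypothetical alu16-like netlist: two big blocks plus three stray cells.
NETLIST = {
    "a(0)": {"add16", "sub16"},
    "add_o(0)": {"add16", "mux_0"},
    "add_o(1)": {"add16", "mux_1"},
    "sub_o(0)": {"sub16", "mux_0"},
    "sub_o(1)": {"sub16", "mux_1"},
    "op": {"inv_op"},
    "op_n": {"inv_op", "mux_0", "mux_1"},
    "o(0)": {"mux_0"},
    "o(1)": {"mux_1"},
}

stray = walk_nets(NETLIST, r"add_o\(\d+\)", boundary_cells={"add16", "sub16"})
print(sorted(stray))
```

Starting from the outputs of add16, the walk picks up the two mux cells and the inverter feeding them, without ever naming those cells. In a real design, shared nets such as the clock, reset and supplies would presumably need excluding too, otherwise the walk reaches everything.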
Comment 41 Luke Kenneth Casson Leighton 2020-07-29 14:22:51 BST
jean-paul i just checked something to be possible in yosys: to be able
to flatten individual modules rather than all of it (top).

this works fine.

so, to support this: if the YOSYS_FLATTEN can take, instead of a "yes/no"
(might need a new Makefile parameter, YOSYS_FLATTEN_LIST), the following:

          YOSYS_FLATTEN_LIST=`cat to_flatten.txt`

and "to_flatten.txt" to contain at least:

fast
cr
xer
slow
int
pdecode2
alu0
branch0
cr0
trap0
ldst0

and probably many more (basically the list of everything for which a top-level
block is to be written) this will get rid of many of the problems of
"dangling nets" without having to have a full flatten.

btw one other way is for that YOSYS_FLATTEN_LIST to be the output from
a python script that actually notices what's been declared as being
sub-cells (top level hierarchy) rather than have a separately-maintained
file that could get out of sync.
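
A minimal sketch of how such a list could be turned into both the Makefile value and a Yosys script fragment. It assumes that Yosys's `flatten` pass accepts a selection naming a single module and that the module names in the netlist match the names listed above exactly; both would need checking against the real design. `module_names` and `yosys_flatten_script` are hypothetical helpers, and the `#` comment handling is an addition for convenience.

```python
def module_names(listing):
    """Parse a to_flatten.txt-style listing: one module name per line.

    Blank lines, '#' comments and duplicate names are dropped (assumption:
    the real file may not need any of this).
    """
    names = []
    for line in listing.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


def yosys_flatten_script(names):
    """Emit one 'flatten <module>' command per listed module."""
    return "\n".join(f"flatten {name}" for name in names)


# Contents of to_flatten.txt as given in comment 41.
TO_FLATTEN = """\
fast
cr
xer
slow
int
pdecode2
alu0
branch0
cr0
trap0
ldst0
"""

names = module_names(TO_FLATTEN)
print("YOSYS_FLATTEN_LIST=" + " ".join(names))
print(yosys_flatten_script(names))
```

Generating the list from the script that declares the sub-cells, as suggested above, would replace `TO_FLATTEN` with whatever that script reports.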
Comment 42 Jean-Paul.Chaput 2020-07-29 15:18:08 BST
(In reply to Luke Kenneth Casson Leighton from comment #41)
> jean-paul i just checked something to be possible in yosys: to be able
> to flatten individual modules rather than all of it (top).
> 
> this works fine.
> 
> so, to support this: if the YOSYS_FLATTEN can take, instead of a "yes/no"
> (might need a new Makefile parameter, YOSYS_FLATTEN_LIST), the following:
> 
>           YOSYS_FLATTEN_LIST=`cat to_flatten.txt`
> 
> and "to_flatten.txt" to contain at least:
> 
> fast
> cr
> xer
> slow
> int
> pdecode2
> alu0
> branch0
> cr0
> trap0
> ldst0
> 
> and probably many more (basically the list of everything for which a
> top-level
> block is to be written) this will get rid of many of the problems of
> "dangling nets" without having to have a full flatten.
> 
> btw one other way is for that YOSYS_FLATTEN_LIST to be the output from
> a python script that actually notices what's been declared as being
> sub-cells (top level hierarchy) rather than have a separately-maintained
> file that could get out of sync.

Good catch. I will integrate that, and try to implement a reliable way
of generating the list.

Still working through the placement of the issuer. The placement of the
IO pins of each block is a lengthy and tedious work... Have to solve
shortage of length along some sides. Hope to have something tomorrow.
Comment 43 Luke Kenneth Casson Leighton 2020-07-29 15:56:44 BST
(In reply to Jean-Paul.Chaput from comment #42)

> Good catch. I will integrate that, and try to implement a reliable way
> of generating the list.

cool.
 
> Still working through the placement of the issuer. The placement of the
> IO pins of each block is a lengthy and tedious work... Have to solve
> shortage of length along some sides.


in https://bugs.libre-soc.org/show_bug.cgi?id=199#c35 i described that i
put a prefix in front of names for the operand data, to make that easier.

it should _not_ be laborious, *at all*, to identify the signals.
it should be extremely easy, one line, one pattern.

also: yes as in https://bugs.libre-soc.org/show_bug.cgi?id=199#c35 if you
try to put everything on SOUTH it is pretty much guaranteed not to be
enough space, hence i suggested putting oper_i_* on EAST (or WEST).

for alu0 (and others), the pattern you probably discovered already is:

* oper_i_alu_**** (these are the ones i suggested to bring in on E/W)
* issue_i
* busy_o / done_o
* shadown_i
* rdmaskn
* rd_go / rd_rel
* wr_go / wr_rel
* srcNN_i
* destNN_o and their matching *_ok

please remember: *anything* that is not "convenient" (like this big list
which cannot be easily identified with a couple of wildcard matches)
means "i got something wrong", ok?


> Hope to have something tomorrow.

feel free to commit regularly even unfinished work for review.  that would
give me a chance to "fix" things that are clearly taking time.
Comment 44 Luke Kenneth Casson Leighton 2020-07-29 16:25:10 BST
commit 2c2a9a9ddf07f77ebfb06abb1e7691d462549c19 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Jul 29 16:19:08 2020 +0100

    bit of a big change: add prefixes "cu_" to all CompUnit management signals
    also change go/rel to go_i and rel_o at the same time

i've just sorted that so now e.g. alu0.vst should be more like this:

entity alu0 is
  port ( clk                    : in bit
       ; cu_go_die_i            : in bit
       ; cu_issue_i             : in bit
       ; oper_i_imm_data_imm_ok : in bit
       ; oper_i_invert_a        : in bit
....
....
       ; rst                    : in bit
       ; cu_shadown_i           : in bit
       ; src3_i                 : in bit
....
       ; cu_busy_o              : out bit
....
       ; vdd                    : linkage bit
       ; vss                    : linkage bit
       );
end alu0;

so now you should need only these search patterns:

- cu_*
- oper_i_*
- src*_i
- dest*_o
- *_ok

and that should cover everything.  save time! :)
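
As an illustration of how little is needed, here is a short Python sketch that groups port names using exactly those five wildcard patterns. The port names are taken from the alu0 listing above; `classify` is a hypothetical helper, and the first-match-wins ordering is an assumption (it matters for names such as oper_i_imm_data_imm_ok, which match both oper_i_* and *_ok).

```python
from fnmatch import fnmatchcase

# The five search patterns from the list above, checked in this order.
PATTERNS = ["cu_*", "oper_i_*", "src*_i", "dest*_o", "*_ok"]

# Port names copied from the alu0.vst excerpt.
PORTS = [
    "clk", "cu_go_die_i", "cu_issue_i", "oper_i_imm_data_imm_ok",
    "oper_i_invert_a", "rst", "cu_shadown_i", "src3_i", "cu_busy_o",
    "vdd", "vss",
]


def classify(port):
    """Return the first pattern the port matches, or None if none does."""
    for pattern in PATTERNS:
        if fnmatchcase(port, pattern):
            return pattern
    return None


groups = {}
for port in PORTS:
    groups.setdefault(classify(port), []).append(port)

for pattern, members in groups.items():
    print(pattern, members)
```

In this excerpt only clk, rst, vdd and vss fall outside the five patterns.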



i pushed a new test_issuer.il to experiment9/non_generated

commit 2747314b7ba8cb04f8ca50b1ed016a66831a18b0
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Jul 29 15:20:08 2020 +0000

    updated test_issuer.il to include new names
Comment 45 Luke Kenneth Casson Leighton 2020-07-29 22:14:20 BST
http://www.aholme.co.uk/6502/Main.htm

apparently there is an algorithm called "SubGemini" which solves the
recursive netlist walking issue: finding sub-circuit instances in a
larger circuit.

1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini: Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.
Comment 46 Jean-Paul.Chaput 2020-07-30 16:51:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #45)
> http://www.aholme.co.uk/6502/Main.htm
> 
> apparently there is an algorithm called "SubGemini" which solves the
> recursive netlist walking issue: finding sub-circuit instances in a
> larger circuit.
> 
> 1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini:
> Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In
> Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.

  I will look into it if needs be. Thanks for the tips.

  I'm almost done for the P&R of the sub-blocks of test_issuer, and now
  it would be very helpful if you could provide me with a rough floorplan
  of the blocks:

  * fus (maybe an ordering of the FUs, but not mandatory).
  * int
  * fast
  * pdecode2
  * l0

  Other blocks at core level, like the priority pickers are too small
  to be taken into account as "blocks to place separately".

  And maybe some hint about the big busses...

  I'm finishing this because I'm stubborn, but it is already clear at 99%
  that it will give results *much* worse than the "flat" approach:

  As we place each block independently, we create huge contention points
  at the border of most blocks due to the amount of large buses. Then we
  have to route those buses *between* the blocks, forcing us to push
  them farther apart, not even talking about the capacitance/drive problem.
  Moreover, the box of a block cannot stray too far from a square form
  factor if we want the placer to work (that is, an AR between 0.5 and 2.0).
  There are exceptions, but that's the general idea. It would be a problem
  for the clock tree as its depth may vary between blocks of different sizes.
  And lastly, to reduce the size of the channels, we would need a careful
  analysis of where to place the buses (and "combing" the bits to avoid
  "flipping" a whole bus), which is a lengthy task.
   So, if we compare a "flat" block with maybe up to 20% of margin space
  and the sum of blocks at 5% to 20% of free space plus channels, the
  winner is clear. Staf wins again.
   The "good" block level is the core, I think.
Comment 47 Luke Kenneth Casson Leighton 2020-07-30 17:37:11 BST
On Thu, Jul 30, 2020 at 4:51 PM bugzilla-daemon--- via libre-soc-bugs <libre-soc-bugs@lists.libre-riscv.org> wrote:
>
> https://bugs.libre-soc.org/show_bug.cgi?id=199
>
> --- Comment #46 from Jean-Paul.Chaput@lip6.fr ---
> (In reply to Luke Kenneth Casson Leighton from comment #45)
> > http://www.aholme.co.uk/6502/Main.htm
> >
> > apparently there is an algorithm called "SubGemini" which solves the
> > recursive netlist walking issue: finding sub-circuit instances in a
> > larger circuit.
> >
> > 1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini:
> > Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In
> > Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.
>
>   I will look into it if needs be. Thanks for the tips.

i mention it because it is likely an actually *designed* and properly
researched version of that recursive netlist-cell-netlist-cell algorithm
i described and implemented.


>   I'm almost done for the P&R of the sub-blocks of test_issuer, and now
>   it would be very helpful if you could provide me with a rough floorplan
>   of the blocks:
>
>   * fus (maybe an ordering of the FUs, but not mandatory).

all in a line, left-right, all the same "height".

the ordering will need to be worked out, based on how close they are to their respective register files.  at some point this could be determined by a 1-Dimensional algorithm which optimises them however right now that's not a high priority.  once you have committed something i can take a look and experiment.

each of the FUs should definitely be flattened though: alu0, logical0, etc.

* all registers (srcN_i, destN_i, *_ok) should be on SOUTH.
* oper_i_* should be on WEST (or EAST, your choice)

one addition: for ldst0 the port interface (pi) to data bus should be on NORTH.
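
the "1-Dimensional algorithm" for ordering the FUs mentioned above could, as a first rough sketch, be a barycentre sort: each FU is placed according to the mean position of the regfiles it connects to. this is an assumption about what such an algorithm might look like, not something that exists in the tools, and the connectivity table and regfile positions below are made up for illustration.

```python
def order_fus(fu_regfiles, regfile_x):
    """Sort FUs left-to-right by the mean x position of the regfiles they use.

    Every FU is assumed to connect to at least one regfile.
    """
    def barycentre(fu):
        xs = [regfile_x[rf] for rf in fu_regfiles[fu]]
        return float(sum(xs)) / len(xs)
    # The FU name is used as a tie-breaker so the result is deterministic.
    return sorted(fu_regfiles, key=lambda fu: (barycentre(fu), fu))


# Hypothetical data: slot of each regfile along the bottom edge, and
# which regfiles each FU talks to. Not taken from the real design.
regfile_x = {"int": 0, "cr": 1, "xer": 2, "fast": 3, "spr": 4}
fu_regfiles = {
    "alu0": ["int", "cr", "xer"],
    "logical0": ["int", "cr"],
    "spr0": ["spr", "fast", "int"],
    "ldst0": ["int"],
}
print(order_fus(fu_regfiles, regfile_x))
```

a real version would weight each connection by bus width and iterate, but even this crude ordering gives a starting point to experiment with.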

>   * int
>   * fast

each of these flattened (and spr, and xer, and cr as well, they are all regfiles) - i expect all of their inputs and outputs to be on the NORTH side.

>   * pdecode2

flattened.  raw_opcode_in would be on one side (from imem): LOTS of signals go out, and this i know is a problem that needs to be solved - but iteratively.


>   * l0

again flattened: Port Interface (pi) to go on SOUTH (so that LDST can attach to it) and the Wishbone D-Bus on NORTH which will go out of the whole block.

>   Other blocks at core level, like the priority pickers are too small
>   to be taken into account as "blocks to place separately".
>
>   And maybe some hint about the big busses...

a clear space in between the FUs and L0 (top half), and the regfiles below them (bottom half).  priority pickers _should_ end up placed arbitrarily in that same middle space.

pdecode should probably be right in the middle, either at the top or pretty much dead centre, and the i-bus come in at the top middle as well (aka imem).

l0 should definitely be at the top, somewhere along the top edge, with the d-bus coming into it, and its SOUTH port connected directly to the NORTH of the ldst0.
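
the pin sides described in this comment can be written down as a small table and sanity-checked. this is only a sketch: the port labels ("pi", "dbus", "regs", "ports") are illustrative keys, not real net names, and being on opposite sides is a necessary condition for two ports to abut, not a sufficient one (the blocks still have to be stacked the right way round).

```python
OPPOSITE = {"NORTH": "SOUTH", "SOUTH": "NORTH", "EAST": "WEST", "WEST": "EAST"}

# Sides as described in this comment; the (block, port) keys are illustrative.
pin_sides = {
    ("ldst0", "pi"): "NORTH",
    ("l0", "pi"): "SOUTH",
    ("l0", "dbus"): "NORTH",
    ("int", "ports"): "NORTH",
    ("alu0", "regs"): "SOUTH",
    ("alu0", "oper_i"): "WEST",
}


def opposite_sides(a, b):
    """True if ports a and b sit on opposite sides of their blocks."""
    return OPPOSITE[pin_sides[a]] == pin_sides[b]


print(opposite_sides(("l0", "pi"), ("ldst0", "pi")))
print(opposite_sides(("alu0", "regs"), ("int", "ports")))
print(opposite_sides(("alu0", "oper_i"), ("l0", "dbus")))
```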


>   I'm finishing this because I'm stubborn, but it is already clear at 99%
>   that it will give results *much* worse than the "flat" approach:

it does however show clearly the places where routing does not "work"?


>   As we place each block independently, we create huge contention points
>   at the border of most blocks due to the amount of large buses. Then we
>   have to route those buses *between* the blocks, forcing us to push
>   them farther apart, 

yes.  192 wires in some cases.  i have a plan to reduce that to only 32 but it requires quite a bit of work: each FU will have its *own* decoder and receive *only* the 32-bit instruction.


> not even talking about the capacitance/drive problem.

hmmm.


>   Moreover, the box of a block cannot stray too far from a square factor if we
>   want the placer to work (that is an AR between 0.5 and 2.0). 

ah.  this i was expecting - idealistically - to "work" i.e. not be a problem.  the "long" ones (alu0 for example, or spr0), i expected it to be possible to auto-Place them efficiently even if they were long-ratio rectangles.

if this becomes a problem then potentially we can look at merging some of them together, if they have similar enough register profiles.


> There are
>   exceptions, but that's the general idea. It would be a problem for the
>   clock tree as its depth may vary between blocks of different sizes.
>   And lastly, to reduce the size of the channels, we would need a careful
>   analysis of where to place the buses (and "combing" the bits to avoid
>   "flipping" a whole bus), which is a lengthy task.
>    So, if we compare a "flat" block with maybe up to 20% of margin space
>   and the sum of blocks at 5% to 20% of free space plus channels, the
>   winner is clear. Staf wins again.

:)

it is more to be able to point, clearly, "here is the regfile, here is the logical pipeline" etc.

but....

when we add the GPU version of the DIV/RSQRT/SQRT, and add the *MULTIPLE* IEEE754 FPUs, this will make the layout ***TEN*** times larger than it currently is.

at that point any bus space inefficiencies will be absolutely dwarfed by the size of *TWO* partitioned FP64 multiplier blocks and so on.

*one* of the 64-bit DIV/RSQRT/SQRT pipelines takes the size up from 75,000 to *200,000* cells, all on its own!  when you were on holiday i experimented and i managed to get it down to "only" 130,000 cells.


then, when we go to multi-issue it becomes even *more* interesting.  remember for the GPU version (single-core) we are expecting a size of around 300,000 to 400,000 cells.

at that point any hope of iterative development in a reasonable timeframe is out the window.  hence this exploration.
Comment 48 Luke Kenneth Casson Leighton 2020-07-30 17:40:09 BST
btw i must apologise, i had brought out too many signals for the test_issuer.il.  i have uploaded a new version, and you should find that cu_shadown_i and cu_go_die_i are all now "dropped" i.e. are internally set (within alu0 etc.)

some other signals that should not have been brought out are also dropped.
Comment 49 Luke Kenneth Casson Leighton 2020-08-02 21:44:19 BST
argh.

overnight i just realised something, jean-paul.

trying to put the oper_i on the side of each pipeline is completely pointless *unless* decoder2 is in the middle and the pipelines are staggered like stairs:

#### ## ## ## ####
#### ## ## ## ####
#### ##    ## ####
####    dec   ####

so the *really* small pipelines are at the apex, the medium sized ones either side, and the really big ones hard left or hard right (mul will be one of those)

the staggered approach basically gives oper_i_* a chance to come in horizontally into the corner...

... *WITHOUT* needing to do a right angle turn to get there.

if oper_i* has to turn from vertical to horizontal to get into the side of each pipeline then the vertical channel between pipelines has to be as wide as if you had made the pipeline itself that wide.

which is pointless.

a "staggered" step-up step-down "A Frame" layout with decode2 in the middle will do it.

we need some sort of ASCII art diagram, really, don't we, which lays this out.

can you commit what you have so far and i will take a look and add a quick diagram?
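
as a stop-gap, the staggered ordering can be sketched in a few lines of python which also prints a rough ASCII picture. this is an illustration only: the sizes are invented row counts, not cell counts from the design, and the decoder itself is not drawn (it would sit in the gap under the short middle columns).

```python
def stagger(sizes):
    """Order pipelines so the biggest sit at the outer edges and the
    smallest meet in the middle (the apex)."""
    left, right = [], []
    by_size = sorted(sizes, key=lambda n: (-sizes[n], n))
    for i, name in enumerate(by_size):
        # Alternate sides, working inwards from both edges.
        (left if i % 2 == 0 else right).append(name)
    return left + right[::-1]


def render(order, sizes, width=4):
    """Draw top-aligned columns of '#', one column per pipeline."""
    rows = []
    for y in range(max(sizes.values())):
        cells = ["#" * width if y < sizes[n] else " " * width for n in order]
        rows.append(" ".join(cells).rstrip())
    return "\n".join(rows)


# Hypothetical heights in text rows, for illustration only.
sizes = {"mul0": 4, "div0": 4, "alu0": 3, "logical0": 3, "ldst0": 2}
order = stagger(sizes)
print(" ".join(order))
print(render(order, sizes))
```

with these made-up sizes the two tallest columns end up hard left and hard right, and the shortest one in the centre, which leaves the lower middle free for decode and for oper_i_* to run out horizontally.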
Comment 50 Luke Kenneth Casson Leighton 2020-08-02 22:50:00 BST
Created attachment 94 [details]
staggered pipeline quick drawing
Comment 51 Jean-Paul.Chaput 2020-08-03 23:55:47 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)
> Created attachment 94 [details]
> staggered pipeline quick drawing

  I've just committed a basic demonstrator for the recursive block
  management. It is far from perfect and surely will exhibit bugs...
  But I think you can start playing with it. The example is a very
  bad placement (all the FUs in line).

  I did write a very quick GUIDELINES.rst to explain some of the
  most important points in floorplanning.

  You can extract basic gates statistics with the "stats" plugin.
  ( Misc ==> Alpha ==> Statistics ) It will help you see which
  blocks are big and where the gates come from.

  I will now focus on the flat approach and to add the missing
  features that we absolutely need for the TSMC run.

  It may be best to rebuild Coriolis from scratch, as I moved
  the Python cumulus modules around a little.
Comment 52 Luke Kenneth Casson Leighton 2020-08-03 23:59:01 BST
(In reply to Jean-Paul.Chaput from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > Created attachment 94 [details]
> > staggered pipeline quick drawing
> 
>   I've just committed a basic demonstrator for the recursive block
>   management. It is far from perfect and surely will exhibit bugs...
>   But I think you can start playing with it. The example is a very
>   bad placement (all the FUs in line).

that's what i thought would work.

>   I did write a very quick GUIDELINES.rst to explain some of the
>   most important points in floorplanning.

appreciated.

thank you. will try tomorrow.

>   You can extract basic gates statistics with the "stats" plugin.
>   ( Misc ==> Alpha ==> Statistics ) It will help you see which
>   blocks are big and where the gates come from.

ok.
 
>   I will now focus on the flat approach and to add the missing
>   features that we absolutely need for the TSMC run.
> 
>   It may be best to rebuild Coriolis from scratch, as I moved
>   the Python cumulus modules around a little.

yes i saw this:

d6b90a70
by Jean-Paul Chaput at 2020-08-02T18:15:15+02:00
Enable Etesian to compute AB of fixed width.
Comment 53 Luke Kenneth Casson Leighton 2020-08-05 15:01:20 BST
jean-paul i'm running the doDesign.py and adding div0.  it says
"div0 has 12 badly placed pins" but doesn't say what they are.


Python stack trace:
#0 in                  __init__() at .../install/lib/python2.7/dist-packages/crlcore/helpers/io.py:167
#1 in               placeIoPins() at .../dist-packages/cumulus/plugins/alpha/block/block.py:411
#2 in                     build() at .../dist-packages/cumulus/plugins/alpha/block/block.py:501
#3 in                     build() at .../dist-packages/cumulus/plugins/alpha/block/block.py:496
#4 in                scriptMain() at /home/lkcl/soclayout/experiments9/doDesign.py:666
#5 in                  <module>() at .../coriolis-2.x/Linux.x86_64/Release.Shared/install/bin/cgt:201
Comment 54 Luke Kenneth Casson Leighton 2020-08-05 15:02:11 BST
ok yep it's in the debug output.

[ERROR] Side.place(IoPin): Pin "src1_i(58).0" is outside north or south abutment box side.
        (x:2910l, xAB: [0l:2900l])
                          IoPin.place() N/S @<Point 2910l 0l> "src1_i(58).0" of "<id:67204 Net "src1_i(58)" e-- LOGICAL i--- (IN)>".

[ERROR] Side.place(IoPin): Pin "src1_i(59).0" is outside north or south abutment box side.
        (x:2960l, xAB: [0l:2900l])
                          IoPin.place() N/S @<Point 2960l 0l> "src1_i(59).0" of "<id:67203 Net "src1_i(59)" e-- LOGICAL i--- (IN)>".
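
the two errors above are consistent with a row of evenly spaced pins that simply runs off the end of the abutment box. a minimal sketch of that check follows; the first-pin offset (10) and the pitch (50) are guesses chosen to reproduce the two x coordinates in the log, and only xAB = [0:2900] and the positions 2910 and 2960 come from the error messages. the helper name is hypothetical and is not a Coriolis function.

```python
def pins_outside(first_x, pitch, count, ab_xmin, ab_xmax):
    """Return (index, x) for every evenly spaced pin outside [ab_xmin, ab_xmax]."""
    out = []
    for i in range(count):
        x = first_x + i * pitch
        if not ab_xmin <= x <= ab_xmax:
            out.append((i, x))
    return out


# Assumed spacing (see above); 60 pins would cover src1_i(0) to src1_i(59).
print(pins_outside(first_x=10, pitch=50, count=60, ab_xmin=0, ab_xmax=2900))
# → [(58, 2910), (59, 2960)]
```

under those assumptions the side would need to be at least 2960 wide (or the pitch reduced) for all 60 pins to fit.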
Comment 55 Luke Kenneth Casson Leighton 2020-08-05 15:19:20 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23
Comment 56 Luke Kenneth Casson Leighton 2020-08-05 15:24:15 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/24
Comment 57 Jean-Paul.Chaput 2020-08-07 12:43:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #55)
> https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23

I've just pushed various commits that *may* solve the problem.
I'm not totally sure as there are still off-side pins that prevent
me from going all the way.

I took the liberty to beautify a little doDesign.py, hopefully respecting
your Python style. I know that for IO pin specification, the list is not
PEP8 compliant, but I think that a "spreadsheet" presentation with clearly
aligned columns is better to immediately spot missing parameters or errors.

I also removed the utils module, as now Coriolis should supply equivalent
features.
Comment 58 Luke Kenneth Casson Leighton 2020-08-08 11:24:21 BST
(In reply to Jean-Paul.Chaput from comment #57)
> (In reply to Luke Kenneth Casson Leighton from comment #55)
> > https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/23
> 
> I've just pushed various commits that *may* solve the problem.
> I'm not totally sure as there are still off-side pins that prevent
> me from going all the way.
> 
> I took the liberty to beautify a little doDesign.py, hopefully respecting
> your Python style. I know that for IO pin specification, the list is not
> PEP8 compliant, but I think that a "spreadsheet" presentation with clearly
> aligned columns is better to immediately spot missing parameters or errors.

what i would like, there, is wildcard matching e.g. starts with "oper_i" or ends with "_ok" and the pincount to be obtained from the object.

this will reduce 30-40 lines per block down to *five* and at the same time greatly increase clarity.
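
a rough idea of what is meant, using the standard fnmatch module. this is a sketch only: assign_sides is a hypothetical helper (nothing like it is claimed to exist in Coriolis), and the net names and widths are invented; in a real version they would be read from the cell itself.

```python
from fnmatch import fnmatchcase


def assign_sides(nets, rules, default=None):
    """Map each net name to (side, width) using the first matching pattern.

    nets  : dict of net name -> bit width (would come from the object)
    rules : list of (shell-style pattern, side), checked in order
    """
    result = {}
    for name in sorted(nets):
        side = default
        for pattern, candidate in rules:
            if fnmatchcase(name, pattern):
                side = candidate
                break
        result[name] = (side, nets[name])
    return result


# Five rules instead of one line per pin.
rules = [
    ("oper_i*", "WEST"),
    ("*_ok", "SOUTH"),
    ("src*_i", "SOUTH"),
    ("dest*_i", "SOUTH"),
    ("clk", "NORTH"),
]
# Invented nets and widths, for illustration only.
nets = {"oper_i_x": 7, "src1_i": 64, "src2_i": 64, "dest1_i": 64,
        "dest1_ok": 1, "clk": 1, "rst": 1}
for name, (side, width) in sorted(assign_sides(nets, rules).items()):
    print(name, side, width)
```

anything that matches no rule (rst here) comes back with side None, so missing entries are easy to spot.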


> 
> I also removed the utils module, as now Coriolis should supply equivalent
> features.

ah excellent, glad you liked it.

Config is neat, ehn? :)


the number of oper_* pins is far too large.  this is the "expansion" of the instruction for convenience.  an example is the 64 bit immediate.

basically i am going to have to do "subset instruction decoders" that are *inside* mul0, *inside* alu0 and so on.

i have to find time to do that.

until then i think we stop with the floorplan version and go with "flat".
Comment 59 Luke Kenneth Casson Leighton 2020-08-11 15:52:54 BST
https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/25

"Katana BUG" after updating to a different setup for INT and FAST
regfiles.  the read/write Bus is now done using a MUX followed by
OR tree.
Comment 60 Jean-Paul.Chaput 2020-08-11 23:02:40 BST
(In reply to Luke Kenneth Casson Leighton from comment #59)
> https://gitlab.lip6.fr/vlsi-eda/coriolis/-/issues/25
> 
> "Katana BUG" after updating to a different setup for INT and FAST
> regfiles.  the read/write Bus is now done using a MUX followed by
> OR tree.

  The problem was due to an incorrect use of the CfgCache object,
  so the specific parameters were not taken into account.
  I pushed the correction in commit ee3bd54fdf0d788c8227380daa6afd8f787e7074
  
  Basically, the priority level was not set (default is not high
  enough to override) and anyway they were not applied.
  To apply the parameters, either explicitly call cfg.apply() or
  use the "with" construction.

  So, you did get a layout *without* the placer trying to evenly
  spread the free space, so it mostly ended up in the top right
  corner.

  As for the "looping" bug, it's a misnomer (my bad). This is not
  a bug in the normal sense. It means that the router is repeatedly
  ripping up one segment so that the algorithm has reached a dead end.
  And I have to analyse the trace to see how to avoid getting in
  that state. 

  Now it's almost ok (fewer than 20 failed segments).
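
  The "call apply() explicitly or use with" behaviour described above is
  the usual deferred-apply pattern. The sketch below is a toy stand-in
  written only to show that pattern: PendingConfig is a hypothetical
  class, not the real CfgCache, and the parameter name is made up.

```python
class PendingConfig(object):
    """Toy stand-in: caches settings and writes them to `target` only on apply()."""

    def __init__(self, target):
        self._target = target
        self._pending = {}

    def set(self, key, value):
        # Nothing reaches the target yet; the value is only remembered.
        self._pending[key] = value

    def apply(self):
        self._target.update(self._pending)
        self._pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Apply automatically when the "with" block ends without an error.
        if exc_type is None:
            self.apply()
        return False


live = {}
cfg = PendingConfig(live)
cfg.set("example.spreadFreeSpace", True)   # made-up parameter name
print(live)                                # still empty: apply() was never called

with PendingConfig(live) as cfg:
    cfg.set("example.spreadFreeSpace", True)
print(live)                                # now set, applied on leaving the block
```

  Forgetting the apply step in such a scheme fails silently, which matches
  the symptom here: the run proceeds, but with default parameters.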
Comment 61 Jean-Paul.Chaput 2020-08-11 23:25:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #58)
> (In reply to Jean-Paul.Chaput from comment #57)
> > (In reply to Luke Kenneth Casson Leighton from comment #55)

> what i would like, there, is wildcard matching e.g. starts with "oper_i" or
> ends with "_ok" and the pincount to be obtained from the object.
> 
> this will reduce 30-40 lines per block down to *five* and at the same time
> greatly increase clarity.

  I will keep that in mind and try an implementation when I find time.


> > I also removed the utils module, as now Coriolis should supply equivalent
> > features.
> 
> ah excellent, glad you liked it.
> 
> Config is neat, ehn? :)

  Yes. I did upgrade my Python knowledge here... Always the dilemma
  of whether to take the time to learn properly or to develop using
  what I already master. I suppose I'm still missing lots of features.


> the number of oper_* pins is far too large.  this is the "expansion" of the
> instruction for convenience.  an example is the 64 bit immediate.
> 
> basically i am going to have to do "subset instruction decoders" that are
> *inside* mul0, *inside* alu0 and so on.
> 
> i have to find time to do that.

  For a floorplanned approach, reducing the number of wires between
  blocks is very important. But it is almost an art...