329 – coriolis2 experiment layout for Dependency Matrices

Bug 329 - coriolis2 experiment layout for Dependency Matrices

Summary: coriolis2 experiment layout for Dependency Matrices

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Hardware Layout (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:

Depends on:
Blocks:

Reported:	2020-05-20 13:18 BST by Luke Kenneth Casson Leighton
Modified:	2020-06-06 13:25 BST (History)
CC List:	2 users (show)

See Also:	217
NLnet milestone:	NLNet.2019.02.029.Coriolis2
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:	203
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
quick preliminary test ilang from soc/scoreboard/test_mem_fu_matrix.py (341.91 KB, text/plain) 2020-05-20 13:18 BST, Luke Kenneth Casson Leighton	Details
fu-fu matrix screenshot (594.89 KB, image/png) 2020-05-20 13:33 BST, Luke Kenneth Casson Leighton	Details
fu-regs matrix screenshot (367.58 KB, image/png) 2020-05-20 13:33 BST, Luke Kenneth Casson Leighton	Details
combination of two matrices showing how they are used (321.41 KB, image/png) 2020-05-20 13:34 BST, Luke Kenneth Casson Leighton	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2020-05-20 13:18:35 BST

Created attachment 57 [details]
quick preliminary test ilang from soc/scoreboard/test_mem_fu_matrix.py

Created attachment 57 [details]
quick preliminary test ilang from soc/scoreboard/test_mem_fu_matrix.py

we need a quick example to demonstrate how the Dependency Matrices would
look so that coriolis2 can do the layout in a regular fashion.  this
even though they may convert from DFF to SRNAND, so the layout needs to
be dynamic, rather than explicit naming of cells.

also, some of the cells may be augmented by additional "setting" logic
that needs to go into every row (not yet written), so any explicit
manually-created layout that hunts for explicit cell names is going to
have to be redone, and not only redone, duplicated manual work several
times.

this is why the idea exists to identify sub-groups of cells by way of
netlist path analysis, and have them placed by Etesian in regular grid
fashion.  see https://bugs.libre-soc.org/show_bug.cgi?id=217#c86

Comment 1 Luke Kenneth Casson Leighton 2020-05-20 13:30:40 BST

jean-paul i have added this example to soclayout as experiment8

https://git.libre-soc.org/?p=soclayout.git;a=commitdiff;h=1866a4346044f65a23a0905d8e6ec9883a5f8070

this is a typical basic example: 

there are going to be approximately... 6 "pairs" of these matrices.
they are "pairs" because there are two types:

* FU-Regs - stands for Function Unit to Registers Matrix
* FU-FU   - stands for ... well the obvious

variants include:

* different numbers of "GO_READ" *and* "GO_WRITE" signals covering
  and protecting not only different register file ports but protecting
  entirely different register files altogether 
* Shadow Matrix
* Memory-Address-Conflict Matrix
* augmentation on the Registers to set multiple entries simultaneously

the last thing we need is to go full manual layout on these, with so
many variants.

Comment 2 Luke Kenneth Casson Leighton 2020-05-20 13:33:22 BST

Created attachment 58 [details]
fu-fu matrix screenshot

Comment 3 Luke Kenneth Casson Leighton 2020-05-20 13:33:51 BST

Created attachment 59 [details]
fu-regs matrix screenshot

Comment 4 Luke Kenneth Casson Leighton 2020-05-20 13:34:24 BST

Created attachment 60 [details]
combination of two matrices showing how they are used

Comment 5 Luke Kenneth Casson Leighton 2020-05-20 13:41:38 BST

i just also added a second one to experiment8 as an illustration, it is an FU-Regs Matrix with the following parameters:

* Number of Regs:                          4 (to keep it small as a demo)
* Number of Function Units:                3 (likewise)
* Number of Regfile Read ports protected:  2
* Number of Regfile Write ports protected: 2

these numbers can and will vary.  we have *four* Register Files to
protect with these Matrices.  five when we have FP support.

Comment 6 Jean-Paul Chaput 2020-05-22 12:13:06 BST

(In reply to Luke Kenneth Casson Leighton from comment #5)
> i just also added a second one to experiment8 as an illustration, it is an
> FU-Regs Matrix with the following parameters:
> 
> * Number of Regs:                          4 (to keep it small as a demo)
> * Number of Function Units:                3 (likewise)
> * Number of Regfile Read ports protected:  2
> * Number of Regfile Write ports protected: 2
> 
> these numbers can and will vary.  we have *four* Register Files to
> protect with these Matrices.  five when we have FP support.

Hello Luke,

I'm looking closely to the examples but I still have problem understanding the matrix
nature of the design. As a first I would like to concentrate on the FU-FU matrix, which,
if I understand well, manage the read/write dependencies between FUs and generate the
Go_Read & Go_Write signals towards the CU (and some other signals).

There is two "level" of matrixes:

1. The architectural level (that is close to what you do) and the one I cannot clearly
   guess. With 3 FU, is there a 3x3 matrix or 3 FU blocks only, and in the later case,
   it may not be a matrix but just a row or a column.
     But maybe I make confusion between FU and the dependency matrix of the FUs.

2. The layout (cell level) into which the cells of *one* FU (or whatever sub-block)
   are also arrayed in a matrix. As we may not put all the cells of a block in just
   one row.

So would it be possible to send examples where one block of the matrix (in the
sense of 1.) is clearly identified (best would be that it is put in a sub-block) ?

Or, if it is really inconvenient due to the way the design is described at nMigne level,
list me what I/O signals (which bit of vectors) are specific to one element of the
matrix ?

A slow learner.

PS: I still not completely understand the color coding of the scoreboard schematic.
    Is the size and position of the little blue/yellow/green squares inside the red one
    significant?

Comment 7 Luke Kenneth Casson Leighton 2020-05-22 14:35:08 BST

(In reply to Jean-Paul.Chaput from comment #6)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > i just also added a second one to experiment8 as an illustration, it is an
> > FU-Regs Matrix with the following parameters:
> > 
> > * Number of Regs:                          4 (to keep it small as a demo)
> > * Number of Function Units:                3 (likewise)
> > * Number of Regfile Read ports protected:  2
> > * Number of Regfile Write ports protected: 2
> > 
> > these numbers can and will vary.  we have *four* Register Files to
> > protect with these Matrices.  five when we have FP support.
> 
> Hello Luke,
> 
> I'm looking closely to the examples but I still have problem understanding
> the matrix
> nature of the design. As a first I would like to concentrate on the FU-FU
> matrix, which,
> if I understand well, manage the read/write dependencies between FUs and
> generate the
> Go_Read & Go_Write signals towards the CU (and some other signals).

that's a pretty good understanding.  it however covers a bit more scope
than is needed here.  this needs understanding of the relationship 
between readable_o and go_read_o (and writable_o and go_write_o).

first: all of those are bitvectors.  every bitvector element in *all*
of *_i and *_o refers to the same FU number.

* go_rd1[0] is for FU 0
* issue_i[0] is for FU 0
* wr_pend_i[0] is for FU 0
* readable_o[0] is ... for FU...0

etc.


the FU-FU Matrix generates the *desire* (the readiness) to be readABLE
and also writABLE.  many of these could potentially be raised simultaneously,
however we have limited numbers of Regfile read ports and write ports.

we therefore need to pick one, because Regfile ports are a limited (broadcast,
global bus) resource.

thus, these signals go through a "Priority Picker" to select *which one*
of the readable_o is to be sent to the Register File as a GO_READ, and
also to notify *one* Function Unit that the data being broadcast from
the Register File Read Ports is *specifically* for that Function Unit and
that Function Unit only.

likewise writable_o goes through a *separate* Priority Picker.

you can see that process in this diagram here:

https://libre-soc.org/3d_gpu/group_pick_rd_rel.jpg

so for the purposes of this experiment we focus on readable_o and writeable_o
only: go_read *OUT* and go_write *OUT* are out of scope.

however... and here is where it gets to be fun:

those go_read and go_write signals as **INPUTS** to the FU-FU Matrix tell
us something very important:

* they tell us that, right now, on this cycle, one FU (which is some distance
  away from the FU-FU Matrix) *IS* receiving its Read Regists

* that therefore, the FU-FU Cell's job, which is to protect and preserve
  the Directed Acyclic Graph of read-write hazards, is NO LONGER required
  to protect that Read Dependency

* that therefore, that FU-FU cell may CANCEL the Read Dependency by pulling
  the Latch low.

a similar process occurs for go_write, obviously.  in addition, note
that go_die_i also fires *all* go_writes and *all* go_reads, because
it is a cancel / reset signal for that particular bitvector-numbered FU.


therefore,

* readable_o bitvector comes out, goes through PriorityPicker (out of scope of
  this layout), go_read_o bitvector comes out of that

* go_read_i bitvector (same NETLIST as go_read_o) comes back *IN* to the FU-FU
  Matrix.


note for layout:

* go_read_i bitvector wires should definitely be laid out horizontally.
* likewise go_write_i and go_die_i bitvectors.  these on the LEFT.
* there are *multiple* go_read_i bitvectors - go_rd1_i, go_rd2_i, go_rd3_i...
* because i used bitvectors (including in the SR Latches), the Matrix is
  effectively split into a 1D array of 1D bitvector "managers".  these
  are named dm0, dm1.. etc.
* dm0, dm1, dm2 .... etc. need to be laid out *vertically* in order to
  accept the horizontal input from go_read_i and go_write_i
* issue_i, rd_pend_i and wr_pend_i are also best done i believe as
  horizontal signals.

so that is inputs.

regarding outputs:

* wr_wait_o and rd_wait_o on the other hand: these are what go into those
  GreatBigORGates (fur_x1/2/3/4) and i *believe*, because dm0..3 need to
  be laid out vertically, the readable_o and writable_o bitvectors need
  to come out at the BOTTOM.

* on the other hand, if instead fur_x1/2/3/4 are done as very very skinny
  cells, practically empty, using available free space inside dm0/1/2/3,
  readable_o and writable_o could just as easily come out at the RIGHT.

either way does not matter for outputs: the inputs do matter though.

summary:

* anything as input (*_i) comes in on the LEFT and is wired HORIZONTAL
* anything as output (*_o) should go out on the BOTTOM and be wired VERTICAL
  however it may work if it is RIGHT and HORIZONTAL.  this is up to you.


> There is two "level" of matrixes:
> 
> 1. The architectural level (that is close to what you do) and the one I
> cannot clearly
>    guess. With 3 FU, is there a 3x3 matrix or 3 FU blocks only, and in the
> later case,
>    it may not be a matrix but just a row or a column.
>      But maybe I make confusion between FU and the dependency matrix of the
> FUs.

yes.  FUs are located. nowhere near the FU-FU matrix.  an FU may be a FMAC
for example, which will be... 20,000 gates, and there will be... 4 of those.

the FU-FU Matrix is tiny by comparison, and centralised, and you absolutely
do not want the two to be amalgamated.

* FU does the job of *computing* results based on operands

* FU-FU Matrix does the job of preserving the Directed Acyclic Graph (DAG) of
  relationships *BETWEEN* FUs.


> 2. The layout (cell level) into which the cells of *one* FU (or whatever
> sub-block)
>    are also arrayed in a matrix. As we may not put all the cells of a block
> in just
>    one row.
> 
> So would it be possible to send examples where one block of the matrix (in
> the
> sense of 1.) is clearly identified (best would be that it is put in a
> sub-block) ?

yaa of course.  (it is already in a sub-block).  added to experiment8,
test_fu_fu_matrix.il.  which you can re-generated to any size you wish
to using this:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD

and altering the n_fu_cols and n_fu_rows (make sure they are the same).


> 
> Or, if it is really inconvenient due to the way the design is described at
> nMigne level,
> list me what I/O signals (which bit of vectors) are specific to one element
> of the
> matrix ?

ok.

so because of the recursive nature of a Directed Acyclic Graph, what you
ask in both (1) *and* (2) is, strictly speaking, not possible.

i may have this the wrong way round, please forgive me for that:

* a row in the FU-FU Matrix expresses that one FU has a *BLOCK* on other FUs
  (and each element in that row says which one)

  this would be "outputs" on a node in the Directed Acyclic Graph of
  read-write dependencies

* a column in the FU-FU expresses that one FU is *BLOCKED BY* another FU
  (and each element in that column says which one)

  this would be "inputs" on a node in the Directed Acyclic Graph of
  read-write dependencies

therefore, technically, it is not possible to "divide" them from each other,
from an inter-relationship perspective.

however... for convenience, what i have had to do is to *actually* divide
them into a 1D array of 1D arrays.  this because of the simulation speed
of nmigen.

a previous version, i had *actual* 2D cells.  each SR Latch was a single
bit (not a bitvector).  this was so horribly slow i could not tolerate it.


> A slow learner.

this is a complex topic.

> PS: I still not completely understand the color coding of the scoreboard
> schematic.
>     Is the size and position of the little blue/yellow/green squares inside
> the red one
>     significant?

bear in mind this is for the FU-Regs Matrix image (see attachments, here,
or see p23)

scoreboard_mechanics.1.pdf - section 10.5 p23 "A scoreboard using Dep Matrix"

In this diagram

* a red box denotes that this entry can read or write those registers.
* A blue box denotes that it is possible for an instruction in that
  Unit to write to this register.
* A yellow box indicates that an instruction in this Unit can read
  this register.
* Finally, a green box indicates that this Unit can either read
  this register for normal activities, or write to this register if a store
  is being performed.

For each of the boxes a clocked set-reset flip-flop is used to gate state changes into the table so that subsequent instructions will see the state of the current register data-flow dependencies.

Comment 8 Luke Kenneth Casson Leighton 2020-05-22 14:53:57 BST

after a little thought, see the diagram in attachment "fu-fu matrix"
https://bugs.libre-soc.org/attachment.cgi?id=58

notice there is "Group1", "Group2", "Group3" at the bottom?

these are the Priority Pickers.

there are separate Register Files (and there are 5R2W ports in the original
CDC 6600)

therefore, in the original CDC 6600, they had 3x banks of Priority Pickers.

if we follow this arrangement, it means that readable_o and writable_o
definitely come out at the BOTTOM of the FU-FU Matrix.

Comment 9 Jean-Paul Chaput 2020-06-03 23:00:49 BST

(In reply to Luke Kenneth Casson Leighton from comment #8)
> after a little thought, see the diagram in attachment "fu-fu matrix"
> https://bugs.libre-soc.org/attachment.cgi?id=58

Hello Luke,

I'm starting to have automated matrix placement working for the FU-FU
matrix. But the first comparison I can make between matrix placement
and automated placement still got the later winning (by a little).

I have two little remarks about the FU-FU example you gave me:

1. It seems to me that the current example has 4 unit and 3 regs
   (inverted in comment #5).

2. Nothing is connected to gowr3_i, seems strange to me.

I did try to generate a bigger FU-FU, but got stuck here:

Traceback (most recent call last):
  File "fu_fu_matrix.py", line 169, in <module>
    test_fu_fu_matrix()
  File "fu_fu_matrix.py", line 166, in test_fu_fu_matrix
    run_simulation(dut, d_matrix_sim(dut), vcd_name='test_fu_fu_matrix.vcd')
  File "/usr/lib/python3.6/site-packages/nmigen/compat/sim/__init__.py", line 22, in run_simulation
    fragment.domains += ClockDomain("sync")
AttributeError: 'FUFUDepMatrix' object has no attribute 'domains'

I should have the latest nmigen, soc & nmutil repositry versions.

Would it be possible for you to generate a FU-FU matrix as close as possible
to the one intended in terms of number of FU and registers (the idea is to
have a representative amount of wires and gates).

Best regards,

PS1: I use the approach of "not getting to the bottom of things but start like
     a copying monkey", but always still felt a little bit like cheating. Now, I will
     do it proudly...

PS2: Maybe you have seen this... (personally, I'm at 110 characters)
     https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col

Comment 10 Luke Kenneth Casson Leighton 2020-06-03 23:18:28 BST

(In reply to Jean-Paul.Chaput from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > after a little thought, see the diagram in attachment "fu-fu matrix"
> > https://bugs.libre-soc.org/attachment.cgi?id=58
> 
> Hello Luke,
> 
> I'm starting to have automated matrix placement working for the FU-FU
> matrix.

superb.  do commit it, we can take a look.

> But the first comparison I can make between matrix placement
> and automated placement still got the later winning (by a little).

that's veery unlikely to be the case at 32x12, 32x25 or, when we go
to GPU DMs, 48x30 or even 128x30.   (this is FU-REGs, we are currently
investigating FU-FU, which will still potentially be be up to 30x30)


> I have two little remarks about the FU-FU example you gave me:
> 
> 1. It seems to me that the current example has 4 unit and 3 regs
>    (inverted in comment #5).

probably :)  actually for FU-FU it should be a square matrix.

> 2. Nothing is connected to gowr3_i, seems strange to me.

because if it's not a square matrix, that's what happens.

> I did try to generate a bigger FU-FU, but got stuck here:
> 
> Traceback (most recent call last):
>   File "fu_fu_matrix.py", line 169, in <module>
>     test_fu_fu_matrix()
>   File "fu_fu_matrix.py", line 166, in test_fu_fu_matrix
>     run_simulation(dut, d_matrix_sim(dut), vcd_name='test_fu_fu_matrix.vcd')
>   File "/usr/lib/python3.6/site-packages/nmigen/compat/sim/__init__.py",
> line 22, in run_simulation
>     fragment.domains += ClockDomain("sync")
> AttributeError: 'FUFUDepMatrix' object has no attribute 'domains'
> 
> I should have the latest nmigen, soc & nmutil repositry versions.

you need the version nmigen from whitequark, not from m-labs.
check the repo url in .git/config, it should be https://github.com/nmigen/nmigen/

and you can ignore that, anyway - the ilang file-generation should
have happened before that.

 
> Would it be possible for you to generate a FU-FU matrix as close as possible
> to the one intended in terms of number of FU and registers (the idea is to
> have a representative amount of wires and gates).

let's make it 30x30 - that's pretty damn big.  for 180nm we'll likely go
with 1/2 that size.

> Best regards,
> 
> PS1: I use the approach of "not getting to the bottom of things but start
> like
>      a copying monkey", but always still felt a little bit like cheating.
> Now, I will
>      do it proudly...

haha very funny :)

> PS2: Maybe you have seen this... (personally, I'm at 110 characters)
>     
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-
> 80-Col

sigh there's some severe problems with internet connectivity (much more
sophisticated than a DDOS attack), i still cannot get access to the
majority of news websites.  slashdot, theregister, phoronix, zoom, they're
all timing out (!)

if you can see this page https://slashdot.org/~lkcl then you'll see my
comments on this one.  summary: it's a serious mistake.  this screenshot
shows why: https://libre-soc.org/HDL_workflow/640x-2020-01-24_11-56.png

Comment 11 Jean-Paul Chaput 2020-06-06 11:41:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jean-Paul.Chaput from comment #9)
> > (In reply to Luke Kenneth Casson Leighton from comment #8)

Hello Luke,

I posted in the soclayout & Coriolis repository a first trial at getting
back matrix like structures. It is *only* tailored for the FU-FU matrix
30x30 (hardcoded in the Coriolis plugin).

I added a Makefile to allow you to run it conveniently.

* Generate & flatten the netlist in vst:
  > make vstf

* You call the plugin in graphical mode in:
  > make viewf
  Then, in the menus:
    Misc -> Beta -> Matrix Placer

It is slow (~5 minutes) and there is no algorithmic reasons to,
almost everything is linear or log(n).

The important conclusion I draw with this preliminary work, is
that the automatic placer is doing better that the matrix placer.
You can see the results in RESULTS.txt.

Why is that so?

* The automatic placer is good, in graphic mode you can see him finding
  back the matrix structure.

* The matrix placer lose space as each block of the matrix as a ragged
  right edge where you systematically lose space. Of course it can be
  improved, but in the end you may just reach the 5% of marging of
  the automated placer.

* The conclusion about total wire length is less clear, seems close to
  each other. But with uneven density for the matrix placer, so a
  little bit more difficult to route.

* So we may spend a lot of time for almost no significant improvement.

So, it seems to me that Staf wins... The approach we should take is feed
it directly amost as a whole to the P&R (except for RAM). If the tools
cannot cope, we may broke it in 50Kgates blocks.

* The matrix placement may still win, but with a different approach.
  Build a manual generator from the groud up. Do not use synthesis
  then try to guess back the placement. But it is a lot of work
  with all the various cases you want to manage.

Best regards,

Comment 12 Luke Kenneth Casson Leighton 2020-06-06 12:20:42 BST

hiya jean-paul, thank you so much for this detailed and dedicated work.
i will come back to an answer things when i can see the layout.
i am walking through replicating the results, and encounter this after
"make vstf" - any clues?

r test_fu_fu_matrix test_fu_fu_matrix_flat
make: r: Command not found

Comment 13 Luke Kenneth Casson Leighton 2020-06-06 12:23:25 BST

$(FLATLO) does not appear to be defined.  looking for it now.

Comment 14 Jean-Paul Chaput 2020-06-06 13:23:56 BST

(In reply to Luke Kenneth Casson Leighton from comment #13)
> $(FLATLO) does not appear to be defined.  looking for it now.

Sorry, I just did a commit into alliance-check-toolkit to
correct it.

Comment 15 Luke Kenneth Casson Leighton 2020-06-06 13:25:59 BST

(In reply to Jean-Paul.Chaput from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > $(FLATLO) does not appear to be defined.  looking for it now.
> 
> Sorry, I just did a commit into alliance-check-toolkit to
> correct it.

ah thanks :)