Created attachment 57 [details] quick preliminary test ilang from soc/scoreboard/test_mem_fu_matrix.py Created attachment 57 [details] quick preliminary test ilang from soc/scoreboard/test_mem_fu_matrix.py we need a quick example to demonstrate how the Dependency Matrices would look so that coriolis2 can do the layout in a regular fashion. this even though they may convert from DFF to SRNAND, so the layout needs to be dynamic, rather than explicit naming of cells. also, some of the cells may be augmented by additional "setting" logic that needs to go into every row (not yet written), so any explicit manually-created layout that hunts for explicit cell names is going to have to be redone, and not only redone, duplicated manual work several times. this is why the idea exists to identify sub-groups of cells by way of netlist path analysis, and have them placed by Etesian in regular grid fashion. see https://bugs.libre-soc.org/show_bug.cgi?id=217#c86
jean-paul i have added this example to soclayout as experiment8 https://git.libre-soc.org/?p=soclayout.git;a=commitdiff;h=1866a4346044f65a23a0905d8e6ec9883a5f8070 this is a typical basic example: there are going to be approximately... 6 "pairs" of these matrices. they are "pairs" because there are two types: * FU-Regs - stands for Function Unit to Registers Matrix * FU-FU - stands for ... well the obvious variants include: * different numbers of "GO_READ" *and* "GO_WRITE" signals covering and protecting not only different register file ports but protecting entirely different register files altogether * Shadow Matrix * Memory-Address-Conflict Matrix * augmentation on the Registers to set multiple entries simultaneously the last thing we need is to go full manual layout on these, with so many variants.
Created attachment 58 [details] fu-fu matrix screenshot
Created attachment 59 [details] fu-regs matrix screenshot
Created attachment 60 [details] combination of two matrices showing how they are used
i just also added a second one to experiment8 as an illustration, it is an FU-Regs Matrix with the following parameters: * Number of Regs: 4 (to keep it small as a demo) * Number of Function Units: 3 (likewise) * Number of Regfile Read ports protected: 2 * Number of Regfile Write ports protected: 2 these numbers can and will vary. we have *four* Register Files to protect with these Matrices. five when we have FP support.
(In reply to Luke Kenneth Casson Leighton from comment #5) > i just also added a second one to experiment8 as an illustration, it is an > FU-Regs Matrix with the following parameters: > > * Number of Regs: 4 (to keep it small as a demo) > * Number of Function Units: 3 (likewise) > * Number of Regfile Read ports protected: 2 > * Number of Regfile Write ports protected: 2 > > these numbers can and will vary. we have *four* Register Files to > protect with these Matrices. five when we have FP support. Hello Luke, I'm looking closely to the examples but I still have problem understanding the matrix nature of the design. As a first I would like to concentrate on the FU-FU matrix, which, if I understand well, manage the read/write dependencies between FUs and generate the Go_Read & Go_Write signals towards the CU (and some other signals). There is two "level" of matrixes: 1. The architectural level (that is close to what you do) and the one I cannot clearly guess. With 3 FU, is there a 3x3 matrix or 3 FU blocks only, and in the later case, it may not be a matrix but just a row or a column. But maybe I make confusion between FU and the dependency matrix of the FUs. 2. The layout (cell level) into which the cells of *one* FU (or whatever sub-block) are also arrayed in a matrix. As we may not put all the cells of a block in just one row. So would it be possible to send examples where one block of the matrix (in the sense of 1.) is clearly identified (best would be that it is put in a sub-block) ? Or, if it is really inconvenient due to the way the design is described at nMigne level, list me what I/O signals (which bit of vectors) are specific to one element of the matrix ? A slow learner. PS: I still not completely understand the color coding of the scoreboard schematic. Is the size and position of the little blue/yellow/green squares inside the red one significant?
(In reply to Jean-Paul.Chaput from comment #6) > (In reply to Luke Kenneth Casson Leighton from comment #5) > > i just also added a second one to experiment8 as an illustration, it is an > > FU-Regs Matrix with the following parameters: > > > > * Number of Regs: 4 (to keep it small as a demo) > > * Number of Function Units: 3 (likewise) > > * Number of Regfile Read ports protected: 2 > > * Number of Regfile Write ports protected: 2 > > > > these numbers can and will vary. we have *four* Register Files to > > protect with these Matrices. five when we have FP support. > > Hello Luke, > > I'm looking closely to the examples but I still have problem understanding > the matrix > nature of the design. As a first I would like to concentrate on the FU-FU > matrix, which, > if I understand well, manage the read/write dependencies between FUs and > generate the > Go_Read & Go_Write signals towards the CU (and some other signals). that's a pretty good understanding. it however covers a bit more scope than is needed here. this needs understanding of the relationship between readable_o and go_read_o (and writable_o and go_write_o). first: all of those are bitvectors. every bitvector element in *all* of *_i and *_o refers to the same FU number. * go_rd1[0] is for FU 0 * issue_i[0] is for FU 0 * wr_pend_i[0] is for FU 0 * readable_o[0] is ... for FU...0 etc. the FU-FU Matrix generates the *desire* (the readiness) to be readABLE and also writABLE. many of these could potentially be raised simultaneously, however we have limited numbers of Regfile read ports and write ports. we therefore need to pick one, because Regfile ports are a limited (broadcast, global bus) resource. thus, these signals go through a "Priority Picker" to select *which one* of the readable_o is to be sent to the Register File as a GO_READ, and also to notify *one* Function Unit that the data being broadcast from the Register File Read Ports is *specifically* for that Function Unit and that Function Unit only. likewise writable_o goes through a *separate* Priority Picker. you can see that process in this diagram here: https://libre-soc.org/3d_gpu/group_pick_rd_rel.jpg so for the purposes of this experiment we focus on readable_o and writeable_o only: go_read *OUT* and go_write *OUT* are out of scope. however... and here is where it gets to be fun: those go_read and go_write signals as **INPUTS** to the FU-FU Matrix tell us something very important: * they tell us that, right now, on this cycle, one FU (which is some distance away from the FU-FU Matrix) *IS* receiving its Read Regists * that therefore, the FU-FU Cell's job, which is to protect and preserve the Directed Acyclic Graph of read-write hazards, is NO LONGER required to protect that Read Dependency * that therefore, that FU-FU cell may CANCEL the Read Dependency by pulling the Latch low. a similar process occurs for go_write, obviously. in addition, note that go_die_i also fires *all* go_writes and *all* go_reads, because it is a cancel / reset signal for that particular bitvector-numbered FU. therefore, * readable_o bitvector comes out, goes through PriorityPicker (out of scope of this layout), go_read_o bitvector comes out of that * go_read_i bitvector (same NETLIST as go_read_o) comes back *IN* to the FU-FU Matrix. note for layout: * go_read_i bitvector wires should definitely be laid out horizontally. * likewise go_write_i and go_die_i bitvectors. these on the LEFT. * there are *multiple* go_read_i bitvectors - go_rd1_i, go_rd2_i, go_rd3_i... * because i used bitvectors (including in the SR Latches), the Matrix is effectively split into a 1D array of 1D bitvector "managers". these are named dm0, dm1.. etc. * dm0, dm1, dm2 .... etc. need to be laid out *vertically* in order to accept the horizontal input from go_read_i and go_write_i * issue_i, rd_pend_i and wr_pend_i are also best done i believe as horizontal signals. so that is inputs. regarding outputs: * wr_wait_o and rd_wait_o on the other hand: these are what go into those GreatBigORGates (fur_x1/2/3/4) and i *believe*, because dm0..3 need to be laid out vertically, the readable_o and writable_o bitvectors need to come out at the BOTTOM. * on the other hand, if instead fur_x1/2/3/4 are done as very very skinny cells, practically empty, using available free space inside dm0/1/2/3, readable_o and writable_o could just as easily come out at the RIGHT. either way does not matter for outputs: the inputs do matter though. summary: * anything as input (*_i) comes in on the LEFT and is wired HORIZONTAL * anything as output (*_o) should go out on the BOTTOM and be wired VERTICAL however it may work if it is RIGHT and HORIZONTAL. this is up to you. > There is two "level" of matrixes: > > 1. The architectural level (that is close to what you do) and the one I > cannot clearly > guess. With 3 FU, is there a 3x3 matrix or 3 FU blocks only, and in the > later case, > it may not be a matrix but just a row or a column. > But maybe I make confusion between FU and the dependency matrix of the > FUs. yes. FUs are located. nowhere near the FU-FU matrix. an FU may be a FMAC for example, which will be... 20,000 gates, and there will be... 4 of those. the FU-FU Matrix is tiny by comparison, and centralised, and you absolutely do not want the two to be amalgamated. * FU does the job of *computing* results based on operands * FU-FU Matrix does the job of preserving the Directed Acyclic Graph (DAG) of relationships *BETWEEN* FUs. > 2. The layout (cell level) into which the cells of *one* FU (or whatever > sub-block) > are also arrayed in a matrix. As we may not put all the cells of a block > in just > one row. > > So would it be possible to send examples where one block of the matrix (in > the > sense of 1.) is clearly identified (best would be that it is put in a > sub-block) ? yaa of course. (it is already in a sub-block). added to experiment8, test_fu_fu_matrix.il. which you can re-generated to any size you wish to using this: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD and altering the n_fu_cols and n_fu_rows (make sure they are the same). > > Or, if it is really inconvenient due to the way the design is described at > nMigne level, > list me what I/O signals (which bit of vectors) are specific to one element > of the > matrix ? ok. so because of the recursive nature of a Directed Acyclic Graph, what you ask in both (1) *and* (2) is, strictly speaking, not possible. i may have this the wrong way round, please forgive me for that: * a row in the FU-FU Matrix expresses that one FU has a *BLOCK* on other FUs (and each element in that row says which one) this would be "outputs" on a node in the Directed Acyclic Graph of read-write dependencies * a column in the FU-FU expresses that one FU is *BLOCKED BY* another FU (and each element in that column says which one) this would be "inputs" on a node in the Directed Acyclic Graph of read-write dependencies therefore, technically, it is not possible to "divide" them from each other, from an inter-relationship perspective. however... for convenience, what i have had to do is to *actually* divide them into a 1D array of 1D arrays. this because of the simulation speed of nmigen. a previous version, i had *actual* 2D cells. each SR Latch was a single bit (not a bitvector). this was so horribly slow i could not tolerate it. > A slow learner. this is a complex topic. > PS: I still not completely understand the color coding of the scoreboard > schematic. > Is the size and position of the little blue/yellow/green squares inside > the red one > significant? bear in mind this is for the FU-Regs Matrix image (see attachments, here, or see p23) scoreboard_mechanics.1.pdf - section 10.5 p23 "A scoreboard using Dep Matrix" In this diagram * a red box denotes that this entry can read or write those registers. * A blue box denotes that it is possible for an instruction in that Unit to write to this register. * A yellow box indicates that an instruction in this Unit can read this register. * Finally, a green box indicates that this Unit can either read this register for normal activities, or write to this register if a store is being performed. For each of the boxes a clocked set-reset flip-flop is used to gate state changes into the table so that subsequent instructions will see the state of the current register data-flow dependencies.
after a little thought, see the diagram in attachment "fu-fu matrix" https://bugs.libre-soc.org/attachment.cgi?id=58 notice there is "Group1", "Group2", "Group3" at the bottom? these are the Priority Pickers. there are separate Register Files (and there are 5R2W ports in the original CDC 6600) therefore, in the original CDC 6600, they had 3x banks of Priority Pickers. if we follow this arrangement, it means that readable_o and writable_o definitely come out at the BOTTOM of the FU-FU Matrix.
(In reply to Luke Kenneth Casson Leighton from comment #8) > after a little thought, see the diagram in attachment "fu-fu matrix" > https://bugs.libre-soc.org/attachment.cgi?id=58 Hello Luke, I'm starting to have automated matrix placement working for the FU-FU matrix. But the first comparison I can make between matrix placement and automated placement still got the later winning (by a little). I have two little remarks about the FU-FU example you gave me: 1. It seems to me that the current example has 4 unit and 3 regs (inverted in comment #5). 2. Nothing is connected to gowr3_i, seems strange to me. I did try to generate a bigger FU-FU, but got stuck here: Traceback (most recent call last): File "fu_fu_matrix.py", line 169, in <module> test_fu_fu_matrix() File "fu_fu_matrix.py", line 166, in test_fu_fu_matrix run_simulation(dut, d_matrix_sim(dut), vcd_name='test_fu_fu_matrix.vcd') File "/usr/lib/python3.6/site-packages/nmigen/compat/sim/__init__.py", line 22, in run_simulation fragment.domains += ClockDomain("sync") AttributeError: 'FUFUDepMatrix' object has no attribute 'domains' I should have the latest nmigen, soc & nmutil repositry versions. Would it be possible for you to generate a FU-FU matrix as close as possible to the one intended in terms of number of FU and registers (the idea is to have a representative amount of wires and gates). Best regards, PS1: I use the approach of "not getting to the bottom of things but start like a copying monkey", but always still felt a little bit like cheating. Now, I will do it proudly... PS2: Maybe you have seen this... (personally, I'm at 110 characters) https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
(In reply to Jean-Paul.Chaput from comment #9) > (In reply to Luke Kenneth Casson Leighton from comment #8) > > after a little thought, see the diagram in attachment "fu-fu matrix" > > https://bugs.libre-soc.org/attachment.cgi?id=58 > > Hello Luke, > > I'm starting to have automated matrix placement working for the FU-FU > matrix. superb. do commit it, we can take a look. > But the first comparison I can make between matrix placement > and automated placement still got the later winning (by a little). that's veery unlikely to be the case at 32x12, 32x25 or, when we go to GPU DMs, 48x30 or even 128x30. (this is FU-REGs, we are currently investigating FU-FU, which will still potentially be be up to 30x30) > I have two little remarks about the FU-FU example you gave me: > > 1. It seems to me that the current example has 4 unit and 3 regs > (inverted in comment #5). probably :) actually for FU-FU it should be a square matrix. > 2. Nothing is connected to gowr3_i, seems strange to me. because if it's not a square matrix, that's what happens. > I did try to generate a bigger FU-FU, but got stuck here: > > Traceback (most recent call last): > File "fu_fu_matrix.py", line 169, in <module> > test_fu_fu_matrix() > File "fu_fu_matrix.py", line 166, in test_fu_fu_matrix > run_simulation(dut, d_matrix_sim(dut), vcd_name='test_fu_fu_matrix.vcd') > File "/usr/lib/python3.6/site-packages/nmigen/compat/sim/__init__.py", > line 22, in run_simulation > fragment.domains += ClockDomain("sync") > AttributeError: 'FUFUDepMatrix' object has no attribute 'domains' > > I should have the latest nmigen, soc & nmutil repositry versions. you need the version nmigen from whitequark, not from m-labs. check the repo url in .git/config, it should be https://github.com/nmigen/nmigen/ and you can ignore that, anyway - the ilang file-generation should have happened before that. > Would it be possible for you to generate a FU-FU matrix as close as possible > to the one intended in terms of number of FU and registers (the idea is to > have a representative amount of wires and gates). let's make it 30x30 - that's pretty damn big. for 180nm we'll likely go with 1/2 that size. > Best regards, > > PS1: I use the approach of "not getting to the bottom of things but start > like > a copying monkey", but always still felt a little bit like cheating. > Now, I will > do it proudly... haha very funny :) > PS2: Maybe you have seen this... (personally, I'm at 110 characters) > > https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates- > 80-Col sigh there's some severe problems with internet connectivity (much more sophisticated than a DDOS attack), i still cannot get access to the majority of news websites. slashdot, theregister, phoronix, zoom, they're all timing out (!) if you can see this page https://slashdot.org/~lkcl then you'll see my comments on this one. summary: it's a serious mistake. this screenshot shows why: https://libre-soc.org/HDL_workflow/640x-2020-01-24_11-56.png
(In reply to Luke Kenneth Casson Leighton from comment #10) > (In reply to Jean-Paul.Chaput from comment #9) > > (In reply to Luke Kenneth Casson Leighton from comment #8) Hello Luke, I posted in the soclayout & Coriolis repository a first trial at getting back matrix like structures. It is *only* tailored for the FU-FU matrix 30x30 (hardcoded in the Coriolis plugin). I added a Makefile to allow you to run it conveniently. * Generate & flatten the netlist in vst: > make vstf * You call the plugin in graphical mode in: > make viewf Then, in the menus: Misc -> Beta -> Matrix Placer It is slow (~5 minutes) and there is no algorithmic reasons to, almost everything is linear or log(n). The important conclusion I draw with this preliminary work, is that the automatic placer is doing better that the matrix placer. You can see the results in RESULTS.txt. Why is that so? * The automatic placer is good, in graphic mode you can see him finding back the matrix structure. * The matrix placer lose space as each block of the matrix as a ragged right edge where you systematically lose space. Of course it can be improved, but in the end you may just reach the 5% of marging of the automated placer. * The conclusion about total wire length is less clear, seems close to each other. But with uneven density for the matrix placer, so a little bit more difficult to route. * So we may spend a lot of time for almost no significant improvement. So, it seems to me that Staf wins... The approach we should take is feed it directly amost as a whole to the P&R (except for RAM). If the tools cannot cope, we may broke it in 50Kgates blocks. * The matrix placement may still win, but with a different approach. Build a manual generator from the groud up. Do not use synthesis then try to guess back the placement. But it is a lot of work with all the various cases you want to manage. Best regards,
hiya jean-paul, thank you so much for this detailed and dedicated work. i will come back to an answer things when i can see the layout. i am walking through replicating the results, and encounter this after "make vstf" - any clues? r test_fu_fu_matrix test_fu_fu_matrix_flat make: r: Command not found
$(FLATLO) does not appear to be defined. looking for it now.
(In reply to Luke Kenneth Casson Leighton from comment #13) > $(FLATLO) does not appear to be defined. looking for it now. Sorry, I just did a commit into alliance-check-toolkit to correct it.
(In reply to Jean-Paul.Chaput from comment #14) > (In reply to Luke Kenneth Casson Leighton from comment #13) > > $(FLATLO) does not appear to be defined. looking for it now. > > Sorry, I just did a commit into alliance-check-toolkit to > correct it. ah thanks :)