724 – Determine required memory compiler developments

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's second ASIC
Classification:	Unclassified
Component:	Milestones
Version:	unspecified
Hardware:	PC Linux

Importance:	--- normal
Assignee:	Staf Verhaegen

URL:

Depends on:
Blocks:	690 814

Reported:	2021-10-11 19:09 BST by Staf Verhaegen
Modified:	2022-07-10 22:15 BST
CC List:	3 users

See Also:
NLnet milestone:	NGI.POINTER.Gigabit.ASIC
total budget (EUR) for completion of task and all subtasks:	2000
budget (EUR) for this task, excluding subtasks' budget:	2000
parent task for budget allocation:	814
child tasks for budget allocation:	781
The table of payments (in EUR) for this task; TOML format:	"Staf Verhaegen" = {amount=2000, paid=2022-07-09}

Description Staf Verhaegen 2021-10-11 19:09:03 BST

- Make an inventory of all memory blocks and the number of ports each needs.
  Investigate possible architecture optimizations for port reduction.
- Investigate the architecture of a multi-port cell.
- Define a multi-port extension of the memory compiler, including the needed budget.

Comment 1 Staf Verhaegen 2021-10-11 19:12:24 BST

I don't see this as a dependency of #630. That is single port; this task is about what has to happen after #630.

Comment 2 Staf Verhaegen 2021-10-11 19:13:53 BST

Luke already commented on this by email:

'Yes.  i am trying to avoid more than one write port but if that's
possible (2W) it would be amazing and speed things up.

we do need 2W for the DMA buffers though.

if it is convenient i will be "stratifying" the regfiles so that they're
subdivided (interleaved), odd-even banks.

bank 1: r0 r2 r4 r6 r8 ...
bank 2: r1 r3 r5 r7 r9 ...

some of the very small regfiles (FAST, STATE, XER, CR) i would
actually recommend leaving them as DFFs.  the CR one is just
ridiculously complicated.  QTY 16 of 32-bit regs comprising 4-bit
access *on top* of a (full) 32-bit access port.  multiple read-write
ports @ *both* 4-bit *and* 32-bit. plus the CR regfile is actually
unary-addressed (0b11111111 masks) for the LSBs, not
binary-addressed, but *binary* addressed for the MSBs
(selecting which of the 16).'
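
The odd-even interleaving described in the email can be sketched as below. This is only an illustration: the helper name `bank_and_row` and the 0-based bank numbering are assumptions made for this sketch (the email numbers the banks 1 and 2), not code from the Libre-SOC repository.

```python
def bank_and_row(regnum):
    """Map a register number to (bank, row) for a two-bank odd-even split.

    Hypothetical helper: bank 0 holds r0, r2, r4, ... and bank 1 holds
    r1, r3, r5, ... (called "bank 1" and "bank 2" in the email).
    """
    return regnum & 1, regnum >> 1


# Two consecutive register numbers always land in different banks,
# so each bank only has to serve one of the two accesses.
for r in range(6):
    bank, row = bank_and_row(r)
    print(f"r{r} -> bank {bank}, row {row}")
```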

Comment 3 Staf Verhaegen 2021-10-11 19:14:59 BST

Could you point me to example code with these blocks with 4-bit and 32-bit ports ?

Comment 4 Staf Verhaegen 2021-10-11 19:23:29 BST

The general principle is that the number of ports of a memory reflects the number of parallel accesses one needs to do in the same clock cycle. Adding an extra port will increase the area of the SRAM block.
Otherwise, higher-level logic should be used to arbitrate accesses, including bit-width granularity changes etc.

Comment 5 Luke Kenneth Casson Leighton 2021-10-11 23:15:56 BST

(In reply to Staf Verhaegen from comment #3)
> Could you point me to example code with these blocks with 4-bit and 32-bit
> ports ?

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;h=8f881423e4aedfc38b4f35d78c842aec908cf990;hb=HEAD#l114

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/virtual_port.py;hb=HEAD

that's just one 32bit regfile for the standard Power ISA Condition Register
SVP64 needs *16* of these.

each 32bit CR has 8 CR Fields, 4 bits each.  these 4bit fields are accessed
by crand, cror etc. and the full 32bit by mcrf and mfcr.

with the 128 CR Fields being used for predication as well as Rc=1 targets
(vectorised) there will be a HELL of a lot of read and write ports onto
CRs.

to prevent delays we may need something like 4R3W or even 5R3W 32bit, with the write-enable
being 8bit wide: one enable bit for each 4bit CR field

fortunately there would only be QTY 16 such 4R3W/5R3W 32bit regs.

it is perfectly fine to use one of those 32-bit ports to synthesise
the 4-bit ports via an adapter.  it is the opposite arrangement to
that virtual_port.py file but that's ok.
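
A rough behavioural sketch of synthesising 4-bit CR Field accesses from a 32-bit port with an 8-bit write-enable follows. It is plain Python, not nmigen, and every name in it is hypothetical. Two assumptions are made: CR Field 0 sits in the most significant nibble (Power ISA MSB0 numbering), and write-enable bit i controls CR Field i.

```python
FIELD_BITS = 4
NUM_FIELDS = 8


def field_shift(field):
    # Assumption: CR Field 0 is the most significant nibble of the 32-bit CR.
    return (NUM_FIELDS - 1 - field) * FIELD_BITS


def write_32(cr, data, wen):
    """Full-width write; bit i of the 8-bit `wen` enables CR Field i."""
    mask = 0
    for field in range(NUM_FIELDS):
        if (wen >> field) & 1:
            mask |= 0xF << field_shift(field)
    return (cr & ~mask & 0xFFFFFFFF) | (data & mask)


def write_field(cr, field, value):
    """4-bit write expressed as a 32-bit write with a one-hot write-enable."""
    return write_32(cr, (value & 0xF) << field_shift(field), 1 << field)


def read_field(cr, field):
    """4-bit read expressed as a 32-bit read followed by a shift and mask."""
    return (cr >> field_shift(field)) & 0xF


cr = write_field(0, 0, 0b1010)
print(hex(cr), read_field(cr, 0))
```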

Comment 6 Luke Kenneth Casson Leighton 2021-10-11 23:52:09 BST

(In reply to Staf Verhaegen from comment #4)
> General principle is that number of ports of a memory reflects the number of
> parallel accesses one needs to do in the same clock cycle. Adding an extra
> port will increase area of the SRAM block.

with 128 64 bit INT and FP regs we may have to do an actual reg cache
for higher clock rates, i would like to avoid that complication in the
180nm / 130nm geometries if practical.

i also do not mind breaking down into a stratified arrangement
of four *separate* regfiles or even 5:

32 regs r0-r31 all accessible
24 regs r32 r36 r40 ... r124
24 regs r33 r37 r41 ... r125
24 regs r34         ... r126
24 regs r35         ... r127

where each of those is 4R1W 64bit with byte-level write-enable.

and if instructions are ever issued add r31, r33, r127 the data
goes into a cyclic buffer that shuffles along, with appropriate
latency, to match up data from regfile port to ALU which will
*also* be stratified (5 different separate ALU banks, yes really,
and yes it'll be a Monster but hey).

however again doing a Monster Vector Processor like this i would
like to avoid in the first iteration @ 130/180 nm

avoiding all external complications like that: bare minimum:
for FP and INT i will be happy with 4R1W on a 128x 64-bit
regfile with byte-level write-enable.

this gives FMAC single cycle and also INT-MAC with room for an INT
predicate read without delay.
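
The five-way stratified arrangement listed above can be written down as a mapping from register number to (regfile, index). This is a sketch only: `stratified_location` is a hypothetical name, and numbering the regfiles 0 to 4 in the order they are listed is an assumption.

```python
from collections import Counter


def stratified_location(regnum):
    """Return (regfile, index) for the 32 + 4x24 split of 128 registers."""
    if not 0 <= regnum < 128:
        raise ValueError("register number out of range")
    if regnum < 32:
        return 0, regnum                      # r0-r31, all accessible
    return 1 + regnum % 4, (regnum - 32) // 4  # r32.. interleaved by 4


# The example instruction "add r31, r33, r127" touches three different
# regfiles, which is why the data has to be routed between banks.
for r in (31, 33, 127):
    print(f"r{r} -> regfile {stratified_location(r)[0]}, "
          f"index {stratified_location(r)[1]}")

# Sanity check of the regfile sizes: 32 + 24 + 24 + 24 + 24 = 128.
sizes = Counter(stratified_location(r)[0] for r in range(128))
print(dict(sizes))
```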

Comment 7 Jacob Lifshay 2021-11-26 19:29:09 GMT

assuming this is a subtask of #589, if not, please either remove budget (comment out toml field and clear numeric fields) or set correct milestone and budget parent

Comment 8 Luke Kenneth Casson Leighton 2021-11-26 19:49:30 GMT

(In reply to Jacob Lifshay from comment #7)
> assuming this is a subtask of #589,

#589 is the top-level NLnet proposal 2021-02-052.
https://libre-soc.org/nlnet_2021_crypto_router/

Comment 9 Staf Verhaegen 2021-11-27 11:45:31 GMT

What is different between #589 and #690 ?

Comment 10 Staf Verhaegen 2021-11-27 12:50:55 GMT

(In reply to Staf Verhaegen from comment #9)
> What is different between #589 and #690 ?

OK, saw on IRC. #589 is NLNet CryptoRouter and #690 is NGI POINTER.
From my point of view this would fit under either of these, so you can decide which budget to use.

Comment 11 Luke Kenneth Casson Leighton 2021-12-06 21:28:31 GMT

Staf I identified another location of SRAMs: the DCache and ICaches.
there are two types:

* TLB and Page Table Entry caches
  - 1R1W
  - read on next cycle (sync)
  - write updates on NEXT cycle (sync)
  - forwarding needed (can be done externally,
    not part of SRAM)
  - 64 rows
  - 128 bit for tags, and 92 bit for PTEs
  - 4 write-enables therefore 4x32-bit for tags
    and 4x23-bit for PTEs

you can see the code which reads, modifies, then writes back
some bits.  I will make this use a Memory so it is more obvious

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;h=d0829f4df8ac56dbcf96b5b018a66730e24261c4;hb=1cf999d43c244b49894d77e8f2151089496239eb#l432

about 1k in size, each there, QTY 4 to be used.  i will check
that ICache is the same spec (should be)

* Cache SRAM
  - 1R1W
  - 128 rows
  - 64 bits
  - entire row written (no multiple wens)
  - read and write both 1 clock cycle

each 1k in size (128 rows x 8 bytes), QTY 8 to be used.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/cache_ram.py;h=50ee1367cc84301bcf9cecf0f6cae51d13273227;hb=1cf999d43c244b49894d77e8f2151089496239eb
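
The sizes quoted in this comment can be cross-checked with a few lines of arithmetic. The helper name `sram_bytes` is made up for this sketch; the row counts and widths are the ones listed above.

```python
def sram_bytes(rows, bits_per_row):
    """Capacity of one SRAM block in bytes."""
    return rows * bits_per_row // 8


tag_bytes = sram_bytes(64, 128)    # TLB tags: 64 rows x 128 bit
pte_bytes = sram_bytes(64, 92)     # PTEs:     64 rows x 92 bit
cache_bytes = sram_bytes(128, 64)  # Cache SRAM: 128 rows x 64 bit

# Width of each write-enable lane when a row is split four ways.
tag_lane_bits = 128 // 4
pte_lane_bits = 92 // 4

print(tag_bytes, pte_bytes, cache_bytes)  # 1024 736 1024
print(tag_lane_bits, pte_lane_bits)       # 32 23
```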

Comment 12 Luke Kenneth Casson Leighton 2021-12-06 21:30:06 GMT

staf also, just fyi, this one has to move to NGI POINTER.

Comment 13 Staf Verhaegen 2022-03-15 13:33:18 GMT

Just to notify that I have seen bug #781 but currently don't have time to give technical feedback. I first need to finish the Sky130 tape-out of the single port SRAM, which will be a very close call to make the deadline.

Just want to comment now that making a block that does a read followed by a write in one clock cycle is more risky than having a dual port RAM where you can do the read on one port and a write on the other port in the same clock cycle.
It will take up more space though.
Details after I have finished the tape-out.

Comment 14 Luke Kenneth Casson Leighton 2022-03-15 18:53:15 GMT

(In reply to Staf Verhaegen from comment #13)
> Just to notify I have seen bug #781 but currently don't have time to give
> technical feedback. I first need to finish Sky130 tape-out of single port
> SRAM which will be very close to be able to make it.

understood.

> Just want to comment now that making a block that does a read followed by a
> write in one clock cycle is more risky than having a dual port RAM where you
> can do the read on one port and a write on the other port in the same clock
> cycle.

redesigning the Libre-SOC Core's access to register files, to take that into
account, forcing all reads *and* writes through a single-access arbitration,
is not a practical option either in terms of the code itself nor in terms of
performance.

the absolute bare minimum that the Libre-SOC Core's code has been designed
to is that the read port(s) are completely independent of the write port(s)
and that the absolute bare minimum of each is one (1).

redesigning the Arbitration (regfile access protection logic)
is neither practical nor desirable.


> It will take up more space though.

there are 10 pipelines, some require 3x INT regfile reads, and a minimum
1x INT regfile writes, where LD/ST requires 2x INT regfile writes.

any space saved is already overwhelmed by the size of the broadcast
bus logic.

Comment 15 Staf Verhaegen 2022-03-16 08:32:44 GMT

(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Staf Verhaegen from comment #13)
> > Just to notify I have seen bug #781 but currently don't have time to give
> > technical feedback. I first need to finish Sky130 tape-out of single port
> > SRAM which will be very close to be able to make it.
> 
> understood.
> 
> > Just want to comment now that making a block that does a read followed by a
> > write in one clock cycle is more risky than having a dual port RAM where you
> > can do the read on one port and a write on the other port in the same clock
> > cycle.
> 
> redesigning the Libre-SOC Core's access to register files, to take that into
> account, forcing all reads *and* writes through a single-access arbitration,
> is not a practical option either in terms of the code itself nor in terms of
> performance.
> 
> the absolute bare minimum that the Libre-SOC Core's code has been designed
> to is that the read port(s) are completely independent of the write port(s)
> and that the absolute bare minimum of each is one (1).

The dual port SRAM can be used with one port as the fixed write port and one port as the fixed read port. If you double the number of blocks you can then make a wrapper that has one write port and two read ports, with three blocks it is one write and three read ports etc.
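
A behavioural sketch of such a wrapper is shown below, in plain Python rather than nmigen. The class names are invented for this illustration and timing is ignored entirely; the only point shown is that every write goes to all blocks while each read port reads its own block.

```python
class DualPort1R1W:
    """Stand-in for one dual-port block: one fixed write, one fixed read port."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def write(self, addr, data):
        self.mem[addr] = data

    def read(self, addr):
        return self.mem[addr]


class MultiRead1W:
    """1W xR wrapper: one block per read port, writes broadcast to all."""

    def __init__(self, depth, read_ports):
        self.blocks = [DualPort1R1W(depth) for _ in range(read_ports)]

    def write(self, addr, data):
        # Keep all copies identical by writing every block.
        for block in self.blocks:
            block.write(addr, data)

    def read(self, port, addr):
        # Each read port has a private block, so reads never conflict.
        return self.blocks[port].read(addr)


regfile = MultiRead1W(depth=128, read_ports=3)  # 3R1W from three blocks
regfile.write(5, 0xDEAD)
print([hex(regfile.read(port, 5)) for port in range(3)])
```

The cost is area: N read ports need N copies of the data.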

Comment 16 Luke Kenneth Casson Leighton 2022-03-16 09:50:57 GMT

(In reply to Staf Verhaegen from comment #15)

> The dual port SRAM can be used with one port as the fixed write port and one
> port as the fixed read port. If you double the number of blocks you can then
> make a wrapper that has one write port and two read ports, with three blocks
> it is one write and three read ports etc.

indeed.  and that can be isolated behind a standard nmigen "Memory" compatible
interface, and is a far simpler and much less disruptive job.

can you make available what you have so far so that we (and other people)
can take a look? [rather than delay several weeks until the results of the
MPW run are available]

Comment 17 Staf Verhaegen 2022-03-24 10:38:03 GMT

I would like to provide a simulation model for a dual port SRAM plus an nmigen wrapper for 1WxR blocks.
Is there a good place somewhere in the repos where I can do that? Or should I provide a new repo?

Comment 18 Luke Kenneth Casson Leighton 2022-03-24 11:04:20 GMT

(In reply to Staf Verhaegen from comment #17)
> I would like to provide simulation model for a dual port SRAM + nmigen
> wrapper for 1WxR blocks.

brilliant. this is the 6T cell? (1R-or-1W)? 

> Is there a good place somewhere in the repos where I can do that ? Or should
> I provide new repo ?

a new one is good, i can mirror it.

btw please do also commit the source code of the cell,
we cannot operate on a "commit when finalised" basis

if you also include the OSI-approved License which
says (usually) "No warranty, No liability" there is
no risk to you.

Comment 19 Staf Verhaegen 2022-03-24 11:33:11 GMT

(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Staf Verhaegen from comment #17)
> > I would like to provide simulation model for a dual port SRAM + nmigen
> > wrapper for 1WxR blocks.
> 
> brilliant. this is the 6T cell? (1R-or-1W)? 

No, the dual port SRAM uses an 8T cell. As said above I do see risks associated with the proposal in #781 which I feel is not appropriate for the strict timeline needed for the current project.
But before having a more in-depth discussion on that subject I would like to have the dual port models available.

BTW, I do find the '1R-or-1W' name very confusing; it's typically called 1RW.

> 
> > Is there a good place somewhere in the repos where I can do that ? Or should
> > I provide new repo ?
> 
> a new one is good, i can mirror it.
> 
> btw please do also commit the source code of the cell,
> we cannot operate on a "commit when finalised basis"

This is just about the simulation model, so without the design of the SRAM itself, which will need a separate task. The latter will be done in the [c4m-flexmem repository](https://gitlab.com/Chips4Makers/c4m-flexmem), so commits will be made as the work goes along. The TSMC 180nm specific layout of course can't be made public; the Sky130 one will be.

That said, I want to clarify that an SRAM cell is not a digital cell; analog design is needed to get a full SRAM block working. You can't just write a digital wrapper around an SRAM cell to get an SRAM block, which seems to be a wrong assumption made in #781.

Comment 20 Luke Kenneth Casson Leighton 2022-03-24 11:59:21 GMT

(In reply to Staf Verhaegen from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > (In reply to Staf Verhaegen from comment #17)
> > > I would like to provide simulation model for a dual port SRAM + nmigen
> > > wrapper for 1WxR blocks.
> > 
> > brilliant. this is the 6T cell? (1R-or-1W)? 
> 
> No, the dual port SRAM uses an 8T cell.

ah brilliant so definitely 1R1W, which is fantastic.

we've found someone who will be happy to give a quote for a 10T Cell (2R1W)
using FlexLib.  they're a commercial Memory Compiler company who are happy to
do this as an experiment, and i've encouraged and explained to them the
benefits of using sky130 MPWs.  i'd very much like them to be able to get
started from something pre-existing, and get started very soon given the
time constraints, so it is really important that they have as much source
code as possible, otherwise they end up duplicating effort
(and charging for it).

therefore it's really important to make available the 1RW source code
the moment it is written (even if it is in a development branch).


> But before having a more in-depth discussion on that subject I would like to
> have the dual port models available.

ok.  then the fastest way which saves time is for them to be in this
directory: https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/bus;hb=HEAD

that's a catch-all location for things not yet with an appropriate location,
it includes SPBlock512W64B8W.py for example.

> BTW, I do find the '1R-or-1W' name very confusing it's typically called 1RW.

ah i used that term purely because i didn't know the industry-standard one.
now i know it, i'll use 1RW in future.  now i see it written, it makes perfect
sense: the number (1) followed by the actions (RW).

 
> This is just about the simulation model so without the design of SRAM itself
> which will need a separate task. The latter will be done in the
> [c4m-flexmem repository](https://gitlab.com/Chips4Makers/c4m-flexmem) so
> commits will be done when going along. 

ok great.

> The TSMC 180nm specific layout of
> course can't be made public; the Sky130 will be.

yes, perfectly understood.
 
> That said, I want to clarify that a SRAM cell is not a digital cell; analog
> design is needed to get a full SRAM block working. 

yes, i heard.  drive-line strength matters when inserted into a matrix;
the size of the 2 transistors in the FF matters: get them wrong and you
can set but not reset, or you can read but not write.

> You can't just write a
> digital wrapper around a SRAM cell to get a SRAM block which seems a wrong
> assumption made in #781.

i believe Cesar was intending to do alternating-clocks at the top level
(the Memory Block) on a *pair* of Memory Blocks.

Comment 21 Staf Verhaegen 2022-03-24 13:11:58 GMT

(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)
> > (In reply to Luke Kenneth Casson Leighton from comment #18)
> > > (In reply to Staf Verhaegen from comment #17)
> > > > I would like to provide simulation model for a dual port SRAM + nmigen
> > > > wrapper for 1WxR blocks.
> > > 
> > > brilliant. this is the 6T cell? (1R-or-1W)? 
> > 
> > No, the dual port SRAM uses an 8T cell.
> 
> ah brilliant so definitely 1R1W, which is fantastic.
> 
> we've found someone who will be happy to give a quote for a 10T Cell (2R1W)
> using FlexLib.  they're a commercial Memory Compiler company who are happy to
> do this as an experiment, and i've encouraged and explained to them the
> benefits of using sky130 MPWs.  i'd very much like them to be able to get
> started from something pre-existing, and get started very soon given the
> time constraints, so it is really important that they have as much source
> code as possible, otherwise they end up duplicating effort
> (and charging for it).
> 
> therefore it's really important to make available the 1RW source code
> the moment it is written (even if it is in a development branch).

Ah nice. The picture changes when people with SRAM design experience are involved. The problem is that I am overloaded, which also means I don't have the bandwidth to teach people without SRAM design experience how to design an SRAM (block).

I think it is best then that we have a teleconference with me and them to see how to proceed from here. As I am overloaded anyway, I would be happy to let the compiler design be done by a third party and divert a considerable part of my NGI Pointer money to that party. I could then focus on stabilizing the API and making the code base amenable to external contributions, which ATM is heavily slowed down by other interfering deadlines.

Unfortunately I think it is either one or the other in the NGI Pointer timeframe; i.e. either I do the dual port compiler design or I support the other party in doing a compiler design (type(s) to be chosen), but not both at the same time.
Actually my preference is the latter, as I am frustrated that I am always saying I will be stabilizing the API but do not deliver.

So I will also change plans a little and not focus on the dual port RAM models but on completing the docs for the sky130 SRAM tape-out I just finished. This contains the 6T cell design, which should be a good start for discussion with your party.

Comment 22 Staf Verhaegen 2022-03-24 13:14:15 GMT

(In reply to Staf Verhaegen from comment #21)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> > No, the dual port SRAM uses an 8T cell.
> 
> ah brilliant so definitely 1R1W, which is fantastic.

No actually 2RW :)

Comment 23 Staf Verhaegen 2022-03-24 13:20:34 GMT

(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)
> > You can't just write a
> > digital wrapper around a SRAM cell to get a SRAM block which seems a wrong
> > assumption made in #781.
> 
> i believe Cesar was intending to do alternating-clocks at the top level
> (the Memory Block) on a *pair* of Memory Blocks.

You can indeed have the SRAM run at double the clock frequency of the main chip and that way transform a 1RW block into a 2RW block in the slower clock domain.

When you do this you need to take care that the P&R handles the timing of these two clocks properly.
Also, if you always want to do a read on the rising edge of the slower clock and a write on the falling edge, it seems not so trivial to be sure about that. You will need to use the slower clock to decide if the higher clocked SRAM needs to do a read or write. This is ripe for post P&R timing problems.
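
The double-clocking idea can be modelled functionally as below. This is a sketch with invented class names: one slow cycle is treated as two fast accesses to a 1RW block, read first and write second. It says nothing about the clock-domain and post-P&R timing risks described above, which are the actual concern.

```python
class SinglePort1RW:
    """One access per fast clock cycle: either a read or a write."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def access(self, addr, write, data=0):
        if write:
            self.mem[addr] = data
            return None
        return self.mem[addr]


class DoubleClocked:
    """Read-plus-write per slow cycle, built from one 1RW block at 2x clock."""

    def __init__(self, depth):
        self.ram = SinglePort1RW(depth)

    def slow_cycle(self, read_addr=None, write_addr=None, write_data=0):
        result = None
        # Fast cycle 0 (first half of the slow cycle): the read.
        if read_addr is not None:
            result = self.ram.access(read_addr, write=False)
        # Fast cycle 1 (second half of the slow cycle): the write.
        if write_addr is not None:
            self.ram.access(write_addr, write=True, data=write_data)
        return result


mem = DoubleClocked(depth=64)
mem.slow_cycle(write_addr=3, write_data=7)
# Same address read and written in one slow cycle: the read sees the old value.
print(mem.slow_cycle(read_addr=3, write_addr=3, write_data=9))
print(mem.slow_cycle(read_addr=3))
```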

Comment 24 Staf Verhaegen 2022-03-25 13:01:28 GMT

(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)

> 
> therefore it's really important to make available the 1RW source code
> the moment it is written (even if it is in a development branch).

The public release of flexmem was held back because I still had TSMC data in the code, so I could not make a public release. This is solved now and the source code of the sky130 SRAM block is now online and should be able to be rebuilt by anyone.
See repo: https://gitlab.com/Chips4Makers/sky130mpw5-sramtest
The needed code is mentioned in the README.

Comment 25 Staf Verhaegen 2022-04-19 09:46:05 BST

As the deadline for the Gigabit ASIC is approaching fast, the time for new SRAM development is limited. In order to keep a tape-out possible within the NGI Pointer time frame, it was decided to go for a dual port 2RW design and make wrappers for all needed blocks such as register files (3R1W etc).

The reason to go for a dual port 2RW block is that blocks from the single port 1RW SRAM can be reused, and this way the dual port can be fitted in the timeframe of this project. The 1RW cell has one bit line pair and the 2RW cell has two bit line pairs. The reading and writing of each of the two bit line pairs can be done with the same circuits as for the single port compiler. Due to the higher load on the bit cell when reading on the two ports at the same time, the design of the 2RW cell may end up with a relatively bigger SRAM cell though.

If one would use a 10T 2R1W cell, the reading and the writing have to be done with other circuits and thus involve more work. Also, the 2RW is more versatile, as it can for example easily be used as a cache, or it can be used as a 1R1W block and from there a 2R1W wrapper can be made around it.
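
To make the versatility argument concrete, here is a small behavioural model of a 2RW block in which each port can do a read or a write in every cycle. It is a hypothetical sketch, not the flexmem design: the class name, the request tuples, the read-before-write behaviour and the rejection of two writes to the same address are all assumptions made for the illustration.

```python
class DualPort2RW:
    """Two ports; each does one read or one write per cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, port_a=None, port_b=None):
        """Each request is None, ("r", addr) or ("w", addr, data).

        Returns (result_a, result_b); a result is None unless the port read.
        """
        requests = (port_a, port_b)
        # Assumption: reads see the contents from before this cycle's writes.
        results = tuple(
            self.mem[req[1]] if req and req[0] == "r" else None
            for req in requests
        )
        writes = [req for req in requests if req and req[0] == "w"]
        if len(writes) == 2 and writes[0][1] == writes[1][1]:
            raise ValueError("both ports write the same address")
        for _, addr, data in writes:
            self.mem[addr] = data
        return results


ram = DualPort2RW(depth=16)
ram.cycle(("w", 1, 11), ("w", 2, 22))     # both ports writing
print(ram.cycle(("r", 1), ("r", 2)))      # both ports reading (cache-like use)
print(ram.cycle(("w", 1, 99), ("r", 1)))  # used as 1R1W: A writes, B reads
```

Fixing port A to writes and port B to reads gives the 1R1W behaviour, and duplicating such blocks behind a wrapper then gives 2R1W or more read ports.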

As the scope of this investigation is now reduced due to time pressure, the budget for this task is also reduced.