- Make an inventory of all memory blocks and the number of ports needed. Investigate possible architecture optimizations for port reduction.
- Investigate the architecture of a multi-port cell.
- Define a multi-port extension of the memory compiler, including the needed budget.
Don't see this as a dependency of #630. That is single port; this task is about what has to happen after #630.
Already comment from Luke in email: 'Yes. i am trying to avoid more than one write port but if that's possible (2W) it would be amazing and speed things up. we do need 2W for the DMA buffers though. if it is convenient i will be "stratifying" the regfiles so that they're subdivided (interleaved), odd-even banks. bank 1: r0 r2 r4 r6 r8 ... bank 2: r1 r3 r5 r7 r9 ... some of the very small regfiles (FAST, STATE, XER, CR) i would actually recommend leaving them as DFFs. the CR one is just ridiculously complicated. QTY 16of 32-bit regs comprising 4-bit access *on top* of a (full) 32-bit access port. multiple read-write ports @ *both* 4-bit *and* 32-bit. plus the CR regfile is actually unary-addressed (0b11111111 masks) for the LSBs, not binary-addressed, but *binary* addressed for the MSBs (selecting which of the 16).'
Could you point me to example code with these blocks with 4-bit and 32-bit ports?
The general principle is that the number of ports of a memory reflects the number of parallel accesses one needs to do in the same clock cycle. Adding an extra port increases the area of the SRAM block. Otherwise, higher-level logic should be used to arbitrate access, including bit-width granularity changes etc.
(In reply to Staf Verhaegen from comment #3)
> Could you point me to example code with these blocks with 4-bit and 32-bit
> ports ?

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;h=8f881423e4aedfc38b4f35d78c842aec908cf990;hb=HEAD#l114
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/virtual_port.py;hb=HEAD

that's just one 32-bit regfile for the standard Power ISA Condition Register; SVP64 needs *16* of these. each 32-bit CR has 8 CR Fields, 4 bits each. these 4-bit fields are accessed by crand, cror etc., and the full 32 bits by mcrf and mfcr.

with the 128 CR Fields being used for predication as well as Rc=1 targets (vectorised) there will be a HELL of a lot of read and write ports onto CRs. to prevent delays we may need something like 4R3W or even 5R3W 32-bit, with the write-enable being 8-bit wide (one per 4-bit CR field). fortunately there would only be QTY 16 such 4R3W/5R3W 32-bit regs.

it is perfectly fine to use one of those 32-bit ports to synthesise the 4-bit ports via an adapter. it is the opposite arrangement to that virtual_port.py file but that's ok.
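The adapter idea at the end of that comment can be sketched behaviourally. This is a hypothetical illustration (the helper names are not from soc.git): a 4-bit CR-field access is synthesised on top of a full 32-bit port by read-modify-write of one nibble.

```python
# Hypothetical behavioral sketch (NOT the soc.git implementation) of how a
# 4-bit CR-field port can be synthesised on top of a full 32-bit port:
# read the whole 32-bit register, replace one 4-bit field, write it back.

def cr_field_read(cr32: int, field: int) -> int:
    """Read CR field 0..7; field 0 is the most-significant nibble,
    following the Power ISA's big-endian CR-field numbering."""
    shift = (7 - field) * 4
    return (cr32 >> shift) & 0xF

def cr_field_write(cr32: int, field: int, value: int) -> int:
    """Read-modify-write one 4-bit field through the 32-bit port."""
    shift = (7 - field) * 4
    mask = 0xF << shift
    return (cr32 & ~mask & 0xFFFFFFFF) | ((value & 0xF) << shift)

# example: set CR0 (field 0) to 0b1000 ("LT") in an all-zero CR
cr = cr_field_write(0, 0, 0b1000)   # -> 0x80000000
```

In hardware the same thing costs one read port plus one write port of the 32-bit regfile per 4-bit access, which is why the comment notes it as an acceptable trade-off.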
(In reply to Staf Verhaegen from comment #4)
> General principle is that number of ports of a memory reflects the number of
> parallel accesses one needs to do in the same clock cycle. Adding an extra
> port will increase area of the SRAM block.

with 128 64-bit INT and FP regs we may have to do an actual reg cache for higher clock rates. i would like to avoid that complication in the 180nm / 130nm geometries if practical.

i also do not mind breaking down into a stratified arrangement of four *separate* regfiles, or even 5:

* 32 regs r0-r31 all accessible
* 24 regs r32 r36 r40 ... r124
* 24 regs r33 r37 r41 ... r125
* 24 regs r34 ... r126
* 24 regs r35 ... r127

where each of those is 4R1W 64-bit with byte-level write-enable. and if instructions are ever issued as "add r31, r33, r127" the data goes into a cyclic buffer that shuffles along, with appropriate latency, to match up data from regfile port to ALU, which will *also* be stratified (5 different separate ALU banks, yes really, and yes it'll be a Monster but hey).

however, again, doing a Monster Vector Processor like this i would like to avoid in the first iteration @ 130/180 nm.

avoiding all external complications like that, the bare minimum: for FP and INT i will be happy with 4R1W on a 128x 64-bit regfile with byte-level write-enable. this gives single-cycle FMAC and also INT-MAC, with room for an INT predicate read without delay.
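The stratified numbering described above (one 32-entry bank plus four interleaved 24-entry banks) can be captured in a small mapping function. This is a hypothetical sketch; the function name and return convention are illustrative only.

```python
# Hypothetical sketch of the stratified regfile numbering described above:
# one 32-entry bank for r0-r31, plus four 24-entry banks holding
# r32,r36,...,r124 / r33,...,r125 / r34,...,r126 / r35,...,r127.

def reg_to_bank(reg: int):
    """Map architectural register number 0..127 to (bank, index)."""
    assert 0 <= reg <= 127
    if reg < 32:
        return (0, reg)                        # bank 0: r0-r31, direct
    return (1 + (reg & 3), (reg - 32) >> 2)    # banks 1-4, interleaved mod 4

# r32 -> bank 1 index 0, r33 -> bank 2 index 0, r127 -> bank 4 index 23
```

Note the bank sizes check out: (128 - 32) / 4 = 24 entries in each of the four interleaved banks, matching the list above.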
assuming this is a subtask of #589, if not, please either remove budget (comment out toml field and clear numeric fields) or set correct milestone and budget parent
(In reply to Jacob Lifshay from comment #7)
> assuming this is a subtask of #589,

#589 is the top-level NLnet proposal 2021-02-052.
https://libre-soc.org/nlnet_2021_crypto_router/
What is different between #589 and #690?
(In reply to Staf Verhaegen from comment #9)
> What is different between #589 and #690 ?

OK, saw on IRC. #589 is NLNet CryptoRouter and #690 is NGI POINTER. From my point of view this would fit under both of these, so you can decide which budget to use.
Staf I identified another location of SRAMs: the DCache and ICaches. there are two types:

* TLB and Page Table Entry caches
  - 1R1W
  - read on next cycle (sync)
  - write updates on NEXT cycle (sync)
  - forwarding needed (can be done externally, not part of SRAM)
  - 64 rows
  - 128 bit for tags, and 92 bit for PTEs
  - 4 write-enables, therefore 4x32-bit for tags and 4x23-bit for PTEs

  you can see the code which reads, modifies, then writes back some bits. I will make this use a Memory so it is more obvious:
  https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;h=d0829f4df8ac56dbcf96b5b018a66730e24261c4;hb=1cf999d43c244b49894d77e8f2151089496239eb#l432

  about 1k in size each; QTY 4 to be used. i will check that ICache is the same spec (should be)

* Cache SRAM
  - 1R1W
  - 128 rows
  - 64 bits
  - entire row written (no multiple wens)
  - read and write both 1 clock cycle each

  1k in size (128x8), QTY 8 to be used.
  https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/cache_ram.py;h=50ee1367cc84301bcf9cecf0f6cae51d13273227;hb=1cf999d43c244b49894d77e8f2151089496239eb
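The timing contract in the first bullet list (sync read, sync write, no internal forwarding) can be pinned down with a tiny behavioral model. This is a plain-Python sketch for illustration, not the actual soc.git model; the read-before-write behaviour on a same-row collision is an assumption implied by "forwarding needed (can be done externally)".

```python
# Behavioral sketch of the 1R1W synchronous block described above:
# read data appears on the NEXT cycle, a write also takes effect on the
# next cycle, and a simultaneous read of the written row returns the OLD
# data (read-before-write; forwarding, if needed, is done externally).

class Sram1R1W:
    def __init__(self, rows=128, width=64):
        self.mem = [0] * rows
        self.width_mask = (1 << width) - 1
        self.rdata = 0                    # registered read output

    def tick(self, raddr, wen=False, waddr=0, wdata=0):
        """Advance one clock cycle; returns the (registered) read data."""
        self.rdata = self.mem[raddr]      # capture OLD value first
        if wen:                           # write lands after the read
            self.mem[waddr] = wdata & self.width_mask
        return self.rdata
```

With this contract, the read-modify-write loop in dcache.py needs an external forwarding path (or a wasted cycle) whenever it touches the same row twice in a row.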
staf also, just fyi, this one has to move to NGI POINTER.
Just to notify that I have seen bug #781 but currently don't have time to give technical feedback. I first need to finish the Sky130 tape-out of the single port SRAM, which will be very close to making it. Just want to comment now that making a block that does a read followed by a write in one clock cycle is more risky than having a dual port RAM where you can do the read on one port and the write on the other port in the same clock cycle. The latter will take up more space, though. Details after I have finished the tape-out.
(In reply to Staf Verhaegen from comment #13)
> Just to notify I have seen bug #781 but currently don't have time to give
> technical feedback. I first need to finish Sky130 tape-out of single port
> SRAM which will be very close to be able to make it.

understood.

> Just want to comment now that making a block that does a read followed by a
> write in one clock cycle is more risky than having a dual port RAM where you
> can do the read on one port and a write on the other port in the same clock
> cycle.

redesigning the Libre-SOC Core's access to register files to take that into account, forcing all reads *and* writes through a single-access arbitration, is not a practical option, either in terms of the code itself or in terms of performance.

the absolute bare minimum that the Libre-SOC Core's code has been designed to is that the read port(s) are completely independent of the write port(s), and that the absolute bare minimum of each is one (1). redesigning the Arbitration (regfile access protection logic) is neither practical nor desirable.

> It will take up more space though.

there are 10 pipelines, some require 3x INT regfile reads and a minimum of 1x INT regfile writes, where LD/ST requires 2x INT regfile writes. any space saved is already overwhelmed by the size of the broadcast bus logic.
(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Staf Verhaegen from comment #13)
> > Just to notify I have seen bug #781 but currently don't have time to give
> > technical feedback. I first need to finish Sky130 tape-out of single port
> > SRAM which will be very close to be able to make it.
>
> understood.
>
> > Just want to comment now that making a block that does a read followed by a
> > write in one clock cycle is more risky than having a dual port RAM where you
> > can do the read on one port and a write on the other port in the same clock
> > cycle.
>
> redesigning the Libre-SOC Core's access to register files, to take that into
> account, forcing all reads *and* writes through a single-access arbitration,
> is not a practical option either in terms of the code itself nor in terms of
> performance.
>
> the absolute bare minimum that the Libre-SOC Core's code has been designed
> to is that the read port(s) are completely independent of the write port(s)
> and that the absolute bare minimum of each is one (1).

The dual port SRAM can be used with one port as the fixed write port and one port as the fixed read port. If you double the number of blocks you can then make a wrapper that has one write port and two read ports; with three blocks it is one write and three read ports, etc.
(In reply to Staf Verhaegen from comment #15)
> The dual port SRAM can be used with one port as the fixed write port and one
> port as the fixed read port. If you double the number of blocks you can than
> make a wrapper that has one write port and two read ports, with three blocks
> it is one write and three read ports etc.

indeed. and that can be isolated behind a standard nmigen "Memory" compatible interface, which is a far simpler and much less disruptive job.

can you make available what you have so far, so that we (and other people) can take a look? [rather than delay several weeks until the results of the MPW run are available]
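The replication wrapper described in the quoted comment (write port broadcast to every copy, each copy contributing one read port) can be sketched behaviourally. The class and method names here are illustrative, not from the soc.git codebase, and reads are modelled combinatorially for brevity.

```python
# Minimal behavioral sketch (assumed names) of the wrapper described above:
# two 1R1W blocks whose write ports are driven in lockstep, giving one
# write port and two fully independent read ports (1W2R). Extending to
# N read ports just means N banks.

class Mem1W2R:
    def __init__(self, rows=64, nports=2):
        self.banks = [[0] * rows for _ in range(nports)]

    def read(self, port, addr):
        """Each read port maps to its own bank's read port."""
        return self.banks[port][addr]

    def write(self, addr, data):
        """The single write port broadcasts to every bank,
        so all copies always hold identical contents."""
        for bank in self.banks:
            bank[addr] = data
```

The area cost is linear in the number of read ports (one full copy of the array per port), which is the trade-off being weighed against a true multi-port cell elsewhere in this thread.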
I would like to provide a simulation model for a dual port SRAM + an nmigen wrapper for 1WxR blocks. Is there a good place somewhere in the repos where I can do that? Or should I provide a new repo?
(In reply to Staf Verhaegen from comment #17)
> I would like to provide simulation model for a dual port SRAM + nmigen
> wrapper for 1WxR blocks.

brilliant. this is the 6T cell? (1R-or-1W)?

> Is there a good place somewhere in the repos where I can do that ? Or should
> I provide new repo ?

a new one is good, i can mirror it.

btw please do also commit the source code of the cell: we cannot operate on a "commit when finalised" basis. if you also include the OSI-approved License which says (usually) "No warranty, No liability", there is no risk to you.
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Staf Verhaegen from comment #17)
> > I would like to provide simulation model for a dual port SRAM + nmigen
> > wrapper for 1WxR blocks.
>
> brilliant. this is the 6T cell? (1R-or-1W)?

No, the dual port SRAM uses an 8T cell.

As said above, I do see risks associated with the proposal in #781, which I feel is not appropriate for the strict timeline needed for the current project. But before having a more in-depth discussion on that subject I would like to have the dual port models available.

BTW, I do find the '1R-or-1W' name very confusing; it's typically called 1RW.

> > Is there a good place somewhere in the repos where I can do that ? Or should
> > I provide new repo ?
>
> a new one is good, i can mirror it.
>
> btw please do also commit the source code of the cell,
> we cannot operate on a "commit when finalised basis"

This is just about the simulation model, so without the design of the SRAM itself, which will need a separate task. The latter will be done in the [c4m-flexmem repository](https://gitlab.com/Chips4Makers/c4m-flexmem), so commits will be made as I go along. The TSMC 180nm specific layout of course can't be made public; the Sky130 one will be.

That said, I want to clarify that a SRAM cell is not a digital cell; analog design is needed to get a full SRAM block working. You can't just write a digital wrapper around a SRAM cell to get a SRAM block, which seems to be a wrong assumption made in #781.
(In reply to Staf Verhaegen from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > (In reply to Staf Verhaegen from comment #17)
> > > I would like to provide simulation model for a dual port SRAM + nmigen
> > > wrapper for 1WxR blocks.
> >
> > brilliant. this is the 6T cell? (1R-or-1W)?
>
> No, the dual port SRAM uses an 8T cell.

ah brilliant, so definitely 1R1W, which is fantastic.

we've found someone who will be happy to give a quote for a 10T Cell (2R1W) using FlexLib. they're a commercial Memory Compiler company who are happy to do this as an experiment, and i've encouraged and explained to them the benefits of using sky130 MPWs. i'd very much like them to be able to get started from something pre-existing, and get started very soon given the time constraints, so it is really important that they have as much source code as possible, otherwise they end up duplicating effort (and charging for it).

therefore it's really important to make available the 1RW source code the moment it is written (even if it is in a development branch).

> But before having a more in-depth discussion on that subject I would like to
> have the dual port models available.

ok. then the fastest way, which saves time, is for them to be in this directory:

https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/bus;hb=HEAD

that's a catch-all location for things not yet with an appropriate location; it includes SPBlock512W64B8W.py for example.

> BTW, I do find the '1R-or-1W' name very confusing it's typically called 1RW.

ah, i used that term purely because i didn't know the industry-standard one. now i know it, i'll use 1RW in future. now i see it written, it makes perfect sense: the number (1) followed by the actions (RW).

> This is just about the simulation model so without the design of SRAM itself
> which will need a separate task.
> The latter will be done in the
> [c4m-flexmem repository](https://gitlab.com/Chips4Makers/c4m-flexmem) so
> commits will be done when going along.

ok great.

> The TSMC 180nm specific layout of
> course can't be made public; the Sky130 will be.

yes, perfectly understood.

> That said, I want to clarify that a SRAM cell is not a digital cell; analog
> design is needed to get a full SRAM block working.

yes, i heard. drive-line strength matters when inserted into a matrix; the size of the 2 transistors in the FF matters: get them wrong and you can set but not reset, or you can read but not write.

> You can't just write a
> digital wrapper around a SRAM cell to get a SRAM block which seems a wrong
> assumption made in #781.

i believe Cesar was intending to do alternating-clocks at the top level (the Memory Block) on a *pair* of Memory Blocks.
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)
> > (In reply to Luke Kenneth Casson Leighton from comment #18)
> > > (In reply to Staf Verhaegen from comment #17)
> > > > I would like to provide simulation model for a dual port SRAM + nmigen
> > > > wrapper for 1WxR blocks.
> > >
> > > brilliant. this is the 6T cell? (1R-or-1W)?
> >
> > No, the dual port SRAM uses an 8T cell.
>
> ah brilliant so definitely 1R1W, which is fantastic.
>
> we've found someone who will be happy to give a quote for a 10T Cell (2R1W)
> using FlexLib. they're a commercial Memory Compiler company who are happy to
> do this as an experiment, and i've encouraged and explained to them the
> benefits of using sky130 MPWs. i'd very much like them to be able to get
> started from something pre-existing, and get started very soon given the
> time constraints, so it is really important that they have as much source
> code as possible, otherwise they end up duplicating effort
> (and charging for it).
>
> therefore it's really important to make available the 1RW source code
> the moment it is written (even if it is in a development branch).

Ah nice. The picture changes when people with SRAM design experience are involved. The problem is that I am overloaded, which also means I don't have the bandwidth to teach people without SRAM design experience how to design a SRAM (block). I think it is best then that we have a telecon with me and them to see how to proceed from here.

As I am overloaded anyway, I would be happy to let the compiler design be done by a third party and divert a considerable part of my NGI Pointer money to that party. I could then focus on stabilizing the API and making the code base amenable to external contributions, which ATM is heavily slowed down by other interfering deadlines.
Unfortunately I think it is either one or the other in the NGI Pointer timeframe; i.e. either I do the dual port compiler design, or I support the other party in doing a compiler design (type(s) to be chosen), but not both at the same time. Actually my preference is the latter, as I am frustrated that I keep saying I will be stabilizing the API but then not delivering. So I will also change plans a little and not focus on the dual port RAM models, but on completing the docs for the sky130 SRAM tape-out I just finished. This contains the 6T cell design, which should be a good start for the discussion with your party.
(In reply to Staf Verhaegen from comment #21)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> > No, the dual port SRAM uses an 8T cell.
>
> ah brilliant so definitely 1R1W, which is fantastic.

No, actually 2RW :)
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)
> > You can't just write a
> > digital wrapper around a SRAM cell to get a SRAM block which seems a wrong
> > assumption made in #781.
>
> i believe Cesar was intending to do alternating-clocks at the top level
> (the Memory Block) on a *pair* of Memory Blocks.

You can indeed have the SRAM run at double the clock frequency of the main chip and that way transform a 1RW block into a 2RW block in the slower clock domain. When you do this you need to take care that the P&R handles the timing of these two clocks properly. Also, if you always want to do a read on the rising edge of the slower clock and a write on the falling edge, it seems not so trivial to be sure of that: you will need to use the slower clock to decide whether the higher-clocked SRAM needs to do a read or a write. This is ripe for post-P&R timing problems.
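The double-pumping scheme being debated here can be sketched behaviourally. Names are hypothetical, and this models functionality only; the post-P&R clock-domain timing risks Staf describes are exactly what this abstraction hides.

```python
# Behavioral sketch of double-pumping a 1RW block: clocked at 2x the
# main chip, it services two accesses per slow-clock cycle, e.g. port A
# in the first fast half-cycle and port B in the second, presenting a
# 2RW interface to the slower clock domain.

class DoublePumped2RW:
    def __init__(self, rows=64):
        self.mem = [0] * rows

    def _access(self, addr, we, wdata):
        # one fast-clock 1RW access: read old data, then optionally write
        old = self.mem[addr]
        if we:
            self.mem[addr] = wdata
        return old

    def slow_cycle(self, a, b):
        """Each of a, b is (addr, we, wdata); port A is served first,
        so a port-B read of a row port A just wrote sees the new data."""
        return self._access(*a), self._access(*b)
```

Note the fixed A-then-B ordering is itself a design decision that the real circuit has to guarantee across both clock domains, which is the source of the timing concern above.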
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Staf Verhaegen from comment #19)
>
> therefore it's really important to make available the 1RW source code
> the moment it is written (even if it is in a development branch).

The public release of flexmem was held back because I still had TSMC data in the code and so could not make a public release. This is solved now; the source code of the sky130 SRAM block is online and should be able to be rebuilt by anyone. See repo:

https://gitlab.com/Chips4Makers/sky130mpw5-sramtest

The needed code is mentioned in the README.
As the deadline for the Gigabit ASIC is approaching fast, the time for new SRAM development is limited. In order to keep the possibility of a tape-out in the NGI Pointer time frame it was decided to go for a dual port 2RW design and make wrappers for all the needed blocks, like register files (3R1W etc.). The reason to go for a dual port 2RW block is that blocks from the single port 1RW SRAM can be reused, and this way the dual port can be fitted into the timeframe of this project. The 1RW cell has one bit line pair and the 2RW cell has two bit line pairs; the reading and writing of each of the two bit line pairs can be done with the same circuits as for the single port compiler. Due to the higher load on the bit cell when reading on the two ports at the same time, the 2RW design may end up with a relatively bigger SRAM cell, though. If one were to use a 10T 2R1W cell, the reading and writing would have to be done with other circuits and thus involve more work. Also, the 2RW is more versatile: it can for example easily be used as a cache, or it can be used as a 1R1W block and from there a 2R1W wrapper can be made around it. As the scope of this investigation is now reduced due to time pressure, the budget for this task is reduced accordingly.
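The decided approach (2RW blocks plus wrappers for register-file configurations such as 3R1W) can be sketched behaviourally. This is a hypothetical illustration with assumed names: port A of every 2RW copy is tied together as the single write port, and each copy's port B serves one read port.

```python
# Hypothetical sketch of building a 3R1W register file from three 2RW
# blocks, per the decision described above: writes are broadcast on
# port A of all three copies (keeping them identical), and each copy's
# port B provides one of the three independent read ports.

class Sram2RW:
    """Behavioral model of one dual port 2RW block (untimed here)."""
    def __init__(self, rows=32):
        self.mem = [0] * rows

    def access(self, addr, we=False, wdata=0):
        old = self.mem[addr]
        if we:
            self.mem[addr] = wdata
        return old

class RegFile3R1W:
    def __init__(self, rows=32):
        self.blocks = [Sram2RW(rows) for _ in range(3)]

    def cycle(self, raddrs, wen=False, waddr=0, wdata=0):
        """raddrs: three read addresses; returns three read results."""
        if wen:                            # broadcast write on every port A
            for blk in self.blocks:
                blk.access(waddr, True, wdata)
        return [blk.access(a)              # independent reads on each port B
                for blk, a in zip(self.blocks, raddrs)]
```

The same pattern scales to the other wrapper shapes mentioned (N read ports need N copies), and using a 2RW block also leaves the 1R1W and cache configurations available with no extra circuit design, which is the versatility argument made above.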