base-level SRAMs are 1R or 1W: this is a hard limitation of the way the cells are designed. addressing can be provided to give multiple banks and multiple rows, but with every single cell being 1R *or* 1W, the entire (larger) SRAM is likewise limited. register files obviously need much more than a single read port or a single write port, and they need a bare minimum of 1R *and* 1W on the *same* clock cycle. this can be provided (in effect) by running at Double Data Rate, where on one clock edge there is 1x read and on the other 1x write. on top of that DDR trick, a *second* trick is to have multiple such SRAMs, write to all of them, and allow independent reads from each copy. both these things need doing.
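the write-to-all-copies / read-independently half of this can be sketched behaviorally in a few lines of plain Python (illustrative names only, nothing from an actual cell library):

```python
# Behavioral sketch: a 1W/NR regfile built from N copies of a
# single-ported memory.  Every write goes to all copies, keeping them
# coherent; each copy then serves one independent read port.
class ReplicatedSRAM:
    def __init__(self, depth, nreads):
        # one plain single-port memory (modelled as a list) per read port
        self.banks = [[0] * depth for _ in range(nreads)]

    def write(self, addr, value):
        # the single write port hits every bank
        for bank in self.banks:
            bank[addr] = value

    def read(self, port, addr):
        # each read port has its own private copy, so reads never conflict
        return self.banks[port][addr]


rf = ReplicatedSRAM(32, 2)
rf.write(3, 99)
print(rf.read(0, 3), rf.read(1, 3))  # 99 99
```

this only models the replication; the per-copy DDR read/write interleaving would sit underneath it.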
idk if i'm interpreting correctly, but i think i found a dual-ported sram on skywater130: sram cell (other needed stuff is in parent dir): https://github.com/google/skywater-pdk-libs-sky130_fd_bd_sram/tree/main/cells/openram_dp_cell
By any chance, is the SRAM in question exactly the same as the one on the chip, which is described on bug 502? If so, I can adopt the simulation model on bug 502, comment 8.
(In reply to Jacob Lifshay from comment #1)
> idk if i'm interpreting correctly, but i think i found a dual-ported sram on
> skywater130: sram cell (other needed stuff is in parent dir):
> https://github.com/google/skywater-pdk-libs-sky130_fd_bd_sram/tree/main/cells/openram_dp_cell

i forgot we're not using skywater130, oops... thanks lkcl for pointing that out in the meeting today!
(In reply to Jacob Lifshay from comment #3)
> i forgot we're not using skywater130, oops... thanks lkcl for pointing that
> out in the meeting today!

if we did 32-bit Power we would stand a slim chance, rather than no chance, of fitting into 10 mm^2 on the sky130 MPWs.

(In reply to Cesar Strauss from comment #2)
> By any chance, is the SRAM in question exactly the same as the one on the
> chip, which is described on bug 502?

no, it was part of a 4k suite of SRAM cells, and 1R-or-1W.

> If so, I can adopt the simulation model on bug 502, comment 8.

assuming that Staf provides the cell, yes. that one is a 1R *or* 1W. in theory, with the back-to-back cell idea you came up with, yes. the trick with that will be to ensure that all possible permutations of reads and writes over 3 clock cycles work as 1R *and* 1W:

    clock 1     clock 2     clock 3
    no r no w   no r no w   no r no w
    r w         r w         r r
    r w         r           r w
    r w         r           r
    etc.

basically 64 permutations of 2 actions (r and w) over 3 cycles, maybe 256 over 4 cycles.
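the permutation count is easy to sanity-check; a tiny sketch (illustrative only, not the actual test bench) that enumerates every (read-request, write-request) pattern:

```python
from itertools import product

# each clock cycle independently has a read request (or not) and a write
# request (or not): 2 x 2 = 4 combinations per cycle
per_cycle = list(product([False, True], repeat=2))

# all request sequences over 3 cycles: 4**3 = 64
seqs3 = list(product(per_cycle, repeat=3))
# and over 4 cycles: 4**4 = 256
seqs4 = list(product(per_cycle, repeat=4))

print(len(seqs3), len(seqs4))  # 64 256
```

a formal-proof or exhaustive-simulation harness would iterate over exactly these sequences.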
Without going for multiple clocks, I think it may be possible to implement a 1W1R memory block, using a single clock, out of four 1RW memory blocks of the same size.

It would go like this:

The first 1RW memory block would alternate between read and write. For instance, write on even cycles, read on odd cycles. Let's call it a 1eW1oR memory.

The second memory, on odd cycles, copies the value just written to the first memory, and allows reading on even cycles (1oW1eR). The "odd-only" write port is tied up, but the read port (on even cycles) is free.

That way, we can still write only on even cycles, but can now read on both even and odd cycles: 1eW1R.

Now, repeat this to make a 1oW1R. Together with a "Live Value Table", we would get a full 1W1R memory out of four 1RW memories (plus multiplexers, and FFRAM for the LVT).

I suspect the area cost of this 1W1R would not compare well with one based on a single 2RW memory block but, in the absence of the latter, it can compete, at least in terms of risk, with a DDR 1RW memory implementation.
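A behavioral sketch of the idea (plain Python lists standing in for the 1RW blocks and the LVT FFRAM; names are illustrative, not from the repository):

```python
# Two halves: one accepts writes on even cycles (the 1eW1R pair), the
# other on odd cycles (the 1oW1R pair).  A per-address "live value table"
# bit records which half wrote last, so a read is steered to the half
# holding the current value.
class PhasedLVTRegfile:
    def __init__(self, depth):
        self.half = [[0] * depth, [0] * depth]  # even-write, odd-write
        self.lvt = [0] * depth                  # which half is live

    def write(self, cycle, addr, value):
        parity = cycle & 1           # even cycle -> half 0, odd -> half 1
        self.half[parity][addr] = value
        self.lvt[addr] = parity      # remember the last writer

    def read(self, addr):
        # the LVT steers the read to the half with the live value
        return self.half[self.lvt[addr]][addr]
```

the internal copy from the even-write block to its odd-cycle twin (which is what frees up the second read port) is abstracted away here; each `half` models one 1eW1R or 1oW1R pair as a whole.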
(In reply to Cesar Strauss from comment #5)
> Now, repeat this, to make an 1oW1R. Together with a "Live Value Table", we
> would get a full 1W1R memory out of four 1RW memories (plus multiplexers,
> and FFRAM for the LVT).

I think the problem is in the "Live Value Table". I don't see how you can live with a 1RW block there?
(In reply to Staf Verhaegen from comment #6)
> I think the problem is in the "Live Value Table". I don't see how you can
> live with a 1RW block there?

Indeed, it will be a plain 1W1R FFRAM, but only 1 bit wide per write lane.

We could try to get rid of the LVT by using the XOR trick (as described in http://people.csail.mit.edu/ml/pubs/trets_multiport.pdf). It would involve adding another half-duty-cycle read port, I think, making it 50% larger (6 x 1RW blocks).

I suppose it starts making sense for larger memories and/or more read ports, where a deep multi-port FFRAM LVT would cost more.

This is of course assuming we won't have a dual-port memory block available; otherwise this whole exercise is moot, I guess.
Things are going according to plan.

1) Simulation model of a transparent synchronous 1RW memory block in nMigen.

Code:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=332653c94f0e6369fa8d96087a7e392430a10daf#l28

Test case waveforms:
python ~/src/soc/src/soc/regfile/sram_wrapper.py SinglePortSRAMTestCase.test_sram_model
gtkwave test_sram_model.gtkw

2) Wrapper around two 1RW memory blocks, allowing an independent read port, even though the write port still works only on even (or odd) cycles.

Code:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=332653c94f0e6369fa8d96087a7e392430a10daf#l233

Test case waveforms:
python ~/src/soc/src/soc/regfile/sram_wrapper.py PhasedDualPortRegfileTestCase.test_phased_dual_port_regfile
gtkwave test_phased_dual_port_1_transparent.gtkw

Both were formally verified, giving more confidence than just a few targeted test cases.

Next is implementing a full 1W/1R regfile, with four 1RW memories and a LVT. Later, we can go back to the original plan of double-clocking a single 1RW memory, but at least we will have the choice of a safer fallback, if needed.
Full 1W/1R register file works!

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=695108b7e98dc23073cf952825e18b6dce10e6d8#l544

Test case waveform:

$ python ~/src/soc/src/soc/regfile/sram_wrapper.py DualPortRegfileTestCase.test_dual_port_regfile
$ gtkwave test_dual_port_regfile_transparent.gtkw
(In reply to Cesar Strauss from comment #9)
> Full 1W/1R register file works!

cool! if only it didn't require 4x 1RW SRAMs to do it :) but it is important, as a fall-back.

> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=695108b7e98dc23073cf952825e18b6dce10e6d8#l544

ah, a single bit for the "last value": that could be a DFF, it is not so costly.
(In reply to Luke Kenneth Casson Leighton from comment #10)
> ah, a single bit for the "last value": that could be a DFF,
> it is not so costly.

Well, if you want byte-level granularity for writes, you need one DFF per stored byte, or 8 DFFs per stored 64-bit register...

The XOR trick doesn't need any DFF RAM, but costs 6 x 1RW blocks. It works like this:

1) As before, there are two memories, each reading on every cycle and writing on alternate cycles (2 x 1RW each).

2) Instead of a MUX, the read port is a direct XOR of the two memories.

3) Writes happen in two cycles: first, read the current value of the *other* memory at the write location. Then, on *this* memory, write that read value XORed with the desired value. This recovers the desired value on read: (other XOR desired) XOR other = desired.

We do need an extra read port, but it is only needed on alternate cycles, so it can be just an extra 1RW block per memory. This gives 3 x 1RW per memory, or 6 x 1RW total.

The nice thing is, this extra 2 x 1RW per write port is amortized as you add read ports:

1W1R = 6 x 1RW (instead of 4)
1W2R = 10 x 1RW (instead of 8)
etc.
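The read/modify/write can be demonstrated in a few lines of plain Python (behavioral only; two lists stand in for the two memories, and the extra half-duty-cycle read port is the lookup of the other bank inside `write`):

```python
# Minimal model of the XOR trick: the stored value is the XOR of the two
# banks.  A write to one bank first reads the *other* bank at the same
# address, then stores (other XOR desired), so the read-side XOR recovers
# the data: (other ^ desired) ^ other == desired.
class XorRegfile:
    def __init__(self, depth):
        self.bank = [[0] * depth, [0] * depth]

    def write(self, which, addr, value):
        other = self.bank[1 - which][addr]       # the extra read port
        self.bank[which][addr] = other ^ value   # store the XOR difference

    def read(self, addr):
        # read port is a direct XOR of the two banks, no MUX or LVT
        return self.bank[0][addr] ^ self.bank[1][addr]


rf = XorRegfile(8)
rf.write(0, 3, 0xAB)   # even-cycle write lands in bank 0
print(hex(rf.read(3)))  # 0xab
rf.write(1, 3, 0xCD)   # odd-cycle write lands in bank 1
print(hex(rf.read(3)))  # 0xcd
```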
I'm worrying a bit about power-on initialization of the register file, when using this wrapper.

As I understand it, on an ASIC, SRAM blocks hold random data at power-on, and even the reset signal will not affect their contents.

If the register file is allowed to be undefined at power-on in the Instruction Set Architecture, then I guess it's OK to read random data before the first write. However, if multi-porting is implemented by copying then, at power-on, reads from the same location will return different data on different ports! I mean, I guess it's OK to read random data at power-on, but inconsistent data???

Furthermore, with the LVT implementation, a write on any port suffices to put a register location in a consistent state. However, with the XOR implementation, which does read/modify/write, I believe you need to write once on *every* write port to accomplish that.

So, do you think we should add a state machine to initialize the register file at power-on?
hmmm, intriguing. no: as best i understand it, all regfiles are supposed to initialise to zeros. MSR and PC are different.

i mean... we *could* leave the regfiles in an unknown state, because coldboot is supposed not to assume the contents are zero. doesn't feel safe though.

yes, having a general-purpose wrapper that can perform initialisation would be good, reporting back a Signal that says "initialisation completed". can you raise that as a separate bugreport?
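one possible shape for that wrapper's init logic, sketched behaviorally (names like `done` are illustrative; real code would be an nMigen FSM driving the regfile's normal write port):

```python
# Sweep every address once at power-on, writing zero through the ordinary
# write port, then raise a "done" flag: the equivalent of the suggested
# "initialisation completed" Signal.
class InitSequencer:
    def __init__(self, depth):
        self.depth = depth
        self.addr = 0
        self.done = False

    def tick(self, write_port):
        # call once per clock until the whole regfile is cleared;
        # normal operation should be held off until self.done is True
        if not self.done:
            write_port(self.addr, 0)
            self.addr += 1
            self.done = (self.addr == self.depth)
```

for the XOR variant, each tick would need to perform the clearing write on *every* write port (per comment #12), not just one.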