781 – create wrapper register files around 1R-or-1W SRAMs

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's second ASIC
Classification:	Unclassified
Component:	source code
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Cesar Strauss

URL:

Depends on:
Blocks:

Reported:	2022-03-01 21:46 GMT by Luke Kenneth Casson Leighton
Modified:	2022-05-03 11:56 BST
CC List:	4 users

See Also:
NLnet milestone:	NGI.POINTER.Gigabit.ASIC
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:	724
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Description Luke Kenneth Casson Leighton 2022-03-01 21:46:41 GMT

base-level SRAMs are 1R or 1W, this is a hard limitation of the
way the cells are designed.  addressing can be provided to give
multiple banks and multiple rows but with every single cell being
1R *or* 1W, the entire (larger) SRAM is also so limited.

register files obviously need much more than a single read port
or a single write port and they need a bare minimum of 1R *and*
1W on the *same* clock cycle.  this can be provided (in effect)
by running at Double Data Rate, where one clock edge performs 1x read
and the other performs 1x write.

on top of that DDR block, a *second* trick is to have multiple
such SRAMs, write to all of them and allow independent reads.

both these things need doing

Comment 1 Jacob Lifshay 2022-03-06 19:28:44 GMT

idk if i'm interpreting correctly, but i think i found a dual-ported sram on skywater130:
sram cell (other needed stuff is in parent dir):
https://github.com/google/skywater-pdk-libs-sky130_fd_bd_sram/tree/main/cells/openram_dp_cell

Comment 2 Cesar Strauss 2022-03-06 21:05:29 GMT

By any chance, is the SRAM in question exactly the same as the one on the chip, which is described on bug 502?

If so, I can adopt the simulation model on bug 502, comment 8.

Comment 3 Jacob Lifshay 2022-03-09 23:42:57 GMT

(In reply to Jacob Lifshay from comment #1)
> idk if i'm interpreting correctly, but i think i found a dual-ported sram on
> skywater130:
> sram cell (other needed stuff is in parent dir):
> https://github.com/google/skywater-pdk-libs-sky130_fd_bd_sram/tree/main/
> cells/openram_dp_cell

i forgot we're not using skywater130, oops...thanks lkcl for pointing that out in the meeting today!

Comment 4 Luke Kenneth Casson Leighton 2022-03-10 00:10:06 GMT

(In reply to Jacob Lifshay from comment #3)

> i forgot we're not using skywater130, oops...thanks lkcl for pointing that
> out in the meeting today!

if we did 32 bit Power we stand a slim chance rather than no chance of
fitting into 10 mm^2 of sky130 MPWs.

(In reply to Cesar Strauss from comment #2)
> By any chance, is the SRAM in question exactly the same as the one on the
> chip, which is described on bug 502?

no it was part of a 4k suite of SRAM cells, and 1R-or-1W

> If so, I can adopt the simulation model on bug 502, comment 8.

assuming that Staf provides the cell, yes.

that one is a 1R *OR* 1W in theory with the back to back cell idea
you came up with, yes.  the trick with that will be to ensure all
possible permutations of reads and writes over 3 clock cycles
will work as 1R *AND* 1W.

clock 1      clock 2      clock 3
no r no w    no r no w    no r no w
r
      w
r     w
             r
r            r
      w      r
r     w      r
                   w
r                  w

etc. basically 64 permutations of 2 actions (r and w) over 3 cycles.
maybe 256 over 4 cycles.
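
A test bench could enumerate these cases mechanically. The sketch below is plain Python and is not taken from the project's test suite; the names `CYCLE_STATES` and `access_patterns` are made up for illustration. It assumes each cycle independently enables or disables one read and one write, which gives 4 states per cycle.

```python
from itertools import product

# Each cycle has two independent enables: (read, write).
CYCLE_STATES = list(product((False, True), repeat=2))

def access_patterns(n_cycles):
    """Return every combination of (read, write) enables over n_cycles."""
    return list(product(CYCLE_STATES, repeat=n_cycles))

patterns_3 = access_patterns(3)
patterns_4 = access_patterns(4)
print(len(patterns_3), len(patterns_4))  # 64 256
```

Each pattern is a tuple of per-cycle (read, write) pairs, so a test can simply loop over the list and drive the wrapper with one pattern at a time.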

Comment 5 Cesar Strauss 2022-03-26 17:54:45 GMT

Without going for multiple clocks, I think it may be possible to implement a 1W1R memory block on a single clock, using four 1RW memory blocks of the same size.

It would go like this:

The first 1RW memory block would alternate between read and write. For instance, write on even cycles, read on odd cycles. Let's call it a 1eW1oR memory.

The second memory, on odd cycles, copies the value just written to the first memory, and allows reading on even cycles (1oW1eR). Its "odd-only" write port is tied up by the copy, but its read port (on even cycles) is free.

That way, we can still write only on even cycles, but can now read both on even and odd cycles: 1eW1R

Now, repeat this, to make a 1oW1R. Together with a "Live Value Table", we would get a full 1W1R memory out of four 1RW memories (plus multiplexers, and FFRAM for the LVT).

I suspect the area cost of this 1W1R would not compare favourably with one based on a single 2RW memory block but, in the absence of such a block, it can compete, at least in terms of risk, with a DDR 1RW memory implementation.
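
To make the scheme above concrete, here is a cycle-level behavioural sketch in plain Python. It is only an illustration under simplifying assumptions, not the nMigen implementation: all class names are hypothetical, a read returns the value stored by writes from earlier cycles (no write-through), and every block starts at zero. Each 1RW block asserts that it is accessed at most once per cycle.

```python
class SinglePortSRAM:
    """Behavioural 1RW block: at most one access (read or write) per cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.last_cycle = None

    def _claim(self, cycle):
        # Enforce the single-port limit.
        assert self.last_cycle != cycle, "1RW block accessed twice in one cycle"
        self.last_cycle = cycle

    def read(self, cycle, addr):
        self._claim(cycle)
        return self.mem[addr]

    def write(self, cycle, addr, data):
        self._claim(cycle)
        self.mem[addr] = data


class PhasedPair:
    """Two 1RW blocks: writes on one phase only, reads on every cycle."""

    def __init__(self, depth, write_phase):
        self.write_phase = write_phase        # 0: even cycles, 1: odd cycles
        self.primary = SinglePortSRAM(depth)  # written on write_phase
        self.shadow = SinglePortSRAM(depth)   # receives the copy one cycle later
        self.pending = None                   # (addr, data) waiting to be copied

    def step(self, cycle, rd_addr, wr):
        on_phase = cycle % 2 == self.write_phase
        assert on_phase or wr is None, "write offered on the wrong phase"
        # Whichever block is not writing this cycle serves the read.
        reader = self.shadow if on_phase else self.primary
        rd_data = None if rd_addr is None else reader.read(cycle, rd_addr)
        if on_phase:
            if wr is not None:
                self.primary.write(cycle, *wr)
            self.pending = wr
        elif self.pending is not None:
            self.shadow.write(cycle, *self.pending)
            self.pending = None
        return rd_data


class DualPortRegfile:
    """1W1R built from two phased pairs (four 1RW blocks) plus an LVT."""

    def __init__(self, depth):
        self.pairs = [PhasedPair(depth, 0), PhasedPair(depth, 1)]
        self.lvt = [0] * depth  # which pair wrote each address last
        self.cycle = 0

    def step(self, rd_addr=None, wr=None):
        phase = self.cycle % 2
        data = [pair.step(self.cycle, rd_addr, wr if i == phase else None)
                for i, pair in enumerate(self.pairs)]
        result = None if rd_addr is None else data[self.lvt[rd_addr]]
        if wr is not None:
            self.lvt[wr[0]] = phase
        self.cycle += 1
        return result


rf = DualPortRegfile(depth=8)
rf.step(wr=(3, 0x55))                         # cycle 0: write only
print(hex(rf.step(rd_addr=3, wr=(3, 0xAA))))  # cycle 1: 0x55
print(hex(rf.step(rd_addr=3)))                # cycle 2: 0xaa
```

The LVT here is an ordinary list read and written in the same cycle, which corresponds to the flip-flop based 1W1R table mentioned above rather than to a 1RW block.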

Comment 6 Staf Verhaegen 2022-03-28 09:08:05 BST

(In reply to Cesar Strauss from comment #5)
> Without going for multiple clocks, I think it is maybe possible to implement
> a 1W1R memory block, using a single clock, using four 1RW memory blocks of
> the same size.
> 
> It would go like this:
> 
> The first 1RW memory block would alternate between read and write. For
> instance, write on even cycles, read on odd cycles. Let's call it an 1eW1oR
> memory.
> 
> The second memory, on odd cycles, copies the value just written on the first
> memory, and allows reading on even cycles (1oW1eR). The "odd-only" write
> port is tied, but the read port (on even cycles) is free.
> 
> That way, we can still write only on even cycles, but can now read both on
> even and odd cycles: 1eW1R
> 
> Now, repeat this, to make an 1oW1R. Together with a "Live Value Table", we
> would get a full 1W1R memory out of four 1RW memories (plus multiplexers,
> and FFRAM for the LVT).

I think the problem is in the "Live Value Table". I don't see how you can live with a 1RW block there ?

Comment 7 Cesar Strauss 2022-03-28 11:54:06 BST

(In reply to Staf Verhaegen from comment #6)
> I think the problem is in the "Live Value Table". I don't see how you can
> live with a 1RW block there ?

Indeed, it will be a plain 1W1R FFRAM, but only 1-bit wide, per write lane.

We could try to get rid of the LVT, using the XOR trick (as described in http://people.csail.mit.edu/ml/pubs/trets_multiport.pdf).

It should involve adding another half-duty-cycle read port, I think, making it 50% larger (6 x 1RW blocks).

I suppose it will start making sense for larger memories and/or more read ports, where a deep multi-port FFRAM LVT will cost more.

This is of course assuming we won't have a dual port memory block available, otherwise all this exercise is moot, I guess.

Comment 8 Cesar Strauss 2022-04-03 23:59:18 BST

Things are going according to plan.

1) Simulation model of a transparent synchronous 1RW memory block in nMigen:

Code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=332653c94f0e6369fa8d96087a7e392430a10daf#l28

Test case waveforms:

python ~/src/soc/src/soc/regfile/sram_wrapper.py SinglePortSRAMTestCase.test_sram_model
gtkwave test_sram_model.gtkw

2) Wrapper around two 1RW memory blocks, allowing an independent read port, even if the write port still works only on even (or odd) cycles.

Code:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=332653c94f0e6369fa8d96087a7e392430a10daf#l233

Test case waveforms:

python ~/src/soc/src/soc/regfile/sram_wrapper.py PhasedDualPortRegfileTestCase.test_phased_dual_port_regfile
gtkwave test_phased_dual_port_1_transparent.gtkw

Both were formally verified, giving more confidence than just a few targeted test cases.

Next is implementing a full 1W/1R regfile, with four 1RW memories and an LVT.

Later, we can go back to the original plan of double clocking a single 1RW memory, but at least we will have a choice of a safer fallback, if needed.

Comment 9 Cesar Strauss 2022-04-16 22:36:17 BST

Full 1W/1R register file works!

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.py;hb=695108b7e98dc23073cf952825e18b6dce10e6d8#l544

Test case waveform:

$ python ~/src/soc/src/soc/regfile/sram_wrapper.py DualPortRegfileTestCase.test_dual_port_regfile

$ gtkwave test_dual_port_regfile_transparent.gtkw

Comment 10 Luke Kenneth Casson Leighton 2022-04-16 23:07:53 BST

(In reply to Cesar Strauss from comment #9)

> Full 1W/1R register file works!

cool!  if only it didn't require 4x 1RW SRAMs to do it :)
but, it is important, as a fall-back.
 
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/sram_wrapper.
> py;hb=695108b7e98dc23073cf952825e18b6dce10e6d8#l544

ah, a single-bit for the "last value", that could be a DFF,
it is not so costly.

Comment 11 Cesar Strauss 2022-04-17 18:19:01 BST

(In reply to Luke Kenneth Casson Leighton from comment #10)
> ah, a single-bit for the "last value", that could be a DFF,
> it is not so costly.

Well, if you want byte-level granularity for writes, you need one DFF per stored byte, or 8 DFF per stored 64-bit register...

The XOR trick doesn't need any DFF RAM, but costs 6 x 1RW blocks. It works like this:

1) Like before, there are two memories, each reading on every cycle, and writing on alternate cycles (2 x 1RW each)
2) Instead of a MUX, the read port is a direct XOR of the two memories.
3) Writes happen in two cycles:
   First, read the current value of the *other* memory, at the write location.
   Then, on *this* memory, write that read value, XORed with the desired value.

This recovers the desired value when read:
  (other XOR desired) XOR other = desired

We do need an extra read port, but it is only needed on alternate cycles, so it can be just an extra 1RW block, per memory. This gives 3 x 1RW per memory, or 6 x 1RW total.

Nice thing is, this extra 2 x 1RW per write port is amortized as you add read ports:
   1W1R = 6 x 1RW (instead of 4)
   1W2R = 10 x 1RW (instead of 8)
   etc.
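
The XOR encoding itself can be checked with a few lines of plain Python. This is only a sketch of the data encoding: it does not model the 1RW port timing or the alternate-cycle write phases, and the names `XorEncodedMemory`, `blocks_lvt` and `blocks_xor` are made up for illustration. The block-count formulas are an extrapolation from the two figures listed above, assuming a single write port.

```python
class XorEncodedMemory:
    """Abstract model of the XOR trick: two banks, reads XOR both banks."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]

    def write(self, bank, addr, data):
        other = self.banks[1 - bank][addr]     # first cycle: read the other bank
        self.banks[bank][addr] = other ^ data  # second cycle: store other XOR data

    def read(self, addr):
        # (other XOR desired) XOR other = desired
        return self.banks[0][addr] ^ self.banks[1][addr]


def blocks_lvt(n_read):
    """1RW blocks for the LVT scheme with one write port (assumed 4 per read port)."""
    return 4 * n_read


def blocks_xor(n_read):
    """1RW blocks for the XOR scheme: same as LVT plus 2 for the extra read."""
    return 4 * n_read + 2


mem = XorEncodedMemory(depth=8)
mem.write(0, 3, 0xAB)
mem.write(1, 3, 0x5C)      # a later write through the other bank wins
print(hex(mem.read(3)))    # 0x5c
print(blocks_xor(1), blocks_xor(2))  # 6 10
```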

Comment 12 Cesar Strauss 2022-05-03 11:06:40 BST

I'm worrying a bit about power-on initialization of the register file, when using this wrapper.

As I understand it, on an ASIC, SRAM blocks have random data at power on, and even the reset signal will not affect their contents.

If the register file is allowed to be undefined at power on, in the Instruction Set Architecture, then I guess it's OK to read random data before the first write.

However, if multi-port is implemented by copying, then, at power on, reads from the same location will return different data, on different ports!

I mean, I guess it's OK to read random data at power on, but inconsistent data???

Furthermore, with the LVT implementation, a write on any port will suffice to put a register location in a consistent state.

However, with the XOR implementation, which does read/modify/write, I believe you need to write once on *every* write port, to accomplish that.

So, do you think we should add a state machine to initialize the register file at power on?
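
The inconsistency is easy to reproduce in a toy model. The sketch below is plain Python with made-up names; it assumes a copy-based scheme in which each read port is served by its own copy, uses a seeded random generator as a stand-in for power-on contents, and ignores the one-cycle copy delay.

```python
import random

DEPTH, WIDTH = 4, 64
rng = random.Random(781)  # arbitrary seed, stands in for power-on noise

# Two copies of the same register file, each powering up with unrelated data.
copy_a = [rng.getrandbits(WIDTH) for _ in range(DEPTH)]
copy_b = [rng.getrandbits(WIDTH) for _ in range(DEPTH)]

def read(port, addr):
    """Each read port is served by its own copy."""
    return (copy_a, copy_b)[port][addr]

def write(addr, data):
    """A write updates every copy (the copy delay is not modelled)."""
    copy_a[addr] = data
    copy_b[addr] = data

before = read(0, 2) == read(1, 2)  # ports disagree before the first write
write(2, 0x1234)
after = read(0, 2) == read(1, 2)   # the location is consistent afterwards
print(before, after)
```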

Comment 13 Luke Kenneth Casson Leighton 2022-05-03 11:56:10 BST

hmmm intriguing.

no, as best i understand it, all regfiles are supposed to initialise to zeros.
MSR and PC are different.

i mean... we *could* leave the regfiles in an unknown state, because
coldboot is supposed to not assume contents are zero.

doesn't feel safe though.

yes, having a general-purpose wrapper that can perform initialisation would
be good, and reporting back a Signal that says "initialisation completed".

can you raise that as a separate bugreport?
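
As a rough idea of what such a wrapper would do, here is a behavioural sketch in plain Python. It is an assumption-laden illustration, not a design: `InitSequencer` is a hypothetical name, it issues one zero write per address per cycle, and it assumes a single write per location is enough (true for the LVT scheme as described in comment 12, not for the XOR scheme).

```python
class InitSequencer:
    """Walks every address once, writing zero, then reports completion."""

    def __init__(self, depth):
        self.depth = depth
        self.addr = 0

    @property
    def done(self):
        # Plays the role of the "initialisation completed" signal.
        return self.addr >= self.depth

    def step(self):
        """Return the (addr, data) write for this cycle, or None once done."""
        if self.done:
            return None
        wr = (self.addr, 0)
        self.addr += 1
        return wr


regfile = [0xDEADBEEF] * 8   # stand-in for unknown power-on contents
seq = InitSequencer(depth=len(regfile))
cycles = 0
while not seq.done:
    addr, data = seq.step()
    regfile[addr] = data
    cycles += 1
print(cycles, regfile)
```

In hardware the wrapper would hold off normal reads and writes until `done` is asserted.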