Bug 724 - Determine required memory compiler developments
Summary: Determine required memory compiler developments
Status: CONFIRMED
Alias: None
Product: Libre-SOC's second ASIC
Classification: Unclassified
Component: Milestones
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Assignee: Staf Verhaegen
URL:
Depends on:
Blocks: 690
Reported: 2021-10-11 19:09 BST by Staf Verhaegen
Modified: 2021-12-06 21:30 GMT

See Also:
NLnet milestone: NLnet.2021.02A.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 589
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
"Staf Verhaegen" = 2000


Description Staf Verhaegen 2021-10-11 19:09:03 BST
- Make an inventory of all memory blocks and the number of ports each needs.
Investigate possible architecture optimizations for port reduction.
- Investigate the architecture of a multi-port cell.
- Define a multi-port extension of the memory compiler, including the needed budget.
Comment 1 Staf Verhaegen 2021-10-11 19:12:24 BST
I don't see this as a dependency of #630. That one is single-port; this task is about what has to happen after #630.
Comment 2 Staf Verhaegen 2021-10-11 19:13:53 BST
Luke already commented on this in email:

'Yes.  i am trying to avoid more than one write port but if that's
possible (2W) it would be amazing and speed things up.

we do need 2W for the DMA buffers though.

if it is convenient i will be "stratifying" the regfiles so that they're
subdivided (interleaved), odd-even banks.

bank 1: r0 r2 r4 r6 r8 ...
bank 2: r1 r3 r5 r7 r9 ...

some of the very small regfiles (FAST, STATE, XER, CR) i would
actually recommend leaving them as DFFs.  the CR one is just
ridiculously complicated.  QTY 16 of 32-bit regs comprising 4-bit
access *on top* of a (full) 32-bit access port.  multiple read-write
ports @ *both* 4-bit *and* 32-bit. plus the CR regfile is actually
unary-addressed (0b11111111 masks) for the LSBs, not
binary-addressed, but *binary* addressed for the MSBs
(selecting which of the 16).'
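
For illustration, the odd-even interleaving described in the email could be modelled as below. This is a minimal sketch, not code from the soc repository: the function name `odd_even_bank` and the choice of using the register number's LSB as the bank select are assumptions that simply reproduce the bank 1 / bank 2 listing above.

```python
from typing import Tuple


def odd_even_bank(reg: int) -> Tuple[int, int]:
    """Map a register number to (bank, row) for an odd-even split.

    Assumption: bank 1 holds r0 r2 r4 ..., bank 2 holds r1 r3 r5 ...,
    and the row is the position of the register inside its bank.
    """
    if reg < 0:
        raise ValueError("register number must be non-negative")
    bank = 1 + (reg & 1)  # LSB of the register number selects the bank
    row = reg >> 1        # remaining bits select the row inside the bank
    return bank, row


if __name__ == "__main__":
    for r in (0, 1, 2, 3, 8, 9):
        bank, row = odd_even_bank(r)
        print(f"r{r} -> bank {bank}, row {row}")
```

With this split, two accesses that hit different banks in the same cycle do not compete for the same port, which is the motivation for interleaving.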
Comment 3 Staf Verhaegen 2021-10-11 19:14:59 BST
Could you point me to example code with these blocks with 4-bit and 32-bit ports ?
Comment 4 Staf Verhaegen 2021-10-11 19:23:29 BST
General principle is that number of ports of a memory reflects the number of parallel accesses one needs to do in the same clock cycle. Adding an extra port will increase area of the SRAM block.
Otherwise, higher-level logic should be used to arbitrate accesses, including bit-width granularity changes etc.
Comment 5 Luke Kenneth Casson Leighton 2021-10-11 23:15:56 BST
(In reply to Staf Verhaegen from comment #3)
> Could you point me to example code with these blocks with 4-bit and 32-bit
> ports ?

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;h=8f881423e4aedfc38b4f35d78c842aec908cf990;hb=HEAD#l114

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/virtual_port.py;hb=HEAD

that's just one 32bit regfile for the standard Power ISA Condition Register
SVP64 needs *16* of these.

each 32bit CR has 8 CR Fields, 4 bits each.  these 4bit fields are accessed
by crand, cror etc. and the full 32bit by mcrf and mfcr.

with the 128 CR Fields being used for predication as well as Rc=1 targets
(vectorised) there will be a HELL of a lot of read and write ports onto
CRs.

to prevent delays we may need something like 4R3W or even 5R3W 32bit, with an
8bit wide write-enable (one bit per 4bit CR field)

fortunately there would only be QTY 16 such 4R3W/5R3W 32bit regs.

it is perfectly fine to use one of those 32-bit ports to synthesise
the 4-bit ports via an adapter.  it is the opposite arrangement to
that virtual_port.py file but that's ok.
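
A sketch of such an adapter, in plain Python rather than nmigen, may make the idea concrete. Everything here is hypothetical and is not taken from regfiles.py or virtual_port.py: the function names are invented, and the sketch assumes that CR field 0 sits in the most significant nibble and that write-enable bit i covers the nibble at bit offset 4*i.

```python
CR_FIELDS = 8    # CR fields per 32-bit Condition Register
FIELD_BITS = 4   # bits per CR field


def field_write_to_word(field: int, value: int):
    """Turn a 4-bit CR-field write into (data32, wen8) for a 32-bit port.

    Assumption: field 0 is the most significant nibble of the word.
    """
    if not 0 <= field < CR_FIELDS:
        raise ValueError("field out of range")
    if not 0 <= value < (1 << FIELD_BITS):
        raise ValueError("value does not fit in 4 bits")
    lane = CR_FIELDS - 1 - field           # nibble position counted from the LSB
    data32 = value << (lane * FIELD_BITS)  # place the nibble in its lane
    wen8 = 1 << lane                       # unary (one-hot) write-enable
    return data32, wen8


def apply_write(old: int, data32: int, wen8: int) -> int:
    """Model the row update: only nibbles with their enable bit set change."""
    mask = 0
    for lane in range(CR_FIELDS):
        if (wen8 >> lane) & 1:
            mask |= 0xF << (lane * FIELD_BITS)
    return (old & ~mask & 0xFFFFFFFF) | (data32 & mask)


def field_read(word: int, field: int) -> int:
    """Extract one 4-bit CR field from a 32-bit read."""
    lane = CR_FIELDS - 1 - field
    return (word >> (lane * FIELD_BITS)) & 0xF


if __name__ == "__main__":
    data, wen = field_write_to_word(0, 0b1010)
    print(hex(apply_write(0x12345678, data, wen)))
```

A full 32-bit write is the same operation with all eight enable bits set, so both access widths can share one physical port.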
Comment 6 Luke Kenneth Casson Leighton 2021-10-11 23:52:09 BST
(In reply to Staf Verhaegen from comment #4)
> General principle is that number of ports of a memory reflects the number of
> parallel accesses one needs to do in the same clock cycle. Adding an extra
> port will increase area of the SRAM block.

with 128 64 bit INT and FP regs we may have to do an actual reg cache
for higher clock rates, i would like to avoid that complication in the
180nm / 130nm geometries if practical.

i also do not mind breaking down into a stratified arrangement
of four *separate* regfiles or even 5:

32 regs r0-r31 all accessible
24 regs r32 r36 r40 ... r124
24 regs r33 r37 r41 ... r125
24 regs r34         ... r126
24 regs r35         ... r127

where each of those is 4R1W 64bit with byte-level write-enable.

and if an instruction such as add r31, r33, r127 is ever issued, the data
goes into a cyclic buffer that shuffles along, with appropriate
latency, to match up data from regfile port to ALU which will
*also* be stratified (5 different separate ALU banks, yes really,
and yes it'll be a Monster but hey).

however, again, i would like to avoid doing a Monster Vector Processor
like this in the first iteration @ 130/180 nm

avoiding all external complications like that: bare minimum:
for FP and INT i will be happy with 4R1W on a 128x 64-bit
regfile with byte-level write-enable.

this gives FMAC single cycle and also INT-MAC with room for an INT
predicate read without delay.
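
The five-regfile stratification listed above can be written down as a simple mapping. This is only an illustrative sketch: the function name `stratified_bank` and the bank numbering 0..4 are assumptions, chosen so that bank 0 is the r0-r31 file and banks 1-4 follow the order of the four 24-register lines.

```python
from typing import Tuple


def stratified_bank(reg: int) -> Tuple[int, int]:
    """Map r0..r127 to (bank, row) for the stratified arrangement.

    bank 0     : r0-r31, 32 rows
    banks 1..4 : r32..r127 interleaved modulo 4, 24 rows each
    (bank numbers are an assumption made for this sketch)
    """
    if not 0 <= reg < 128:
        raise ValueError("register number must be in 0..127")
    if reg < 32:
        return 0, reg
    return 1 + (reg % 4), (reg - 32) // 4


if __name__ == "__main__":
    # The operands of "add r31, r33, r127" land in three different banks.
    for r in (31, 33, 127):
        print(f"r{r} -> bank/row {stratified_bank(r)}")
```

An instruction whose operands fall in different banks is the case that needs the cyclic buffer mentioned above, since the data has to be moved between the separate regfiles and ALU banks.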
Comment 7 Jacob Lifshay 2021-11-26 19:29:09 GMT
assuming this is a subtask of #589, if not, please either remove budget (comment out toml field and clear numeric fields) or set correct milestone and budget parent
Comment 8 Luke Kenneth Casson Leighton 2021-11-26 19:49:30 GMT
(In reply to Jacob Lifshay from comment #7)
> assuming this is a subtask of #589,

#589 is the top-level NLnet proposal 2021-02-052.
https://libre-soc.org/nlnet_2021_crypto_router/
Comment 9 Staf Verhaegen 2021-11-27 11:45:31 GMT
What is the difference between #589 and #690?
Comment 10 Staf Verhaegen 2021-11-27 12:50:55 GMT
(In reply to Staf Verhaegen from comment #9)
> What is the difference between #589 and #690?

OK, saw it on IRC. #589 is NLnet CryptoRouter and #690 is NGI POINTER.
From my point of view this would fit under both of these, so you can decide which budget to use.
Comment 11 Luke Kenneth Casson Leighton 2021-12-06 21:28:31 GMT
Staf, I identified another location of SRAMs: the DCache and ICache.
there are two types:

* TLB and Page Table Entry caches
  - 1R1W
  - read on next cycle (sync)
  - write updates on NEXT cycle (sync)
  - forwarding needed (can be done externally,
    not part of SRAM)
  - 64 rows
  - 128 bit for tags, and 92 bit for PTEs
  - 4 write-enables therefore 4x32-bit for tags
    and 4x23-bit for PTEs

you can see the code which reads, modifies, then writes back
some bits.  I will make this use a Memory so it is more obvious

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;h=d0829f4df8ac56dbcf96b5b018a66730e24261c4;hb=1cf999d43c244b49894d77e8f2151089496239eb#l432

about 1k in size each, QTY 4 to be used.  i will check
that ICache is the same spec (should be)
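
A behavioural sketch of the 1R1W block with lane write-enables and external forwarding, in plain Python. The class and function names are hypothetical and this is not the dcache.py implementation. It assumes that a read returns the row as it was before a same-cycle write, that lane 0 is the least significant lane, and that "forwarding" means merging the lanes written in the same cycle into the read data.

```python
class SyncRam1R1W:
    """Behavioural model of a 1R1W synchronous SRAM with lane write-enables.

    Assumptions (not from the thread): read-before-write within a cycle,
    and lane 0 is the least significant lane of the row.
    """

    def __init__(self, rows: int, lanes: int, lane_bits: int):
        self.lanes = lanes
        self.lane_bits = lane_bits
        self.mem = [0] * rows
        self.rdata = 0  # registered read output, valid after clock()

    def lane_mask(self, wen: int) -> int:
        """Expand a per-lane write-enable into a per-bit mask."""
        mask = 0
        for i in range(self.lanes):
            if (wen >> i) & 1:
                mask |= ((1 << self.lane_bits) - 1) << (i * self.lane_bits)
        return mask

    def clock(self, raddr, waddr=None, wdata=0, wen=0):
        """One clock edge: capture the read, then apply the write."""
        self.rdata = self.mem[raddr]  # old contents, no forwarding
        if waddr is not None:
            m = self.lane_mask(wen)
            self.mem[waddr] = (self.mem[waddr] & ~m) | (wdata & m)
        return self.rdata


def forwarded_read(ram, raddr, waddr=None, wdata=0, wen=0):
    """External forwarding: merge same-cycle written lanes into the read."""
    old = ram.clock(raddr, waddr, wdata, wen)
    if waddr is not None and waddr == raddr:
        m = ram.lane_mask(wen)
        return (old & ~m) | (wdata & m)
    return old


if __name__ == "__main__":
    tags = SyncRam1R1W(rows=64, lanes=4, lane_bits=32)  # 4x32-bit tag row
    print(hex(forwarded_read(tags, 5, waddr=5, wdata=0xDEADBEEF, wen=0b0001)))
```

Keeping the forwarding outside the class mirrors the note above that it need not be part of the SRAM itself.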

* Cache SRAM
  - 1R1W
  - 128 rows
  - 64 bits
  - entire row written (no multiple wens)
  - read and write both 1 clock cycle

each 1k in size (128 rows x 8 bytes), QTY 8 to be used.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/cache_ram.py;h=50ee1367cc84301bcf9cecf0f6cae51d13273227;hb=1cf999d43c244b49894d77e8f2151089496239eb
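
The size figure for the Cache SRAM follows directly from the row count and row width given above. A small sketch of the arithmetic (the helper name `sram_bits` and the constant names are hypothetical):

```python
def sram_bits(rows: int, bits_per_row: int) -> int:
    """Total storage of one SRAM block, in bits."""
    return rows * bits_per_row


CACHE_ROWS = 128      # rows per Cache SRAM block
CACHE_ROW_BITS = 64   # bits per row
CACHE_QTY = 8         # number of blocks to be used

block_bits = sram_bits(CACHE_ROWS, CACHE_ROW_BITS)
block_bytes = block_bits // 8
total_bytes = block_bytes * CACHE_QTY

if __name__ == "__main__":
    print(f"one block: {block_bits} bits = {block_bytes} bytes")
    print(f"all {CACHE_QTY} blocks: {total_bytes} bytes")
```

One block is 128 x 64 = 8192 bits, i.e. 1024 bytes, which matches the "1k (128x8)" figure; eight blocks give 8192 bytes in total.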
Comment 12 Luke Kenneth Casson Leighton 2021-12-06 21:30:06 GMT
Staf, also, just fyi: this one has to move to NGI POINTER.