- Make an inventory of all memory blocks and the number of ports each needs.
- Investigate possible architecture optimizations for port reduction
- Investigate the architecture of a multi-port cell
- Define the multi-port extension of the memory compiler, including the needed budget
I don't see this as a dependency of #630. That is single-port; this task is about what has to happen after #630.
There is already a comment from Luke in email:
'Yes. i am trying to avoid more than one write port but if that's
possible (2W) it would be amazing and speed things up.
we do need 2W for the DMA buffers though.
if it is convenient i will be "stratifying" the regfiles so that they're
subdivided (interleaved), odd-even banks.
bank 1: r0 r2 r4 r6 r8 ...
bank 2: r1 r3 r5 r7 r9 ...
some of the very small regfiles (FAST, STATE, XER, CR) i would
actually recommend leaving them as DFFs. the CR one is just
ridiculously complicated. QTY 16of 32-bit regs comprising 4-bit
access *on top* of a (full) 32-bit access port. multiple read-write
ports @ *both* 4-bit *and* 32-bit. plus the CR regfile is actually
unary-addressed (0b11111111 masks) for the LSBs, not
binary-addressed, but *binary* addressed for the MSBs
(selecting which of the 16).'
Could you point me to example code for these blocks with 4-bit and 32-bit ports?
The general principle is that the number of ports of a memory reflects the number of parallel accesses one needs to do in the same clock cycle. Adding an extra port increases the area of the SRAM block.
Otherwise, higher-level logic should be used to arbitrate accesses, including bit-width granularity changes etc.
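To illustrate the arbitration point: when there are more simultaneous requesters than physical ports, higher-level logic serializes the accesses over several cycles. Below is a minimal, purely illustrative Python model (all names are hypothetical, not from the actual codebase) of a round-robin arbiter granting one requester per cycle on a single port:

```python
# Illustrative sketch: fewer ports than requesters means accesses must
# be serialized over multiple clock cycles by an arbiter.

class SinglePortArbiter:
    """Grant one requester per cycle, round-robin (behavioural model)."""
    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = self.n - 1  # so requester 0 wins the first tie

    def grant(self, requests):
        """requests: list of bools; returns index granted, or None."""
        for i in range(1, self.n + 1):
            cand = (self.last + i) % self.n
            if requests[cand]:
                self.last = cand
                return cand
        return None

arb = SinglePortArbiter(3)
# Two requesters collide: they are served on consecutive cycles.
assert arb.grant([True, True, False]) == 0
assert arb.grant([True, True, False]) == 1
```

The trade-off is exactly the one stated above: an extra port avoids the serialization delay but costs SRAM area.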
(In reply to Staf Verhaegen from comment #3)
> Could you point me to example code with these blocks with 4-bit and 32-bit
> ports ?
that's just one 32bit regfile for the standard Power ISA Condition Register
SVP64 needs *16* of these.
each 32bit CR has 8 CR Fields, 4 bits each. these 4bit fields are accessed
by crand, cror etc. and the full 32bit by mcrf and mfcr.
with the 128 CR Fields being used for predication as well as Rc=1 targets
(vectorised) there will be a HELL of a lot of read and write ports onto it.
to prevent delays we may need something like 4R3W or even 5R3W 32bit, with the write-enable
being 8 bits wide, one bit per 4-bit CR field
fortunately there would only be QTY 16 such 4R3W/5R3W 32bit regs.
it is perfectly fine to use one of those 32-bit ports to synthesise
the 4-bit ports via an adapter. it is the opposite arrangement to
that virtual_port.py file but that's ok.
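The adapter idea described above can be sketched in plain Python (this is a behavioural illustration only, not the actual virtual_port.py code): a 4-bit CR Field access is performed through the 32-bit port as a masked read-modify-write, where the 8-bit write-enable selects which of the eight fields is touched.

```python
# Hypothetical model: one 32-bit CR register holds eight 4-bit CR Fields.
# A 4-bit field write is synthesised on the 32-bit port via
# read-modify-write, gated by a per-field write-enable.

def write_cr_field(cr32, field, value4):
    """Write 4-bit value4 into CR field `field` (0 = most-significant
    nibble, matching Power ISA CR0..CR7 numbering) of word cr32."""
    shift = (7 - field) * 4          # CR0 occupies the MSB nibble
    mask = 0xF << shift
    return (cr32 & ~mask) | ((value4 & 0xF) << shift)

def read_cr_field(cr32, field):
    """Read a 4-bit CR field out of the 32-bit word."""
    shift = (7 - field) * 4
    return (cr32 >> shift) & 0xF

cr = 0x0000_0000
cr = write_cr_field(cr, 0, 0b1010)   # crand/cror-style 4-bit access
assert read_cr_field(cr, 0) == 0b1010
assert cr == 0xA000_0000             # full 32-bit mfcr-style view
```

In hardware the same effect is achieved without a read-modify-write cycle by driving only the 4-bit lane's write-enable, which is why the 8-bit-wide write-enable mentioned above matters.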
(In reply to Staf Verhaegen from comment #4)
> General principle is that number of ports of a memory reflects the number of
> parallel accesses one needs to do in the same clock cycle. Adding an extra
> port will increase area of the SRAM block.
with 128 64-bit INT and FP regs we may have to do an actual reg cache
for higher clock rates. i would like to avoid that complication in the
180nm / 130nm geometries if practical.
i also do not mind breaking down into a stratified arrangement
of four *separate* regfiles or even 5:
32 regs r0-r31 all accessible
24 regs r32 r36 r40 ... r124
24 regs r33 r37 r41 ... r125
24 regs r34 ... r126
24 regs r35 ... r127
where each of those is 4R1W 64bit with byte-level write-enable.
and if an instruction such as add r31, r33, r127 is ever issued, the data
goes into a cyclic buffer that shuffles along, with appropriate
latency, to match up data from regfile port to ALU which will
*also* be stratified (5 different separate ALU banks, yes really,
and yes it'll be a Monster but hey).
however again doing a Monster Vector Processor like this i would
like to avoid in the first iteration @ 130/180 nm
avoiding all external complications like that: bare minimum:
for FP and INT i will be happy with 4R1W on a 128x 64-bit
regfile with byte-level write-enable.
this gives single-cycle FMAC and also INT-MAC, with room for an INT
predicate read without delay.
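The stratified arrangement described above can be captured as a simple register-to-bank mapping (an illustrative sketch of the scheme as stated, not code from the repository): r0-r31 all live in bank 0, and r32-r127 are interleaved across banks 1-4 by register number modulo 4.

```python
# Hypothetical sketch of the 5-bank stratified regfile mapping:
#   bank 0: r0-r31 (all accessible)
#   bank 1: r32 r36 r40 ... r124
#   bank 2: r33 r37 r41 ... r125
#   bank 3: r34 ... r126
#   bank 4: r35 ... r127

def bank_of(reg):
    """Map a register number 0..127 to its stratified bank."""
    assert 0 <= reg <= 127
    if reg < 32:
        return 0
    return 1 + (reg % 4)

assert bank_of(31) == 0
assert bank_of(32) == 1   # first entry of bank 1
assert bank_of(33) == 2
assert bank_of(127) == 4  # 127 % 4 == 3 -> bank 4
```

An instruction whose operands fall in different banks (e.g. r31, r33, r127) hits three separate regfiles, which is what makes the cyclic-buffer shuffling described above necessary.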
Assuming this is a subtask of #589; if not, please either remove the budget (comment out the toml field and clear the numeric fields) or set the correct milestone and budget parent.
(In reply to Jacob Lifshay from comment #7)
> assuming this is a subtask of #589,
#589 is the top-level NLnet proposal 2021-02-052.
What is different between #589 and #690 ?
(In reply to Staf Verhaegen from comment #9)
> What is different between #589 and #690 ?
OK, saw on IRC. #589 is NLnet CryptoRouter and #690 is NGI POINTER.
From my point of view this would fit under both of these, so you can decide which budget to use.
Staf, I identified another location of SRAMs: the DCache and ICache.
there are two types:
* TLB and Page Table Entry caches
- read on next cycle (sync)
- write updates on NEXT cycle (sync)
- forwarding needed (can be done externally,
not part of SRAM)
- 64 rows
- 128 bit for tags, and 92 bit for PTEs
- 4 write-enables therefore 4x32-bit for tags
and 4x23-bit for PTEs
you can see the code which reads, modifies, then writes back
some bits. I will make this use a Memory so it is more obvious
each is about 1k in size; QTY 4 to be used. i will check
that ICache is the same spec (should be)
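The per-lane write-enables described for the tag memory can be modelled like this (a behavioural illustration under the stated spec, not the actual DCache code): a 64-row, 128-bit tag memory split into four 32-bit lanes, each with its own write-enable bit, so updating "some bits" only touches the lanes that changed.

```python
# Illustrative model of the 64-row, 128-bit tag memory with four
# 32-bit write-enable lanes (4 wen bits, one per lane).

LANES = 4
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1

mem = [0] * 64  # 64 rows of 128-bit tags

def write_row(row, data128, wen):
    """wen is a 4-bit write-enable, one bit per 32-bit lane."""
    for lane in range(LANES):
        if (wen >> lane) & 1:
            shift = lane * LANE_BITS
            mem[row] = (mem[row] & ~(LANE_MASK << shift)) | \
                       (((data128 >> shift) & LANE_MASK) << shift)

write_row(3, 0xDEAD_BEEF_1234_5678_0000_0000_CAFE_F00D, 0b0001)
assert mem[3] == 0xCAFE_F00D   # only lane 0 was written
```

The same lane structure would apply to the PTE side (4x23-bit lanes covering 92 bits), just with a narrower lane width.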
* Cache SRAM
- 128 rows
- 64 bits
- entire row written (no multiple wens)
- read and write both 1 clock cycle
each 1k in size (128x8), QTY 8 to be used.
staf also, just fyi, this one has to move to NGI POINTER.