Luke, it looks like you probably have something that works just as well as this would, so I'm resolving the bug.
(In reply to Jacob Lifshay from comment #1)
> Luke, it looks like you probably have something that works just as well as
> this would, so I'm resolving the bug.
maayybee... i don't know. it's taking me considerable time to piece together.
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > Luke, it looks like you probably have something that works just as well as
> > this would, so I'm resolving the bug.
> maayybee... i don't know. it's taking me considerable time to piece
nope, struggling. and the N-N-way comparisons for the high MSBs are going to kill power usage. it took me a while to realise what you were saying about using
an LRU (we have a PLRU implementation here):
can you take a look? assume that each LD/ST buffer has *two* "ports",
one that is 16 bytes wide with address port0_addr and is normally used,
the other is *8* bytes wide (actually, 7), is *always* at port0_addr+16,
and deals with misalignments.
actually, that's a red herring. to clarify: it's safe to assume that
misalignments have been "dealt with" by being pre-split into separate LD/ST operations.
also assume that addresses+len are expressed as:
* top bits numbered 4 and up as a binary address
* LSBs 0 to 3 as a 16-bit unary "mask" which *also* encapsulates data length
(with no misalignments, which means that some mask bit groups will be contiguous within the 16-byte row)
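for concreteness, here's that encoding sketched in plain Python (the function name and the aligned-only assertion are mine, purely illustrative):

```python
def encode_addr_len(addr: int, length: int) -> tuple[int, int]:
    """Split an address+length into (binary high bits, 16-bit unary mask).

    The low 4 address bits (LSBs 0-3) pick the starting byte within a
    16-byte row; the mask has `length` contiguous bits set from there,
    so it encapsulates the data length as well as the offset.
    Assumes no misalignment (access stays inside the 16-byte row).
    """
    assert length in (1, 2, 4, 8)
    start = addr & 0xF                      # byte offset within the row
    assert start + length <= 16, "misaligned accesses are pre-split elsewhere"
    mask = ((1 << length) - 1) << start     # unary mask: offset + length
    return addr >> 4, mask                  # bits 4 and up stay binary
```

e.g. an 8-byte access at 0x20 gives row 0x2 with mask 0xFF.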
also assume that it is *guaranteed* - by this class - that there will be
ZERO overlaps. as in, absolutely guaranteed.
it's the top MSBs of the address that are not covered by that (because the zero-overlap guarantee only applies within one 16-byte row).
i figured, rather than a ring buffer, a PriorityPicker could identify QTY 1
address to put through, then all other address MSBs go through... the PLRU?
i get lost at this point.
(In reply to Jacob Lifshay from comment #4)
> ok, reopening
thanks. whew. oh. one other thing: if it can be N-way-ported (assume that the L1 cache will at some point be dual-ported, minimum) that would be handy. if there are several LDs, on 16-byte boundaries for example, we don't want the L1 cache to be the bottleneck just because it's only possible to read/write one cache line at a time.
i'm keeeenly aware that this implies we would need a maaassive 256-bit-wide data bus to the L1 cache.
welcome to GPUs :)
I was assuming that the L1 cache has a single cache-line-sized R or W port, though having 2 ports makes it even better.
Remember that we probably want another cache-line-sized port for communication with the L2 and the rest of the SoC.
(In reply to Jacob Lifshay from comment #6)
> Was assuming that the L1 cache has a single cache-line-sized R or W port,
> though having 2 ports makes it even better.
here's the code i'd like to drop-in-place (literally), i.e. just use it.
that's already set up for "masking" - bear in mind it's only 32-bit-wide
so the x_mask signal is only 4 bits: we need to expand that to 16 bits,
and then it will fit perfectly with the 16-bit "mask" from the PartialAddressBitmap
so the other end of the interface, jacob, is the LoadStoreUnitInterface. a test can be written which bypasses the cache, for now, to make life a bit simpler.
> Remember that we probably want another cache-line-sized port for
> communication with the L2 and the rest of the SoC.
one step at a time :)
wait... ahh... everything memory-wise goes through the L1 cache. the cores are "slaves" of the wishbone bus (not masters). all communication is done by way of memory-addressing (all peripherals are memory-addressable).
even "management" of the L2 cache (and L1 "flush" commands) etc. should ideally be done by way of memory-addressable "registers".
or, we have a special side-channel wishbone bus that communicates "messages". basically it drops a pre-determined data structure (an nmigen Record) onto the side-channel WB bus at a known "address"; it gets routed to where it is needed, and the recipient decodes it as "commands".
but as far as communication with peripherals is concerned, they're *all* memory-addressable. all of them. any "setup" is done by way of "Registers"... which are also simply memory-addressable... over the WB memory bus.
for some of those, because they are I/O, caching is not permitted. so we have to be on the lookout for that.
they set the cache up to be a maximum of 2-way. however that's not the same as the number of read/write ports. arg.
ok i found a diagram which explains it:
it shouuuld be relatively straightforward to augment that to be dual-ported,
although let's not do that straight away.
hang on a minute...
if we use a very small but wide L1 cache, with the following characteristics, does it do the job?
* 8 sets
* 16 way set associative
* byte per way
the 16 bit mask *is* the 16 way selector
then i think, if you make a N-in/out multiplexor to write to each byte, the job's done.
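a quick toy model of that idea, in Python (class name and structure invented here; each "way" is really just one byte lane of the selected set, chosen directly by a bit of the 16-bit mask):

```python
class TinyWideL1:
    """Toy model of a tiny, wide L1: 8 sets, 16 bytes per set.

    The 16-bit unary mask *is* the way/byte selector: mask bit i
    enables byte lane i of the selected set, acting exactly like a
    per-byte write-enable line.
    """
    def __init__(self):
        self.sets = [bytearray(16) for _ in range(8)]
        self.tags = [None] * 8          # row address stored per set

    def write(self, row_addr: int, mask: int, data: bytes):
        s = self.sets[row_addr % 8]     # low 3 row bits pick the set
        self.tags[row_addr % 8] = row_addr
        for i in range(16):
            if (mask >> i) & 1:         # mask bit i enables byte i
                s[i] = data[i]

    def read(self, row_addr: int, mask: int) -> bytes:
        if self.tags[row_addr % 8] != row_addr:
            raise KeyError("cache miss")
        s = self.sets[row_addr % 8]
        # unselected bytes read as zero, just to make the mask visible
        return bytes(s[i] if (mask >> i) & 1 else 0 for i in range(16))
```

the N-in/out multiplexor per byte then corresponds to the per-lane selection in the write loop.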
I've been thinking that we would have a cache with a cache line size of 64 bytes (or 32 if we really have to).
we want the cache line size to be the same for all our caches since it greatly simplifies cache to cache transfers.
So, for a 32kB L1 cache, it would have 512 lines. If we had it be 8-way set associative, then there would be 64 sets and each set would have 8 lines.
Having a L1 cache much smaller than 16kB would be quite detrimental due to the excessive cache latency and lack of bandwidth to the L2.
I'm going to be unavailable till at least Sat night, then will work on publishing v0.2 of algebraics (updating transitive dependencies for simple-soft-float due to PyO3 v0.9 being released) then work on this.
(In reply to Jacob Lifshay from comment #10)
> I've been thinking that we would have a cache with a cache line size of 64
> bytes (or 32 if we really have to).
eek. ok. right. 64 / 16 = 4: that gives 4 stripes of 1R1W Memory inside the cache, each 16 bytes wide. that way we do not need multi-ported SRAM in the cache.
if we have byte enable lines the 16 bit mask can be passed to each byte.
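a minimal sketch of the striping, assuming addr bits 4-5 pick the stripe and each stripe is modelled as a dict standing in for a 1R1W SRAM with byte-enable lines (names are illustrative):

```python
def stripe_write(stripes, addr: int, mask: int, data16: bytes):
    """Write 16 bytes (with per-byte enables) into one of 4 stripes
    of a 64-byte cache line.  Each stripe is a plain 1R1W memory, so
    no multi-ported SRAM is needed; the 16-bit mask is routed to the
    byte-enable lines of the selected stripe only.
    """
    stripe = (addr >> 4) & 0x3          # addr bits 4-5 pick the stripe
    line = addr >> 6                    # 64-byte cache line address
    row = stripes[stripe].setdefault(line, bytearray(16))
    for i in range(16):
        if (mask >> i) & 1:             # byte-enable line i
            row[i] = data16[i]
```

up to four independent 16-byte accesses (one per stripe) can then proceed in the same cycle.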
> we want the cache line size to be the same for all our caches since it
> greatly simplifies cache to cache transfers.
> So, for a 32kB L1 cache, it would have 512 lines.
> If we had it be 8-way set
> associative, then there would be 64 sets and each set would have 8 lines.
4-way makes more sense (a little easier) as the byte-enable mask is 16 bits.
however 8-way works as well, because each 16-bit byte-enable mask is split in two.
> Having a L1 cache much smaller than 16kB would be quite detrimental due to
> the excessive cache latency and lack of bandwidth to the L2.
> I'm going to be unavailable till at least Sat night, then will work on
> publishing v0.2 of algebraics (updating transitive dependencies for
> simple-soft-float due to PyO3 v0.9 being released) then work on this.
the earlier idea of a tiny pre L1 cache may on reflection be unnecessary.
i drew out a diagram a couple days ago, it basically involved:
* a PriorityPicker to find the first entry in the AddressMatrix that is "live"
* that then says which entry is to be used to broadcast the high address bits (14 to 48) to all other live LD/STs
* bear in mind that all "live" addresses have had to calculate a comparison on bits 4 thru 13 of the address; we also include *these* (ANDed) in the comparison
* anything that succeeds in matching bits 4 to 13 from the AddressMatcher matrix *and* gives a hit on the Priority-Picked bits 14 to 48 has its 16-bit "mask" permitted to go through an ORer to the L1 cache byte-enable on the Memory.
a refinement of that is to allow up to *four* of those to be picked, one to be routed to each "way".
that is one of the things that i wrote the MultiPriorityPicker for.
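the broadcast-and-merge step above, sketched in Python; for simplicity this collapses the bits-4-13 matrix match and the bits-14-up broadcast comparison into a single high-bits compare (function name is made up):

```python
def pick_and_merge(live, addrs, masks):
    """Sketch of the PriorityPicker broadcast-and-merge:

    * priority-pick the first "live" LD/ST entry
    * broadcast its high address bits to all other live entries
    * OR together the 16-bit bytemasks of every entry that hits the
      same row (safe: zero bytemask overlap is guaranteed upstream)

    Returns (row_address, merged_byte_enable, indices_merged).
    """
    for idx in range(len(live)):        # PriorityPicker: first live wins
        if live[idx]:
            break
    else:
        return None                     # nothing live this cycle
    row = addrs[idx] >> 4               # high bits, broadcast to all rows
    merged_mask, hits = 0, []
    for j in range(len(live)):
        if live[j] and (addrs[j] >> 4) == row:   # address "hit"
            merged_mask |= masks[j]              # ORer into byte-enables
            hits.append(j)
    return row, merged_mask, hits
```

the four-way refinement would simply run this against four picked entries, one per "way".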
hmmm connected to LDSTCompAlu we need something that, after splitting the address and len into two, can "reconstruct" the LD/ST
i'm also going to suggest that we have a preanalysis phase on LD/STs for VL and make it SIMD-like, up to 64 bit.
so if a VL=4 LD comes in on a halfword operation it is turned into a 64 bit request.
the only thing to watch out for there is mem faults. the last 16 bits failing should *not* cause the other 3 16 bit LDs in that "batch" to fail.
we can incrementally add this as an optimisation, later
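a sketch of that pre-analysis, assuming power-of-2 element widths (function and parameter names invented for illustration):

```python
def batch_elements(vl: int, elwidth_bytes: int, max_bytes: int = 8):
    """Group VL element accesses into SIMD-like batches of up to
    max_bytes (64 bits): e.g. VL=4 halfwords -> one 8-byte request.

    Each batch records its element span, so a memory fault on one
    batch does not cause the other batches' elements to fail.
    Returns a list of (first_element, n_elements, n_bytes).
    """
    per_batch = max_bytes // elwidth_bytes
    batches = []
    for start in range(0, vl, per_batch):
        n = min(per_batch, vl - start)          # last batch may be short
        batches.append((start, n, n * elwidth_bytes))
    return batches
```

so a VL=4 halfword LD becomes a single 64-bit request, while VL=6 becomes a 64-bit plus a 32-bit one.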
(In reply to Luke Kenneth Casson Leighton from comment #12)
> hmmm connected to LDSTCompAlu we need something that, after splitting the
> address and len into two, can "reconstruct" the LD/ST
this basically adds 16 to the real address and masks off the top bytes that overrun a 16 byte cache line.
there need to be two latches which are in "passthrough" mode (in case the twin data is available immediately); if one half is not available, wait until it is before responding "ok".
also needs to pass through exceptions back to the LDSTCompAlu
the interesting thing is, the interface is almost certainly identical to the minerva LoadStoreInterface class.
putting together a basic LD/ST address-splitter. however it includes signalling which will connect directly to CompLDST.
basically, the address+len expands to a bit-mask, that could be mis-aligned,
so the "normal" half is sent truncated on a cache-line-boundary and the "top"
half is split into a *second* LD/ST.
what that means is that the LDSTCompUnit now needs to wait for *TWO* LD/STs
therefore, we need valid/data signalling for *two* LD/STs and, on successful
completion of *both*, the data needs "re-assembly" based on the *two* masks
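a software sketch of the split: expand addr+len to a bytemask, truncate at the 16-byte line boundary, and turn any overrun into a second LD/ST at real address + 16 (names illustrative):

```python
def split_misaligned(addr: int, length: int):
    """Expand addr+len to a bytemask and split on a 16-byte boundary.

    The "normal" half stays at the original row, truncated at the
    cache-line boundary; any overrun bytes become a *second* LD/ST
    at row+1 (i.e. real address + 16).  With length <= 8 the overrun
    half is at most 7 bytes wide.

    Returns a list of (row_address, 16-bit bytemask) - one entry for
    an aligned access, two for a misaligned one.
    """
    assert length in (1, 2, 4, 8)
    start = addr & 0xF
    full = ((1 << length) - 1) << start   # may exceed 16 bits
    lo = full & 0xFFFF                    # half truncated at the line
    hi = full >> 16                       # overrun into the next line
    row = addr >> 4
    parts = [(row, lo)]
    if hi:
        parts.append((row + 1, hi))       # the second, misaligned LD/ST
    return parts
```

on a LD, the LDSTCompUnit then waits for both masks to complete and re-assembles the data from the two halves.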
*facepalm*... i managed to get all the way to a successful unit test
before noticing that i'd set up the code to LD/ST *bits*... not bytes :)
am currently adding ST splitting: drawing this diagram helped:
started a discussion here, see what happens.
i will add this and a description here, shortly:
the idea here is to have 6 to 8 LD/ST Function Units, each with
two "ports" (one exclusively for misaligned LDs/STs), which means
that we then have 12 to 16 such "ports" trying to get access to
the L0 Cache/Buffer.
that will require a Multi-Priority-Picker ("and here's one we prepared earlier").
this can be set up to be 16-in and produce 4x unique exclusive 16-out
"selectors", none of which will overlap.
that in turn allows us to OR/MUX any one of those 16 LD/ST FUs to
any one of the 4 "ports" on the L0 Cache
from there, we can split into two halves using bit 4 as the "selector"
bit, and use bits 5 and above as the "Addr hit".
a priority picker then picks *one* row and broadcasts its address to
all other rows. if there is a "hit" (same address bits 5 and upwards)
then the data-select bytemask - and data itself - is "merged" into
the output as a "single cache line", to be sent directly to the L1 cache.
oh. one important thing that came out of the discussion on comp.arch:
the number of wires involved, if we are not careful, is MENTAL.
let's imagine we pass the expanded (shifted, masked) data and op-length
out from each FunctionUnit. the data, previously a maximum of 64 bit,
is now a full cache-line wide (because it has been shifted into position
for the cache write), and the operation, previously encoded as len=1,2,4,8
(we could use 2 bits for that) and in binary address form LSBs 0-3, is
now a 16 bit mask...
oh times two, one for the aligned and one for the misaligned LD/ST.
so that's 128 bits doubled to 256, plus 16x2=32 mask bits, equals 288 wires per LD/ST FU, of which
there will be 8.
that's a staggering 2,304 wires *just for the data*
if we do the splitting *after* the Muxing through the 4-port Muxer(s),
*even if the splitting was needed for input into the Address Clash Matrix*,
we can halve that number of wires to "only" around 1120.
kiinda important :)
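for the record, the arithmetic above, worked through (figures straight from the comment; 8 LD/ST FUs assumed):

```python
# Wire count if the data is pre-shifted to full cache-line width
# inside each Function Unit:
DATA_BITS = 128             # one 16-byte cache line, shifted into position
HALVES = 2                  # aligned + misaligned LD/ST halves
MASK_BITS = 16 * HALVES     # one 16-bit bytemask per half
PER_FU = DATA_BITS * HALVES + MASK_BITS   # 256 + 32 = 288 wires per FU
N_FUS = 8
TOTAL = PER_FU * N_FUS                    # 2,304 wires just for the data
```

doing the split *after* the 4-port muxing keeps each FU at 64-bit data width, which is where the roughly-halved figure comes from.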
I was thinking that the load/store FUs would just split and submit both halves to a queue-like structure, and the FU would just track both halves independently until they complete, then merge them. If a load/store doesn't need to be split, then it would only take one queue entry.
That queue can have all the dependency matrix and scheduling stuff be part of it and it only needs enough entries to keep the pipeline full -- it doesn't need double entries to handle split load/stores since the FUs will just stall if the queue is too full.
(In reply to Jacob Lifshay from comment #19)
> I was thinking that the load/store FUs would just split and submit both
> halves to a queue-like structure that and the FU would just track both
> halves independently until they complete, then merge them. If a load/store
> doesn't need to be split, then it would only take one queue entry.
yes - only submit one queue entry, however there's a second split, in order
to double the maximum throughput (without needing dual-ported L1 cache SRAM):
use bit 4 to select odd-even banks.
however that selection needs to be back at the start:
so we need *two* multi-in 4-out PriorityPickers, where the inputs are
actually done like this:
for i in range(16):
    if addr[i][4] == 0:  # bit 4 selects the even or odd bank
        multi_picker_even[i] = ldst_requested[i]
    else:
        multi_picker_odd[i] = ldst_requested[i]
*then* followed by:
mpick_even = MultiPriorityPicker(multi_picker_even, 4)
mpick_odd = MultiPriorityPicker(multi_picker_odd, 4)
*then* followed by:
for i in range(4):
    L0_cache_buffer_even.port[i] = mpick_even.out_en[i]  # each 16-bit wide
    L0_cache_buffer_odd.port[i] = mpick_odd.out_en[i]    # ditto
that L0_cache_buffer_odd/even, yes it can be a Queue, however a FIFO Queue
has a bit of a problem: you can't normally get access to the entries in it,
yet we need to *merge* entries (the bytemasks) that happen to have the same
address bits 5 and upwards.
> That queue can have all the dependency matrix and scheduling stuff be part
> of it and it only needs enough entries to keep the pipeline full -- it
> doesn't need double entries to handle split load/stores since the FUs will
> just stall if the queue is too full.
part of the discussion on comp.arch was that we *may* want to prioritise
aligned LDs/STs over misaligned ones.
this can easily be done by putting the misaligned LDs/ST requests at the
*end* of the PriorityPicker.
however, again, if there happens to be a massive sequential batch of
8-byte LDs that were accidentally offset by 4 bytes and are therefore
all misaligned, we do *not* want those misaligned LDs/STs to be
deprioritised, because they will actually merge quite nicely with the
other LDs/STs.
therefore (and i'm recommending we do this later) we actually need to
prioritise which entry should be sent from the L0 cache/buffer/queue
by the number of bits that "merge" at the bytemask level. this can be done
with a "popcount" (or pseudo-popcount) on every entry in the L0
cache/buffer/queue that detects, at the very minimum, which entries become
full cache lines after merging the byte-masks that share the same binary
address [bits 5 and upwards] - *not* by "whether it is first in the queue".
so ultimately, a queue (FIFO queue) is not the right data structure to use.
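a sketch of that popcount-based selection (function name invented; masks sharing a row are ORed together, then the fullest row wins):

```python
def best_merge_entry(entries):
    """Pick which row to send to the L1 by merge completeness, not
    queue order: OR together the bytemasks of all entries sharing a
    row address, then choose the row with the highest popcount.
    A fully merged cache line reaches popcount 16.

    `entries` is a list of (row_address, 16-bit bytemask) pairs.
    Returns (row_address, merged_mask) for the best row.
    """
    merged = {}
    for row, mask in entries:
        merged[row] = merged.get(row, 0) | mask   # masks never overlap
    # highest popcount wins - "first in the queue" is irrelevant
    return max(merged.items(), key=lambda kv: bin(kv[1]).count("1"))
```

in hardware the popcount would likely be an approximate/pseudo-popcount, as suggested above.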
if you'd like to write a round-robin Picker, great! :) however i don't
think it makes a huge amount of difference, initially, whether it's
Priority-picked or RR-picked. the PP and Multi-PP code exists :)
Wishbone as-is is not sufficient for handling the interconnect between the L1 and L2 caches, or to any other device that can act as a bus master and is cache coherent (such as other CPUs, OmniXtend, or cache coherent DMA). The reason is that it has no concept of ownership of a particular cache block (or of caching at all), which is needed to support a cache coherence protocol.
Wishbone also has the drawback of memory bandwidth being limited by the bus having a max width of 64 bits (ends up as 6.4GB/s peak theoretical bandwidth at 800MHz), we are likely to want more bandwidth between cores and the L2 cache.
Minerva's memory interface appears to not be designed to handle cache coherence at all, so would need to be redesigned to support that if we wanted to use Minerva's memory interface, which I don't think we should since we would need to rewrite the majority of Minerva's memory interface code.
Using wishbone for devices that don't need to be cache coherent and don't need large memory bandwidth (less than 1GB/s or so) is still a good idea.
AXI4 does not have cache coherency either, and yet coherency is still achieved: it is done by passing messages over the "control bus", using home-grown protocols in every case that i have examined (including ariane and OpenPITON).
we do not need to be fully compliant with the wishbone spec internally.
it is perfectly fine to modify the code so that it has the required bus bandwidth for internal use.
when external connections to other peoples' code is needed (peripherals) then and only then will we need to comply with the spec.
also, we're not doing an SMP core for the october 2020 deadline, because we are only doing a single core.
therefore L1 and L2 SMP cache coherency is not needed.
we *need* to urgently get moving on this, we are seriously behind.
the fastest way to do that is to not spend months writing code that already exists.
the fastest way to get the job done and meet that hard deadline is to only spend days or weeks using existing code.
minerva *already* uses wishbone.
in addition, there is the "soc" code from enjoy-digital (litex) which has testing infrastructure and connectivity to peripherals already done.
we *need* test infrastructure and peripheral connectivity.
we *cannot* spend weeks or months writing new test infrastructure that already exists. a completely new Bus Protocol (TileLink) would completely isolate and cut us off from that test infrastructure, because it is all written to use Wishbone.
after the October 2020 deadline, when SMP cache coherency is a goal, we can look at this again.
right now, we have to shelve SMP, and get the job done. it's only six months and at least two of those are going to be full-time working on the layout.
(In reply to Luke Kenneth Casson Leighton from comment #23)
> also, we're not doing an SMP core for the october 2020 deadline, because we
> are only doing a single core.
> therefore L1 and L2 SMP cache coherency is not needed.
Ok, we can write the code to basically have whatever cache coherency stuff gets written when getting the memory interface working and not worry about the L2 or other cores existing at all, treating it as a writethrough transparent cache. The design I'm working on is integrated closely enough with the L1 cache that it will still need at least a little of the coherence protocol to support atomics, but it can be built to have a WishBone interface adaptor.
(In reply to Jacob Lifshay from comment #24)
> (In reply to Luke Kenneth Casson Leighton from comment #23)
> > also, we're not doing an SMP core for the october 2020 deadline, because we
> > are only doing a single core.
> > therefore L1 and L2 SMP cache coherency is not needed.
> Ok, we can write the code to basically have whatever cache coherency stuff
> gets written when getting the memory interface working and not worry about
> the L2 or other cores existing at all, treating it as a writethrough
> transparent cache.
basically, yes. we're only going to be running at 25 mhz (possibly 100 mhz)
because there's no PLL, and no clock gating.
therefore yes even the L2 can be cut.
> The design I'm working on is integrated closely enough
> with the L1 cache that it will still need at least a little of the coherence
> protocol to support atomics,
remember, i said already that atomic memory operations are taken care of
by the Address / Memory Dependency Matrices.
they are already split out into individual (single) operations, each one
isolated from (batches of) LOADs and (batches of) STOREs.
so there's strictly speaking no need to implement any kind of coherence
protocol, because as things stand it's never going to be active.
yes it's over-zealous; it is however just a side-effect of how the
address-matching / memory-dependency matrices work.
> but it can be built to have a WishBone
> interface adaptor.
ok good. does it implement this algorithm?
L1 cache concept, go for it - that's really straightforward and standard,
although important to appreciate that 16 *byte* level read/write-enable lines
will come through from the Function Units.
if you can also do a non-standard-wishbone-128 to standard-wishbone-64
bit "funnel", that analyses the 16 byte-level read/write-enable lines,
splitting them into two halves covering the LO-64-bit and HI-64-bit
that would fit well.
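a software sketch of that funnel, assuming the 16 byte-enable lines split cleanly into LO-64 and HI-64 halves and that a half with no enables set is skipped entirely (names and that skip behaviour are my assumptions):

```python
def funnel_128_to_64(row: int, mask16: int, data: bytes):
    """Funnel one 128-bit-wide access with 16 byte-enable lines into
    up to two 64-bit Wishbone-style transactions (LO and HI halves).

    Returns a list of (word_addr, sel8, data8) tuples, one per half
    that actually has byte-enables set.
    """
    txns = []
    for half in range(2):
        sel = (mask16 >> (8 * half)) & 0xFF   # 8 byte-enables per half
        if sel:                               # skip an all-zero half
            txns.append((row * 2 + half,      # 64-bit word address
                         sel,                 # Wishbone-style SEL lines
                         data[8 * half:8 * half + 8]))
    return txns
```

an aligned 32-bit store then costs a single 64-bit transaction rather than always two.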
actually to be honest at the 128-bit-level it doesn't even need to be
Wishbone, which may make design easier.
another "easier" way may be to simply have *FOUR* 64-bit Wishbone interfaces:
* L1 left LO-64
* L1 left HI-64
* L1 right LO-64
* L1 right HI-64
then we "jam" those down onto a single 64-bit Wishbone interface
using a standard Wishbone Bus arbiter
L0 cache/buffer, give it a little time for some review / thought
as i just found a mistake in the L0 cache/buffer design and had
to adjust it:
(In reply to Luke Kenneth Casson Leighton from comment #26)
> another "easier" way may be to simply have *FOUR* 64-bit Wishbone
> * L1 left LO-64
> * L1 left HI-64
> * L1 right LO-64
> * L1 right HI-64
> then we "jam" those down onto a single 64-bit Wishbone interface
> using a standard Wishbone Bus arbiter
i was going to say that those could all be 4 independent L1 caches,
however both left ones and both right ones have the exact same address,
so it would be a waste of power.
*however*... the 16-bit byte read/write-enable lines back at L0 can be
split into two then ORed together to produce a *single* "LO-64" and "HI-64"
L1 cache line read/write.
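a minimal sketch of that OR-and-split, assuming the two ports' 16-bit byte-enables never overlap (per the zero-overlap guarantee earlier in the thread):

```python
def merge_port_enables(sel_a: int, sel_b: int):
    """Two L0 ports hitting the same row: OR their 16-bit byte
    read/write-enable masks, then split into LO-64 and HI-64 halves,
    yielding one LO and one HI L1 access instead of four.

    Returns (lo_sel8, hi_sel8).
    """
    sel = sel_a | sel_b                 # safe: overlaps guaranteed absent
    return sel & 0xFF, (sel >> 8) & 0xFF
```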
since i'm effectively stalled on the algorithm I've been working on (it should work, i'm just not good at writing it out properly) and the one luke's been working on should work well enough (bug 216, I think), i will be switching to just implementing that one.
If there's more time later, I can try again and reopen this bug.
if you can record some brief notes, after the oct 2020 deadline, or if we have less time pressure, it will be good to come back to this.
deferred for after oct 2020 tape out
idea review messages on list:
idea on wiki: