https://github.com/lambdaconcept/minerva looks really good: clean design, uses wishbone to the L1 caches and to the main core. this will help when it comes to adding SMP down the line. the decoder is also very clean.

the only non-obvious bit is how the core works. source/sink are created in class "Stage" (which is fine), and the inter-stage transfer layouts are the same (the source on the previous stage equals the sink on the next stage), which is obvious enough: the bit that's *not* obvious is what-gets-connected-to-what. i think this is because the "sinks" are set up at the start of core.py whilst the "sources" are set up much further down.

in the libre-soc pipeline code, the layouts are done via objects, and the modules "take care" of placing data into the "output" inherently. here, it's messy, and the separation makes understanding difficult. other than that, though, it's pretty damn good.
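to make the source/sink thing concrete, here's a rough nmigen sketch of the pattern (purely illustrative, not minerva's actual code, and it reuses one layout for both ends of a stage for brevity): each Stage owns a sink (its input) and a source (its output), and the core wires the previous stage's source to the next stage's sink -- that's the "what-gets-connected-to-what" part.

from nmigen import Elaboratable, Module
from nmigen.hdl.rec import Record

# illustrative inter-stage transfer layout (not minerva's)
_xfer_layout = [("data", 32), ("valid", 1)]

class Stage(Elaboratable):
    def __init__(self, layout):
        self.sink   = Record(layout)    # input from the previous stage
        self.source = Record(layout)    # output to the next stage

    def elaborate(self, platform):
        m = Module()
        # placeholder behaviour: pass data straight through
        m.d.comb += self.source.eq(self.sink)
        return m

class Core(Elaboratable):
    def __init__(self):
        self.decode  = Stage(_xfer_layout)
        self.execute = Stage(_xfer_layout)

    def elaborate(self, platform):
        m = Module()
        m.submodules.decode  = self.decode
        m.submodules.execute = self.execute
        # the previous stage's source feeds the next stage's sink
        m.d.comb += self.execute.sink.eq(self.decode.source)
        return m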
will need adjusting to make the datapath between the core and L1 wider -- 64-bit at the very least, 128-bit or wider preferred.
(In reply to Jacob Lifshay from comment #1)
> will need adjusting to make the datapath between the core and L1 wider --
> 64-bit at the very least, 128-bit or wider preferred.

yes. four LD/STs @ 32-bit is the minimum viable data width to the L1 cache, realistically; preferably four LD/STs @ 64-bit. this is a monster we're designing!

address widths also need to be updated: i'm going to suggest parameterising them, because we might not have time to do an MMU (compliant with the POWER ISA); we'll just have to see how it goes.
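to be concrete about the parameterisation, something along these lines (a minimal sketch: the names, the layout and the default widths are purely illustrative, not an existing interface):

from nmigen.hdl.rec import Record

def ldst_port_layout(addr_width=48, data_width=64):
    # addr_width is a parameter because the MMU question is still open;
    # data_width=64 matches "four LD/STs @ 64-bit" above
    return [("addr",  addr_width),
            ("we",    1),
            ("wdata", data_width),
            ("rdata", data_width),
            ("valid", 1),
            ("ready", 1)]

# four such ports gives the minimum viable width to the L1
ports = [Record(ldst_port_layout(), name=f"ldst{i}") for i in range(4)]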
do note that compressed texture decoding needs to be able to load 128-bit wide values (a single compressed texture block), so our scheduling circuitry should be designed to support that. They should always be aligned, so we won't need to worry about that in the realignment network.
(In reply to Jacob Lifshay from comment #3)
> do note that compressed texture decoding needs to be able to load 128-bit
> wide values (a single compressed texture block),

okaaay.

> so our scheduling circuitry
> should be designed to support that. They should always be aligned, so we
> won't need to worry about that in the realignment network.

whew.

so that's 128-bit-wide for _textures_... that's on the *load* side. are there any simultaneous (overlapping) "store" requirements? are the code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
> > do note that compressed texture decoding needs to be able to load 128-bit
> > wide values (a single compressed texture block),
>
> okaaay.
>
> > so our scheduling circuitry
> > should be designed to support that. They should always be aligned, so we
> > won't need to worry about that in the realignment network.
>
> whew.
>
> so that's 128-bit-wide for _textures_... that's on the *load* side. are
> there any simultaneous (overlapping) "store" requirements? are the
> code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?

yes and no -- there is code that will benefit from simultaneous loads and stores (memcpy, and probably most other code that has both loads and stores in a loop), however it isn't strictly necessary.

It will be highly beneficial to support multiple simultaneous 8, 16, 32, or 64-bit loads to a single cache line all being able to complete simultaneously, independently of their alignment within that cache line. The same goes for misaligned loads that cross cache lines (and possibly page boundaries), though those don't need to complete in a single cache access. All of the above also applies to stores, though they can be a little slower since they are less common. I realize that will require a really big realignment network, however I think the performance advantages are worth it.

For scheduling loads that are ready to run (i.e. the 6600-style scheduler has sent them to the load/store unit for execution, and there are no conflicting stores or memory fences in front of them), we can have a queue of memory ops. Each cycle we pick the load at the head of the queue, then search from head to tail for additional loads that target the same cache line, stopping at the first memory fence, conflicting store, etc. Once those loads are selected, they are removed from the queue (probably by marking them as removed) and sent through the execution pipeline. We can use a similar algorithm for stores. (A rough software model of this selection step is below.)

To find the required loads, we can use a network that recursively summarizes chunks of the queue entries' per-cycle ready state, then reverses direction from the summary back down to the queue entries to tell each entry which execution port, if any, it will be running on this cycle. There is then a mux for each execution port in the load pipeline to move the required info from the queue entry to the pipeline. The network design is based on the carry-lookahead network of a carry-lookahead adder, which allows taking O(N*log(N)) space and O(log(N)) gate latency.

Loads/stores that cross a cache-line boundary can be split into 2 load/store ops when sent to the queue, with the two halves reunited when both complete. They should be relatively rare, so we can probably support reuniting only 1 op per cycle.

RMW atomic ops and fences can be put in both the load and store queues; they are executed once they reach the head of both queues.
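Here's the rough software model of the selection step mentioned above (plain Python, not RTL; MemOp, CACHE_LINE_BYTES and NUM_LOAD_PORTS are illustrative assumptions, not anything that exists yet):

from dataclasses import dataclass
from typing import List

CACHE_LINE_BYTES = 64   # assumption for illustration
NUM_LOAD_PORTS   = 4    # loads issued to the L1 per cycle (assumption)

@dataclass
class MemOp:
    is_store: bool
    is_fence: bool
    addr: int
    width: int            # access width in bytes (1/2/4/8, 16 for texture blocks)
    done: bool = False    # "removed" from the queue by marking

def line_of(addr: int) -> int:
    return addr // CACHE_LINE_BYTES

def select_loads(queue: List[MemOp]) -> List[MemOp]:
    """pick the load at the head of the queue plus any later loads to the
    same cache line, stopping at the first fence or conflicting store."""
    live = [op for op in queue if not op.done]
    if not live or live[0].is_store or live[0].is_fence:
        return []                          # head is not a load ready to issue
    target = line_of(live[0].addr)
    picked: List[MemOp] = []
    for op in live:
        if op.is_fence:
            break                          # memory fence: stop the scan
        if op.is_store and line_of(op.addr) == target:
            break                          # conflicting store: stop the scan
        if not op.is_store and line_of(op.addr) == target:
            picked.append(op)
            if len(picked) == NUM_LOAD_PORTS:
                break
    for op in picked:
        op.done = True                     # mark as removed, as described above
    return picked

# example: the first two loads hit the same cache line and are picked
# together; the store to that line blocks the load behind it
q = [MemOp(False, False, 0x1000, 8), MemOp(False, False, 0x1008, 8),
     MemOp(True,  False, 0x1010, 8), MemOp(False, False, 0x1018, 8)]
assert len(select_loads(q)) == 2

The hardware version of "search from head to tail" is where the carry-lookahead-style summary network comes in: the per-entry ready bits are what get recursively summarized so the port assignment resolves in O(log(N)) gate delay rather than a linear scan.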