http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5 discussion and advice from comp.arch https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE
(In reply to Jacob Lifshay from http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5)

> (In reply to Luke Kenneth Casson Leighton)
> > whew.
> >
> > so that's 128-bit-wide for _textures_... that's on the *load* side. are there any simultaneous (overlapping) "store" requirements? are the code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?
>
> yes and no -- there is code that will benefit from simultaneous loads and stores (memcpy and probably most other code that has both loads and stores in a loop), however it isn't strictly necessary.

ok.

> It will be highly beneficial to support multiple simultaneous 8, 16, 32, or 64-bit loads to a single cache line all being able to complete simultaneously independently of alignment in that cache line.

yes, i have been thinking about this for a day since getting the LDSTCompUnit operational. it currently uses a "fake" memory block (for testing) and the next step is to look at this.

> Also misaligned loads that cross cache lines (and possibly page boundaries), though those don't need to complete in a single cache access.

ok

> All the above also applies to stores, though they can be a little slower since they are less common.
>
> I realize that that will require a really big realignment network, however the performance advantages I think are worth it.

yes. there is no point in having high-performance computation if we cannot get the data in and out equally fast.

> For a scheduling algorithm for loads that are ready to run (6600-style scheduler sent to load/store unit for execution, no conflicting stores in-front, no memory fences in-front), we can have a queue of memory ops and each cycle we pick the load at the head of the queue and then search from the head to tail for additional loads that target the same cache line stopping at the first memory fence, conflicting store, etc. Once those loads are selected, they are removed from the queue (probably by marking them as removed) and sent thru the execution pipeline.
>
> We can use a similar algorithm for stores.

right. ok. thanks to Mitch Alsup, i have a Memory Dependency matrix that takes care of separating the loads and stores into batches and preserving the order of those batches. it could, if we wanted, do not only TSO: it can handle cross-SMP in a way that makes atomic memory either entirely moot or dead trivial. this i need a little more research on.

anyway the point is: LOADs as a batch are already identified and hold up any STOREs, and vice-versa.

> To find the required loads, we can use a network based on recursively summarizing chunks of the queue entries' per-cycle ready state, then reversing direction from the summary back to the queue entries to tell the entries which, if any, execution port they will be running on this cycle. There is then a mux for each execution port in the load pipeline to move the required info from the queue to the pipeline. The network design is based on the carry lookahead network for a carry lookahead adder, which allows taking O(N*log(N)) space and O(log(N)) gate latency.

with the Memory DMs already in place, taking care of separating LOAD from STORE batches, would it be necessary to do that? (i don't know the answer). also the ordering is not important (Weak Memory Model) and so a buffer will *only* have LOADs in, or STOREs, but not both, and it becomes possible to analyse the buffer and see if a batch is present.

actually if we had a (really) small CAM that mirrored the way the L1 cache worked, that might work. 2 maybe 4 lines or something.
then you can detect when the lines are fully occupied... hmmm needs thought

> Loads/Stores that cross a cache boundary can be split into 2 load/store ops when sent to the queue and loads reunited when they both complete. They should be relatively rare, so we can probably support reuniting only 1 op per cycle.

yes.

> RMW Atomic ops and fences can be put in both load and store queues where they are executed once they reach the head of both queues.

luckily, Mitch told me that for these you simply set *both* the MemRead *and* the MemWrite hazard, and the result, fascinatingly, is that they have to be atomic: they lock out all other LDs and STs and, because of the way the DMs work, the order is preserved as well. so there will *be* no overlap between atomics, LDs, or STs.

now, whether these all go into the same queue, i don't know. the main thing is, if we have a queue it basically has to be not just a queue: it has to be a cache as well.
(In reply to Luke Kenneth Casson Leighton from comment #1) > (In reply to Jacob Lifshay from > http://bugs.libre-riscv.org/show_bug.cgi?id=215#c5) > > For a scheduling algorithm for loads that are ready to run (6600-style > > scheduler sent to load/store unit for execution, no conflicting stores > > in-front, no memory fences in-front), we can have a queue of memory ops and > > each cycle we pick the load at the head of the queue and then search from > > the head to tail for additional loads that target the same cache line > > stopping at the first memory fence, conflicting store, etc. Once those loads > > are selected, they are removed from the queue (probably by marking them as > > removed) and sent thru the execution pipeline. > > > > We can use a similar algorithm for stores. > > right. ok. thanks to Mitch Alsup, i have a Memory Dependency matrix that > takes care of discerning and preserving the loads and stores into batches. > we could if we wanted to do not only TSO, it can handle cross-SMP in a way > that makes atomic memory either entirely moot or dead trivial. this i need > a little more research on. > > anyway the point is: LOADs as a batch are already identified and hold up any > STOREs, and vice-versa. Ok. The algorithm I proposed can still be used for scheduling loads/stores inside each batch and deciding which cache line to access each cycle, since I think the memory dependency matrix currently handles deciding which ops to execute each cycle by relying on a slow and simple FIFO algorithm. The queue in the algorithm I proposed can be the same HW as the memory dependency matrix, so it would end up as a hybrid algorithm. The recursive summary mechanism inspired by carry lookahead will still be useful.
(In reply to Jacob Lifshay from comment #2)

> Ok. The algorithm I proposed can still be used for scheduling loads/stores inside each batch and deciding which cache line to access each cycle, since I think the memory dependency matrix currently handles deciding which ops to execute each cycle by relying on a slow and simple FIFO algorithm.

nono, it's a bit-level matrix, capable of multi-issue execution and of resolving all conflicts and dependencies in a single cycle. the ordering is preserved by cascading bits being raised, just like in the Register-style DMs. so yes, it has the same *characteristics* as a FIFO, but it is not a sequential binary-numbered array: it is *unary* numbering, and the bits (SR Latches) specify "row X is in front of column Y". actually, "row X is a memory LD hazard on column Y".

> The queue in the algorithm I proposed can be the same HW as the memory dependency matrix, so it would end up as a hybrid algorithm.

i am very wary of making any kind of changes to the Dependency Matrices. it took months to get an understanding of them, enough to implement correctly. using them as-is, as a component: no problem. modifying them? no - not at this stage, we simply haven't got time.

> The recursive summary mechanism inspired by carry lookahead will still be useful.

can you do this piece, if we spec it up? i can point you at the code it needs to connect to (minerva L1 cache and LDSTCompUnit).
(In reply to Luke Kenneth Casson Leighton from comment #3)

> i am very wary of making any kind of changes to the Dependency Matrices. it took months to get an understanding of them, enough to implement correctly.

(and it was 10 months ago. sigh)

the load/store system is a *multi*-step process: there is the "normal" Register-FU and FU-FU hazard system (which takes care of the "usual" register-based dependencies that all ALUs have, including the Branch Unit), *and* there are the inter-address LD/ST hazard Dependency Matrices. the memory-function-unit matrices are augmented versions of (they differ from) the "normal" register-style DMs in that they recognise 10 bits of the address as a way to see if there was a conflict, the logic deliberately being "overzealous". you remember we discussed that last year, jacob, because 10 bits happens to hit the 1 MEGABYTE boundary of Vulkan Texture blocks?

what i considered doing was having, instead of a straight 10-bit address-line comparator, shifting that up by 4 bits and having a *bit-map* comparator for the 4 LSBs. in other words a byte LOAD would set 1 bit in the 16 bits, a half-word 2 bits, and so on. (actually, to cover misaligned LOADs it would need to be a minimum of 23 bits and/or split into *two* separate LOADs).

let us assume that the bitmap width is exactly the same as the cache line width (16 bits representing 16-byte-wide cache lines). this bitmap - if filled by multiple LOADs - can have a popcount carried out on it, and that tells us which "line" in the buffer has the highest number of bytes with LOADs to perform. if the entire bitmap is all 1s, that is a full cache line of LOADs.
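to make the bitmap-plus-popcount idea concrete, here is a minimal plain-python behavioural sketch (not nmigen, and not the actual DM code): LINE, byte_mask and merge_and_pick are invented names, and a 16-byte cache line is assumed.

    # Behavioural sketch of the byte-bitmap idea above (illustrative only).
    # Each LD is turned into a 16-bit mask of the bytes it covers within a
    # 16-byte cache line; masks hitting the same line are ORed together and
    # a popcount picks the line with the most bytes outstanding.

    LINE = 16  # assumed cache-line width in bytes

    def byte_mask(addr, length):
        """bitmap of bytes covered within the access's cache line (aligned case)"""
        return (((1 << length) - 1) << (addr % LINE)) & ((1 << LINE) - 1)

    def merge_and_pick(accesses):
        """accesses: list of (addr, length); returns (line_number, merged_mask)
        for the line with the highest popcount of outstanding bytes"""
        lines = {}
        for addr, length in accesses:
            line = addr // LINE
            lines[line] = lines.get(line, 0) | byte_mask(addr, length)
        best = max(lines, key=lambda l: bin(lines[l]).count("1"))
        return best, lines[best]

    # two word LDs plus a dword LD in the same line merge to a full line:
    # merge_and_pick([(0x100, 4), (0x104, 4), (0x108, 8)]) -> (0x10, 0xffff)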
hm. misaligned is tricky. am just reading POWER 3.0B, Book III, 6.5.8 (page 1074): it says that quad-word (lq, stq) is allowed to raise an "alignment" exception. there are several others as well (including when write-through is required or caching is inhibited). however, at least if we don't have to deal with 8-byte misalignment, that makes things a *lot* simpler.
ok i think i have a data structure that will help. it's a (very small) CAM, with extra "things". also it critically depends upon a pre-analysis of the LD/ST operation, breaking it down from this form, where unary is defined as (1<<N):

    FunctionUnit#   Addr   Length

into:

    unary(FU#){0}   Addr[4:]     unary(Addr[0:3])        (start=0, mask=000bbbb)
    unary(FU#){1}   Addr[4:]+1   unary(Addr[0:3]&mask)   (start=M, mask=bbb0000)

when a write is misaligned across a cache line, it results in the *second* operation (unary(FU#){1}) being activated, where there would be two separate and distinct LOADs / STOREs coming from *one* FunctionUnit.

the (start,mask) is so as to be able to shift and mask the data as it comes in (or out) of the register. this is because, when it's split into two, obviously you have to know which portion of each split covers which part of the data register.

let us assume that there are 4 LOAD/STORE Function Units. there is a table (a CAM) with **EIGHT** rows, with *unary* row-matching by:

    unary(FU#0){0}  unary(FU#0){1}
    unary(FU#1){0}  unary(FU#1){1}
    unary(FU#2){0}  unary(FU#2){1}
    unary(FU#3){0}  unary(FU#3){1}

and in each column there is:

* MSBs of address - address[4:]
* *bitmask* of bytes covered by address[0:3]
* (up to) 128 bit data to be LD/ST'd

basically it's a very very small L0 cache, and because it's so small we can do row-addressing in *unary*. the key to resolving misalignment is that initial break-down of the "original" LD/ST into a *PAIR* (plus a bitmask).

the bit that is a bit... awful is: there will be *four* Load-Store Units creating and contending for up to *EIGHT* simultaneous reads/writes to this CAM, worst-case (every single one of the operations crosses a cache-line boundary). that's far too much. i think it is more reasonable to limit that to only four, however even a four-way port-contention system is still one hell of a lot.

writing to this store/buffer/CAM: with four ports (four ways in CAM terminology) any one LOAD/STORE will be able to write to two separate addresses in the CAM AND up to *four* LOAD/STORE Units will be able to read/write to the *same* address (same row). the Memory itself will need individual byte-level enable lines.

then, the "flush" from this small CAM will be able to read/write entire lines out to the L1 Cache. each line will kinda be like one single-entry FIFO.
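a rough behavioural model of the pre-analysis step just described: one LD/ST becomes at most a pair of per-line entries, each with its own byte-mask. purely a sketch with invented names (split_ldst), again assuming 16-byte rows.

    # Sketch of the LD/ST pre-analysis above: one (FU#, addr, length) operation
    # becomes one entry per cache line it touches, each with its own byte-mask.
    # Names and layout are illustrative, not the actual CAM implementation.

    LINE = 16  # assumed row / cache-line width in bytes

    def split_ldst(fu, addr, length):
        """returns [(row_key, line_addr, byte_mask), ...] - one entry, or two
        when the access crosses a cache-line boundary (misaligned)"""
        offset = addr % LINE
        full = ((1 << length) - 1) << offset   # up to 2*LINE bits wide
        lo = full & ((1 << LINE) - 1)          # bytes falling in the first line
        hi = full >> LINE                      # spill-over into the next line
        entries = [((fu, 0), addr // LINE, lo)]
        if hi:                                 # second, "misaligned" entry
            entries.append(((fu, 1), addr // LINE + 1, hi))
        return entries

    # aligned dword from FU#2:   split_ldst(2, 0x1000, 8) -> [((2, 0), 0x100, 0x00ff)]
    # dword crossing a line:     split_ldst(3, 0x100d, 8) -> [((3, 0), 0x100, 0xe000),
    #                                                         ((3, 1), 0x101, 0x001f)]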
jacob, could i ask you the favour of pseudo-coding up the algorithm you described? bear in mind, the DMs will have taken care of separating out all LOADs, all STOREs, and all AMOs into batches. then we can do a comparison.
(In reply to Luke Kenneth Casson Leighton from comment #7) > jacob, could i ask you the favour of pseudo-coding up the algorithm you > described? bear in mind, the DMs will have taken care of separating > out all LOADs, all STOREs, and all AMOs into batches. > > then we can do a comparison. Yeah, I can do that.
here's some links to existing code: the MemFunctionUnits: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/memfu.py;hb=HEAD this is the "top level" of the mem-matrices, so quite boring still boring: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;hb=HEAD ok *now* we are cooking: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;hb=HEAD however this is really the key bit: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;hb=HEAD that is where you get a Pascal's Triangle of address-comparators *only in the bottom 10 bits*. the actual algorithm is described in Mitch's book chapters, 11.4.12 i'll attach screenshots for the diagrams in a second. the code i would like to connect to is here: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/minerva/units/loadstore.py;hb=HEAD CachedLoadStoreUnit and here is where the LOADs/STOREs actually get fired off: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py;h=206f44876b00b6c1d94716e624a03e81208120d4;hb=HEAD#l277 here, rather than have that code *arbitrarily* fire off the LD/ST, and *assume* it will complete in a single cycle: * stwd_mem_o must be PREVENTED from asserting UNTIL it is absolutely known that the store will succeed * load_mem_o must likewise be prevented from asserting until the load is known to succeed thus, it is perfectly fine for LOAD (or STORE) to take more than one cycle (just like any other Function Unit). yes, i know, if the LD/ST cannot complete (raises an exception) this is a type of "result", which must result in the cancellation of the instruction (and all downstream instructions), and an exception raised. this is TODO however it is important that the information propagate back through the LOAD/STORE buffer as a potential "result": "LOAD could not take place due to exception".
Created attachment 36 [details] screenshot of memory dependency cell this cell is the basis of the LOAD-STORE hazard detection. its design and purpose is to separate out sequences of LOADs from sequences of STOREs as well as separate out "atomic" operations which are defined as both LOAD *and* STORE being activated (requested) at the same time.
Created attachment 37 [details] screenshot of memory dependency matrix this screenshot simply shows a group of Memory Dependency Cells.
jacob, i'm extending the LD/ST-address-conflict matrix to unary-overlap in the last 4 LSBs of the address. a unary map (LD byte == 0b0000 0000 0000 0001, LD dword = 0b0000 0000 1111 1111), then shifted up by the LSBs of the address, and split into two (if there is a misalignment), means that two bit-level comparisons detect if the LDs overlap in the last 4 bits of the address.

the current system already compares bits 4 thru 11 as a *binary* address. by combining with the unary system it gives a full, guaranteed detection of overlap, based not just on the address but on the length of the operation.

however that leaves the top bits (11 thru 39 or 48, whatever we decide the VM page size to be), and that can be taken care of by a 4 or 8 entry mini-L1-cache-aka-LD/ST-buffer. what is nice is that the mini-L1-cache can:

a) use the fact that the address-conflict matrix has *already* detected (compared) addresses in bits 0-11
b) find those addresses which *are* equal (in bits 4 and above)
c) make sure the upper bits (11+) are also equal
d) use a merge of the *unary* maps of bits 0..3 to detect which LDs/STs are covering the same L1 cache line, for all addresses detected in (b+c)

the unary map "merge" is where a popcount will tell us which is the best one to put straight through to the L1 cache. if all 16 bits (when merged) are set, we know that's a full cache line.

the nice thing is, it's not actually that complicated. we don't want to make the *entire* L1 cache work like this (hence a mini-pre-L1-cache) because it would mean a popcount function on absolutely every single line of the L1 cache, and a lot more besides.

plus, an "optimisation" is that for those address bits 4-11, because we've already compared them, instead of doing the comparison again we can use the *conflict matrix row number* as a substitute. i think. i have to check that.
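the combined check then reduces to something this small (a sketch only: line_addr stands for the binary part, bits 4 and up; mask is the 16-bit unary byte-map):

    # Overlap test as described: binary compare on the upper address bits,
    # unary (bit-wise AND) compare on the byte-maps of the bottom 4 bits + length.
    def ldst_overlap(line_addr_a, mask_a, line_addr_b, mask_b):
        return (line_addr_a == line_addr_b) and (mask_a & mask_b) != 0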
could everyone just have a look at this and see if i've got it right:

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=0312504eb818c5e77d8a85d3948f733663b34b2e;hb=182e4cd214022a8ce0a498c2ceeca223a282f329

the idea is as described: turn the last 4 bits of the address, plus the length (1, 2, 4, 8 bytes), into a bitmap equal in length to a cache line (actually, 2 cache lines), i.e. 2x 16 bits. then you do a *unary* bit-wise compare on those, but a *binary* compare on the remainder of the address.

the tricky bit is that the roll-over (misaligned on a 16-byte boundary) will need *three* compares:

* address-equal: compare 1st cache line (16 LSBs of the bitmap)
* 1st address PLUS ONE equal to *2nd* address:
  - compare 8 bytes of 1st address's SECOND cache line (bits 16-23) with first 8 bytes of 2nd address's FIRST cache line (bits 0-7)
* reverse - *2nd* address PLUS ONE equal to 1st address:
  - likewise reverse on bitmap.

confusing as hell, hence why it needs checking.
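for cross-checking addr_match.py, a plain-python reference of that three-compare scheme might look like this (a sketch under the 16-byte-line assumption, using one 32-bit double-width bitmap per access; names invented):

    # Reference model for the twin-cache-line compare above.  Each access is
    # (line_addr, bitmap), where bitmap bits 0-15 cover line_addr and bits
    # 16-31 are the spill-over into line_addr+1.  (In the real code only bits
    # 16-23 of the upper half can ever be set, since a misaligned access
    # spills at most 7 bytes.)
    def twin_overlap(a, b):
        la, ma = a
        lb, mb = b
        # same line: compare the two 16-bit LSB halves directly
        if la == lb and (ma & mb & 0xffff):
            return True
        # a's spill-over line (la+1) against b's first line
        if la + 1 == lb and ((ma >> 16) & mb & 0xffff):
            return True
        # the reverse: b's spill-over line against a's first line
        if lb + 1 == la and ((mb >> 16) & ma & 0xffff):
            return True
        return False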
ok i have a pair of classes now:

* LDSTSplitter - this is designed to connect directly to LDSTCompUnit
* TwinPartialAddrBitmap - designed to be connected to LDSTSplitter

the idea is:

* misaligned LDs/STs that cross cache-line boundaries get "split" into a *pair* of LD/ST addresses plus a corresponding "byte-map".
* these "byte-maps" are based on the bottom LSBs of the address *and* the length (byte-LD = len=1 ==> 0b1, halfword-LD = len=2 ==> 0b11), which is then *SHIFTED UP* by the bottom LSBs
* any byte-map that goes over a cache line results in the SECOND of the two addresses "activating" from the LDSTSplitter
* key thing to note: the bytemaps *already encode* the position in the L1 cache, and can be used directly as "byte read/write-enable" lines on the underlying Memory. no further shifting or offsetting is required at the Cache: the bytemaps can be wired *directly* into nmigen Memory as byte-enables.
* TwinPartialAddrBitmap takes *pairs* of addresses (plus length bytemaps) and identifies which addresses are clashing. there are extra addresses included (actually, excluded) because only the LSBs are checked: this is fine.

important to note: LDSTSplitter has clock-synchronous signalling with a combinatorial bypass on it. in other words, if the downstream LDs/STs signal that they are ready immediately, LDSTSplitter will *also* respond immediately (in that same clock cycle). however if one of the LDs/STs is not ready yet (a cache miss, for example) then LDSTSplitter will take note of that and wait until the downstream sources set their "valid" flag(s). errors are simply merged. handling this properly is still TODO.
https://ascslab.org/research/briscv/downloads.html i wonder... i wonder if it would work if we had multiple L1 caches, with an N-way crossbar on each (routing to be organised by binary address bits [4:4+log2(N)-1]) then have a L2 cache underneath that? or, just actually have multiple L1 caches with a coherent L2 cache underneath?
http://sadve.cs.illinois.edu/Publications/isca00.pdf this paper describes how to partition a cache so that media workloads do not completely trash the L1 cache. we will almost certainly need to implement this because GPU workloads are READ-PROCESS-WRITE, and even having the data in the L1 cache *at all* is completely pointless as it's never going to be accessed more than once.
If we're having multiple L1 caches, we should implement them to appear as if there is a single L1 cache -- no extra instructions are needed to flush one cache before accessing the other, this greatly simplifies the programming model and Kazan currently depends on the memory having a normal programming model (no cache flushes required for an operation to be visible on the same thread). For GPU workloads, they still have some accesses (the majority for a tiled architecture -- what Kazan is) that benefit from a L1 cache. basically, it splits the screen into small squares/rectangles (bins) and accumulates a list of rendering operations (binning) then for each square/rectangle (bin) it renders all the operations inside it in sequence, so the pixels are repeatedly stored and loaded until that bin is done rendering. Texture access also benefits from a cache since nearby pixels are accessed together and often the accesses are overlapping.
(In reply to Jacob Lifshay from comment #17) > If we're having multiple L1 caches, we should implement them to appear as if > there is a single L1 cache -- no extra instructions are needed to flush one > cache before accessing the other, yes, agreed. it appears, from the advice on comp.arch, that it is normal to "stripe" the L1 cache, using e.g. bit 4 of the address (when using 128 bit cache lines) to select odd or even halves. if there are two separate LD/ST buffers, one for each half, this doubles the bestcase data throughput at a cost of 40% misses in the general case (due to each half being, duh, half the size). however yes when it comes to flushing the cache these halves can, must and will appear to be coherent as if they were a single L1 cache. > For GPU workloads, they still have some accesses (the majority for a tiled > architecture -- what Kazan is) that benefit from a L1 cache. basically, it > splits the screen into small squares/rectangles (bins) and accumulates a > list of rendering operations (binning) then for each square/rectangle (bin) > it renders all the operations inside it in sequence, so the pixels are > repeatedly stored and loaded until that bin is done rendering. > > Texture access also benefits from a cache since nearby pixels are accessed > together and often the accesses are overlapping. ok. ok so those 1 MB texture buffers you mentioned before are accessed multiple times in near-succession?
(In reply to Luke Kenneth Casson Leighton from comment #18) > (In reply to Jacob Lifshay from comment #17) > > For GPU workloads, they still have some accesses (the majority for a tiled > > architecture -- what Kazan is) that benefit from a L1 cache. basically, it > > splits the screen into small squares/rectangles (bins) and accumulates a > > list of rendering operations (binning) then for each square/rectangle (bin) > > it renders all the operations inside it in sequence, so the pixels are > > repeatedly stored and loaded until that bin is done rendering. > > > > Texture access also benefits from a cache since nearby pixels are accessed > > together and often the accesses are overlapping. > > ok. > > ok so those 1 MB texture buffers you mentioned before are accessed multiple > times in near-succession? Yes. Though several different textures can be accessed in an interleaved fashion.
the other thing, an aspect of the memory hazard and addr matching matrices: * loads are separated completely from stores * stores likewise * atomic memory operations likewise (by marking them as both LD *and* ST) * the *batch* order is what is preserved * only one atomic ever gets sent through (batch of QTY 1). * due to the expansion of the last 4 bits into a bitmap as part of addr_match, no operations in any one "batch" will have overlapping bytes. ever. therefore there is no need - at all - to preserve "order" in any way shape or form (hence why i said that we do not need a "queue" per se, we need more a L0 Cache/buffer). (this wording - L0 cache/buffer - came out of the discussion on comp.arch with Ivan. 8 entries is likely to be sufficient for us) to emphasise: * *at no time* will the DMs put out a mixed batch of LDs, STs or Atomics. * acknowledging, clearing and completing each batch before moving onto the next is absolutely critical. * except for IO and Atomics (which will only be sent down one at a time anyway) the order of LDs (or STs) can be entirely arbitrary. also we can get away with full IO memory order preservation by treating them as "Atomic". this will result in them only being sent one at a time, in strict order. if the exact same dependency system were used at the L2 layer we could do full SMP TSO with no difficulty.
(In reply to Jacob Lifshay from comment #19)

> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > ok so those 1 MB texture buffers you mentioned before are accessed multiple times in near-succession?
>
> Yes. Though several different textures can be accessed in an interleaved fashion.

ok that's fine. so what will happen is this: because the addr match is on the last 4k (11 bits), anything that has the same 4k LSBs, even though the MSBs do not match, gets delayed a couple of cycles; after the false positive clears through the memory system, the "not-actually-conflicting" LDST's turn will come next. Mitch's analysis showed that in practice this is such a rare occurrence that the penalty is not worth being concerned about.
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ i just updated this page, a new section "L0 Cache/Buffer". it contains a diagram i did a couple weeks ago: https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-mem_l0_to_l1_bridge.png the actual algorithm is incredibly simple: * priority picker picks one (and only one) address (per L0 cache/buffer) * for all rows greater than the one picked, match against all MSBs of the address, bits 5 and above. * all matching rows, OR the 16-bit bitmap representing {LSBs 0-3 plus LD/ST-len} those ORed bitmaps become the "byte read/write-enable" lines on a given L1 cache line. that's it - that's all there is to it. the "complex" bit is the N-in/out multiplexing from 16-in on the 8 LD/ST FunctionUnits (2 ports per FU because 1 is for "normal" requests and the 2nd is for misaligned addresses) however i just realised that if we can accept an increase in size of the L0 cache/buffer from 8 to 16 entries - or to limit the number of LD/ST Function Units to 6 - then we can instead simply have one *dedicated* "entry" for each and every FU, and the entire MASSIVE 16-to-4 multiplexer completely disappears. i'll draw it out. the caveat: we now have a 16-entry (or 12-entry) L0 Cache/Buffer, which unfortunately requires up to a 16-way CAM on the *ENTIRE* address from bits 5 and upwards. it *might* be possible to inherit (yet again) from the addr_match.py classes, which already has a full triangular comparison of all-against-all in bits 4 thru 11.
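a behavioural sketch of those three steps (priority-pick, match the upper address bits, OR the byte-enables) - plain python, invented names, operating on a list of L0 rows:

    # Sketch of the L0 merge algorithm above.  Each row is (valid, addr_msbs,
    # bytemask) where addr_msbs is the address from bit 5 upward and bytemask
    # is the 16-bit {LSBs 0-3 + LD/ST-len} expansion.  Illustrative only.
    def pick_and_merge(rows):
        # 1) priority-pick the first valid row
        for i, (valid, msbs, mask) in enumerate(rows):
            if valid:
                break
        else:
            return None, 0                      # L0 buffer idle
        merged = mask
        # 2) for all rows after the picked one, compare the address MSBs
        for valid, msbs, bytemask in rows[i + 1:]:
            if valid and msbs == rows[i][1]:
                merged |= bytemask              # 3) OR matching byte-enables
        return rows[i][1], merged               # one L1 line, merged enables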
https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png

ok done. it works like this:

* each FU (0-7) produces 2 LD/ST addresses which are broken up like this:

      addr1[0..3]  addr1[4]  addr1[5..11]  addr1[12..48]
      addr2[0..3]  addr2[4]  addr2[5..11]  addr2[12..48]

  where the relationship between addr1 and addr2 is: addr1[5..11] + 1 == addr2[5..11]

* addr1[0..3], in combination with the LD/ST len (1/2/4/8), is turned into a bytemap mask, 24 bits in length. this bytemask is broken down into two halves:

      bytemask1, bytemask2 = map_addr(addr, LDST_len)    # [0..15], [16..23]

  i.e. anything that does not fit fully into bytemask1 is a "misaligned" LD/ST and the remainder overflows into bytemask2.

* if addr[4..11] == 0b111111111 and bytemask2 is non-zero, this indicates a "misaligned major page fault". this is a situation that we are *not* going to deal with (and it has been catered for in the 3.0B spec)

* all 16 FU LD/ST re-encodings of the (addr, LDST_len) are lined up in a table. this table breaks down, alternating between:
  * FU # aligned or FU # misaligned
  * addr1[5:11] or addr2[5:11]
  * addr[12:48] for *BOTH*
  * bytemap1[0:15] or bytemap2[0:15]
  * data1[0:15] or data2[0:15]

note that addr[4] is *not* included in this because it is used to select whether L1 cache bank #0 or #1 is to be used.

the algorithm for merging LD/STs into *one single L1 cache line* is:

1) with a PriorityPicker, find the index (row) of the first valid LD/ST request
2) for all entries after that row, compare Addr[5:11] and Addr[12:48]
3) if there is a "match" on both, OR the byte-mask for that row onto the output.

that's it. that's really all there is to it.

one thing that's important to note: there are only actually *eight* comparisons of addr[12:48] needed (not 16), because the addr[12:48] is *identical* for every *pair* of rows. that however is still *seven* potential 36-bit CAM hits (seven lots of 36-bit XOR gates), which is a hell of a lot.

if we could somehow use the L1 "tag" in place of Addr[12:48], that would save a huge amount of power. unfortunately, every way i can think of that would get the tag *into* L0 is either equally power-consuming, or results in multi-cycle latency. if we could reliably use a hash instead, i would suggest it; however, unfortunately, the consequences of a collision are too detrimental.

the "sensible" option that does not have too detrimental an effect on performance is: reduce the number of LD/ST FUs to 6. that would result in only 12 rows.
i just realised something:

* misaligned LD/STs will create a pair of LD/STs (on each FU) where it is *guaranteed* that Addr[4] for the LO-HI request pair will be mutually exclusive. if Addr[4] is 0 for the LO request of the pair, Addr[4] is *guaranteed* to be 1 for the HI request of the pair
* therefore splitting into L0 left/right LO/HI isn't going to work... unless we accept that the LO/HI requests actually *cross over*!

      FU0 LO -> L0 left  if Addr[4] == 0      else L0 right
      FU0 HI -> L0 right if its Addr[4] == 1  else L0 left

this has the strong disadvantage of not keeping together the addresses that we *know* have the same bits in the upper parts [12:48].

the other way to do it, therefore, is for the L0 to be only 8 rows deep, comprising *two* 128-bit entries, i.e. to be an integrated L0-odd and L0-even where previously it was two separate L0 caches. two L1 caches would then connect to that. i'll think it through.
done. https://libre-soc.org/3d_gpu/twin_l0_cache_buffer.jpg

the key is that there are only 8 rows (not 16), however each row contains *two* byte-masks, *two* addr[5:11] fields and *two* data sets, but *ONE* addr[12:48] per row. multiplexing is exclusively limited to 2-in 2-out, on a per-row basis. each FU, having two ports, needs *only* to perform multiplexing onto its *own row* (not to any other rows), using addr[4] to select which bank the two data sets are to go into.

the thing is that because the two data sets from each FU are consecutive addresses (one aligned, one mis-aligned, both coming from the exact same LD/ST), we *know* that they will both go simultaneously into the row: it's just a matter of which one goes into which bank. so we need a 2-in 2-out crossbar: no multiplexing, no arbitration or delay-gates needed.

i'll do the writeup now. https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
Writeup of the interface / API needed for LDST to talk to the LDST Computational Unit. Additional signals, such as length, type, atomic etc., will come through as well. these are in order of time sequence (edit this comment as necessary):

LD
--

* out: busy
* in: is_ld
* in: address
* in: len (1/2/4/8)
* in: go_addr
* in: go_die
* out: addr_ok (no exception will occur)
* out: addr_exc (exception type)
* out: ld_data
* out: ld_data_ok (to be raised for 1 cycle)

ST
--

* out: busy
* in: is_st
* in: address
* in: len (1/2/4/8)
* in: go_addr
* in: go_die
* out: addr_ok (no exception will occur)
* out: addr_exc (exception type)
* in: st_data
* in: go_st (raised for 1 cycle, must complete)

Notes
-----

* both LD/ST are identical up to the address ok/exception point
* LDSTCompUnit must wait for the busy signal to deassert before proceeding with signalling go_addr
* go_die pulls a hard reset (at any time), cancelling any outstanding LD, ST or address validation.
* addr is valid and will not change from the time that go_addr is asserted until a reset occurs. (therefore addr does not need to be internally latched)
* within the L0 Cache/Buffer, before address validation (and TLB, and L1 cache line reservation) have occurred, neither LD nor ST data transfer may take place.
* beyond addr_ok the operation MUST be successful. L1 Cache, TLB etc. MUST all have been done
* after addr_ok is sent, LD data can be placed on the output bus as soon as it is ready, without requiring "permission" from a go_ld signal.
* for a LD, LDSTCompUnit will wait for EITHER an addr exception OR a ld_data_ok.
* on any address-related exception, addr_exc is asserted HI as a single-cycle signal: both LDSTCompUnit and L0 Cache/Buffer go back to reset.
* ld_data_ok MUST NOT be asserted if addr_exc was asserted
* ld_data_ok must remain asserted for 1 clock and ld_data must be valid on that same clock. ld_data latching is the caller's responsibility
* L0 Cache/Buffer goes back to reset once ld_data_ok goes low again.
* ST, on the other hand, must have the ADDR checking (and L1 cache line loading) done separately from the write.
* ST therefore has two phases: addr ok (or not), followed by waiting for permission to carry out the actual store.
* the ST data in *will* be valid at the same clock where go_st is raised.
* go_st will *not* be raised by the LDSTCompUnit until *after* addr_ok/exc
* again: no exceptions can occur after the addr_ok signal is raised.
* go_st will remain asserted for one cycle. st_data *MUST* be accepted (and stored)
* L0 Cache/Buffer must return to reset after go_st goes back low.
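to make the list above concrete, here is a hypothetical nmigen Record gathering those signals in one place. the widths (48-bit address, 64-bit data, 1-bit addr_exc) are assumptions, and the real PortInterface in soc.git differs: this is purely illustrative.

    # Hypothetical Record capturing the LD/ST port signals listed above.
    from nmigen.hdl.rec import Record

    class LDSTPortSketch(Record):
        def __init__(self, name=None):
            super().__init__([
                ("busy_o", 1),                       # out: busy
                ("is_ld_i", 1), ("is_st_i", 1),      # in: operation type
                ("addr_i", 48), ("len_i", 4),        # in: address, len 1/2/4/8
                ("go_addr_i", 1), ("go_die_i", 1),   # in: address phase / cancel
                ("addr_ok_o", 1), ("addr_exc_o", 1), # out: ok / exception
                ("ld_data_o", 64), ("ld_data_ok_o", 1),  # out: LD result, 1 cycle
                ("st_data_i", 64), ("go_st_i", 1),   # in: ST data, 1-cycle strobe
            ], name=name)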
i may need to implement a very basic (working) version of the API in http://bugs.libre-riscv.org/show_bug.cgi?id=216#c26 - effectively a thin "wrapper" around nmigen Memory rdport and wrport. it's not quite a perfect match for nmigen rdport/wrport, as the address phase needs latching in, separately from how r/w data is handled. i'll make sure that the (decoded) opcode is passed over because it contains crucial information (not just the ld/st-length).
i've added the "PortInterface" which connects between the LDST Computational Unit and the L0 Cache/Buffer. https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=80f66556795c80603cf7b33140adc27e01e37155 this is the "interface" (API) which is implemented in comment #26 and i put here as well: https://libre-soc.org/3d_gpu/architecture/memory_and_cache/ i've also added a "subset" of the POWER decode flags that come directly from decoding the instruction, only passing those that are relevant to actual LD/ST operations: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66381f20d789aaf74511c2d6a892093940f34ce5 if there's anything missing from that it means that first the actual instruction needs to be added to the POWER decoder *first*.
jacob on looking at minerva they did something which i think would be really useful to copy. they created "cacheless" memory classes and "cached" versions conforming to the same interface. this allows an incremental approach where testing of one component is not critically dependent on another that hasn't been written yet.
https://www.youtube.com/watch?v=6Yiyw4abJpE

a walkthrough video which explains the components (the ones that already exist) and explains what's needed. crucially, the existing components provide guarantees which mean that, by the time it comes to what the L0 Cache/Buffer has to do, it is a ridiculously simple job.

i forgot to mention: for the step after the PriorityPicker - the "selection" of the units which are "above" that first-picked row - we actually already have some code which does that: michael wrote it, for the LE/GE/EQ "PartitionedSignal" comparators. the class is called "RippleLSB":

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/ripple.py;h=032512047f2e58c3ce50bc869e4a064738647943;hb=cec6f51e8b2a1ba2c2f83965be7219885ea163f0#l13

this can be moved to nmutil.
(In reply to Luke Kenneth Casson Leighton from comment #30) > the class is called "RippleLSB": > https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/ > ripple.py;h=032512047f2e58c3ce50bc869e4a064738647943; > hb=cec6f51e8b2a1ba2c2f83965be7219885ea163f0#l13 > > this can be moved to nmutil. done https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/ripple.py;hb=HEAD
jacob i've added documentation of the requirements of PortInterface: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=6404cf302734700514e7553bdf6c5a5df5badeb8 i'm doing a "dummy" one at the moment which will implement this API, only allowing the one (and only one) Memory read/write per cycle. this makes it clear exactly where the "merging" needs to take place. putting the (twin/quad) L1 Caches in place once the L0CacheBuffer has a single-cycle 128-bit-wide interface should be dead straightforward.
okaaay, it's messy but functional, and, importantly, illustrates the API, jacob. i've not put in actual LD/ST width yet, so the tests "pass" based on storing a 16-bit value into a hard-coded-16-bit-wide "Memory", the important thing is the signalling and API conformance, illustrating the timing that's required (and expected) by LDSTCompUnit. at least this allows me to move on to the (new, ridiculously-redrawn) LDSTCompUnit. we'll need to coordinate when it comes to adding twin ports (for misaligned) however it may just be ok there to assume the misaligned port is never used by LDSTCompUnit as the 2nd port is being added to L0CacheBuffer, get the unit tests working for L0CacheBuffer, then add it to LDSTCompUnit, and *then* add unit tests to LDSTCompUnit that try using it.
What do you think of the idea to write a single-op-at-a-time memory interface with all the caches & stuff to work with the rest of the SoC? it will be much simpler to get right and not take as much time, then, after that works, we can work on the much more complex multi-op version. That way, we will have something that works even if the multi-op version doesn't end up working in time for the october tape out.
The proposed single-op-at-a-time interface would work through PortInterface, since you already wrote a lot of the code needed to work with it.
(In reply to Jacob Lifshay from comment #35)

> The proposed single-op-at-a-time interface would work through PortInterface, since you already wrote a lot of the code needed to work with it.

the code i wrote is sufficient to get me to the point where i can test LDSTCompUnit. it already does exactly what you propose. as a concept, it is "technically correct" and would actually "not corrupt memory", however its performance would ultimately make it a waste of effort. dual 128-bit memory interfaces added to it would be only 10% utilised, because they would receive only one 8-64 bit request at a time. we might as well just use minerva's L1 code directly, in full, with only one interface, the only change being to expand it from 32 to 64 bit. in addition, misaligned requests would not be handled at the hardware level, leaving us non-compliant with the POWER spec.

bottom line is that the current code is for testing purposes and prototyping. its usefulness in particular is as a "mock test", i.e. to ensure that a single LDSTCompUnit "works" - and that is why i wrote it, *not* to actually put it into a production chip. it can in no way be used to *properly* test the parallel capability of the engine because it will only do one request at a time.

if we want to be taken seriously and want the performance of a GPU, we need the twin buffer implemented in the next week. i will have the LDSTCompUnit done by then and we will need to begin testing of parallel LD/STs.
(In reply to Jacob Lifshay from comment #34)

> What do you think of the idea to write a single-op-at-a-time memory interface with all the caches & stuff to work with the rest of the SoC?

as an intermediary step this is a good idea. the code i wrote yesterday already does exactly that (and is why i did it. it "works". performance sucks.) however stopping at a level that does not meet the requirement of keeping those twin 128-bit L1 caches fully occupied on every cycle is not.

> it will be much simpler to get right and not take as much time, then, after that works, we can work on the much more complex multi-op version. That way, we will have something that works even if the multi-op version doesn't end up working in time for the october tape out.

yes. although it will suck. strictly speaking, just as in minerva, if you look at the code there exists "cache bypassed" LoadStore code. by using that code directly we also have the option to switch off the L1 cache, making unit testing much easier.

and to emphasise, again (i have mentioned this about four times but have not received a response): the minerva L1 code and its associated compile-time reconfigurable LDST Interface does *not* need duplicating. it is good enough as-is and can simply be expanded to 64 bit then quadrupled up, or expanded to 128 bit and doubled up, with 128-to-64-bit Wishbone Arbiters dropped on the front.

with only two well-defined interfaces needing to be connected (multiple PortInterfaces on one side and multiple minerva LDSTCache interfaces on the other), this task is much simpler and more straightforward than it seems.
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=65a2218914d3f9a82d161c4cb946e5332d49b666 i've added a TODO class "DualPortSplitter" - actually it should probably kinda be that LDSTSplitter should be adapted to conform to PortInterface. can you do that task, jacob? it's a simple self-contained task that's an important step - no need to examine any of the other code.
see http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/007196.html a new class in soc/experiment/l0_cache.py is needed which implements that core algorithm: the one that uses a PriorityPicker to select one of the "active" LD/STs, then use that to OR-select an output. please assume the following inputs: * an array of 8 Signals of length 8 bits which tell you which addresses have matched against which other addresses. ( we call this the "address-match" array) * an array of 8 Records containing {16 bytes data, 16 "enable" lines} and the following output: * a Record containing {16 bytes data, 16 "enable" lines} in the address-match array, you want to find the first non-zero line. once you have that index, use it to OR the 16 bytes and 16 "enables" from the 8 records, into the output.
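a plain-python reference model of that behaviour (handy as a comparison point for the unit test later on); data_merge is an invented name, not the soc.git class:

    # Reference model for the merge described above: find the first non-zero
    # row of the 8x8 address-match array, then OR together the data and enable
    # lines of every record whose column bit is set in that row.
    def data_merge(addr_match, records):
        """addr_match: list of 8 ints (8-bit row masks);
        records: list of 8 (data, en) tuples; returns merged (data, en)"""
        for row in addr_match:
            if row:                               # first non-zero row
                data, en = 0, 0
                for j in range(len(records)):
                    if (row >> j) & 1:            # column j matched this row
                        data |= records[j][0]
                        en |= records[j][1]
                return data, en
        return 0, 0                               # idle: nothing to merge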
Tobias: i've added a stub class "DataMerger". please can you focus on implementing that. it should be approximately... 30 lines of code, maximum. start by committing a class with the Record Layout to be used by the input (Nx) and output (1x)
On Sun, May 24, 2020 at 8:23 PM Tobias Platen <libre-soc@platen-software.de> wrote:

> this evening:
>
> I began working on the elaboratable function of the DataMerger class, which I still haven't committed yet.
> If all bits in addr_array_i are zero the output will be zero too as nothing needs to be merged.

that would be correct. it would indicate that the LD/ST unit is idle.

> I'm now asking myself whether the interface is correct, and how the DataMerger is used.

it is used as in the diagram in the video. it is also the piece marked "Priority Picker" in this diagram:
https://libre-soc.org/3d_gpu/180nm_single_core_testasic_memlayout.jpg
(In reply to Luke Kenneth Casson Leighton from comment #41)

> > I began working on the elaboratable function of the DataMerger class, which I still haven't committed yet.
> > If all bits in addr_array_i are zero the output will be zero too as nothing needs to be merged.
>
> that would be correct. it would indicate that the LD/ST unit is idle.

hmmm... the 2D NxN addr_array_i indicates address matches. if we assume that in the array the diagonal (X=Y, i.e. addr[n,n]) will always be set for any given LD/ST, then there will always be one active LD/ST. this is kind-of "the address matches against itself".

thus, if there is only one LD/ST, and it is from unit 5, then addr_array_i[5,5] will be TRUE.

thus, yes, we can assume that when all bits in addr_array_i are zero, the LD/ST unit is idle.
Created attachment 63 [details]
over-run 80 chars

tobias, hi, code-review:

* please keep to under 80 chars. the best way to do this is to make sure that the editor window is only 80 chars wide.
* please do not make multi-purpose commits. the commit contains about a hundred lines of undocumented whitespace changes... *AND* contains code changes
* range(0, array_size) does not need the "0,". just range(array_size)
* the 2D array can actually be done as a 1D array of Signals of length N

      for i in range(0, array_size):
          ul2 = []
          for j in range(0, array_size):
              ul2.append(Signal())

  can just be:

      for i in range(array_size):
          ul2.append(Signal(array_size))

* Signals all need to be reset_less=True (with very few exceptions)
* Signals in loops do not end up being named properly ($1, $2), which gets really annoying. add a parameter name="addr_match_%d" % j
* the traditional industry-standard short-name in hardware for "byte enable" is just "en":

      layout = (('data', 128), ('en', 16))
excellent, tobias. i added some docstrings, please do git pull and carry on. i just had to redo the docstring i just wrote after noticing (only now) that i'd spent several days using a macOS terminal at 86x30, not 80x30. sigh :) anyway: the docstring should now show clearly what's needed.

commit db13193e20441c6988e7a1b3969979461136eca3
Author: Tobias Platen <tplaten@posteo.de>
Date:   Mon May 25 15:41:49 2020 +0200

    refactoring (see #216 Comment 43)
tobias, i just noticed in soc.regfile.regfile.py, this code:

    def elaborate(self, platform):
        m = Module()
        for (regs, p) in self._rdports:
            #print (p)
            ror = treereduce(list(regs))
            m.d.comb += p.data_o.eq(ror)

that's pretty much exactly what is needed for DataMerger. the ORing of all of the data together (from only those with a "match" in the selected row, though) is "it". however bear in mind, both the enable *and* data signals need to be merged out. you may need to do them separately: one treereduce for the en and another for the data, both "optionally" controlled by the match on the selected row.
tobias, the DataMerger class looks good as a first cut: it looks like it does the job. couple of things:

----

    + for j in range(self.addr):

this should be self.array_size

----

    + with m.If(self.addr_match_i[j]>0):

instead of > 0, bool() can (should) be used:

    with m.If(self.addr_match_i[j].bool()):

-----

    + with m.If(self.addr_match_i[idx][j] && valid):

ah. in nmigen, you can't use boolean operators (&& or ||) because there is no operator-overload for "bool". a PEP was actually raised about this and, rather short-sightedly, turned down. you therefore have to use the bit-wise operators, and make sure that the things that you are comparing are only 1 bit wide.

    with m.If(self.addr_match_i[idx][j] & valid):

------

    + self.data_o.eq(0)

this line is not needed. combinatorial circuits default to zero unless you set some of the Signal initialisation arguments.

-----

    + self.data_o.eq(self.data_i[j]|self.data_o)

this isn't python programming, unfortunately: you can't treat Signals in a loop as "sequentially-accumulating variables". this is why i said to use treereduce. or, actually, now you can import ortreereduce, from soc.regfile.regfile. so do this instead:

    l = []
    for j in range(self.array_size):
        select = self.addr_match_i[idx][j] & valid
        l.append(Mux(select, self.data_i[j], 0))
    comb += self.data_o.eq(ortreereduce(l))

see how that works? because hardware is effectively "statically-assigned variables", you can't really do dynamic-accumulation. or... you can, but it will create a massive chain of gates and the completion time will be awful. :)

so, use a Mux-in-a-loop, and tree-reduce them. strictly speaking we could hypothetically use the 2nd parameter of nmutil.utils.treereduce to actually do the Mux there, reducing things down to maybe one or two lines of code, however that's getting obtuse and "too clever" :) too clever means the code becomes unreadable.
hi tobias, saw the latest commit, getting there.

------

    comb, sync = m.d.comb, m.d.sync

sync is not needed

-------

    #(1) pick a row
    m.submodules.pick = pick = PriorityEncoder(self.array_size)
    for j in range(self.array_size):
        with m.If(self.addr_match_i[j].bool()):
            pick.i.eq(pick.i||(1<<j))

again: as in the previous comment, this is not python (and the comb += is missing at the front, which means that the nmigen AST is discarded, doing nothing). everything is parallel hardware, not sequential python. the nmigen AST captures that parallel hardware definition.

so, here, we want each bit of the PriorityPicker input to connect to one bit of the test of whether a line in addr_match_i is empty/non-empty. therefore, we need to assign addr_match_i[j].bool() to each bit of pick.i:

    # assign each priority-picker input bit the test of whether the
    # address match line is non-empty.
    for j in range(self.array_size):
        comb += pick.i[j].eq(self.addr_match_i[j].bool())

simple as that.

-------

    self.data_o.eq(ortreereduce(l))

the import is missing.

-----

fix those and it'll be time to move on to a unit test.
----------

hmmm...

    l = []
    for j in range(self.array_size):
        select = self.addr_match_i[idx][j] & valid
        l.append(Mux(select, self.data_i[j], 0))
    comb += self.data_o.eq(ortreereduce(l))

this would be better / clearer:

    with m.If(valid):
        l = []
        for j in range(self.array_size):
            select = self.addr_match_i[idx][j]
            l.append(Mux(select, self.data_i[j], 0))
        comb += self.data_o.eq(ortreereduce(l))

-----

return m is missing.
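pulling the corrections from the last few comments together, the class might end up looking roughly like this. a sketch only, assuming the names used in the thread: addr_match_i as an nmigen Array (so it can be indexed by the picked row), DataMergerRecord-style {data, en} Records, and ortreereduce from soc.regfile.regfile reducing a list of Values with OR.

    # Rough consolidation of the corrections above (illustrative sketch only).
    from nmigen import Elaboratable, Module, Signal, Array, Mux
    from nmigen.hdl.rec import Record
    from nmigen.lib.coding import PriorityEncoder
    from soc.regfile.regfile import ortreereduce

    class DataMergerSketch(Elaboratable):
        def __init__(self, array_size):
            self.array_size = array_size
            layout = [("data", 128), ("en", 16)]
            # NxN address-match array (an Array so the picked row can index it)
            self.addr_match_i = Array(Signal(array_size, reset_less=True,
                                             name="addr_match_%d" % i)
                                      for i in range(array_size))
            self.data_i = Array(Record(layout, name="data_i_%d" % i)
                                for i in range(array_size))
            self.data_o = Record(layout, name="data_o")

        def elaborate(self, platform):
            m = Module()
            comb = m.d.comb
            # (1) pick the first non-empty address-match row
            m.submodules.pick = pick = PriorityEncoder(self.array_size)
            for j in range(self.array_size):
                comb += pick.i[j].eq(self.addr_match_i[j].bool())
            valid = ~pick.n
            # copy the picked row out so its bits can be indexed below
            row = Signal(self.array_size, reset_less=True)
            comb += row.eq(self.addr_match_i[pick.o])
            # (2) OR together the data/en of every record matched in that row
            with m.If(valid):
                l = [Mux(row[j], self.data_i[j], 0)
                     for j in range(self.array_size)]
                comb += self.data_o.eq(ortreereduce(l))
            return m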
I've added a unittest for DataMerger and I get the following import error: cannot import name 'treereduce' from 'nmutil.util' I git pulled the nmutil repo but there were no changes.
(In reply to Tobias Platen from comment #49) > I've added a unittest for DataMerger and I get the following import error: > cannot import name 'treereduce' from 'nmutil.util' > I git pulled the nmutil repo but there were no changes. see earlier comment. "this is why i said to use treereduce. or, actually, now you can import ortreereduce, from soc.regfile.regfile."
(In reply to Luke Kenneth Casson Leighton from comment #50) > "this is why i said to use treereduce. or, actually, now you can import > ortreereduce, from soc.regfile.regfile." saw commit d42161569bfaad5f0703f67d41634fd460719c1b excellent. (btw remember to do whitespace and actual code-changes separately) (In reply to Luke Kenneth Casson Leighton from comment #48) > ----- > > return m is missing. return m is still missing. this will prevent the module from working, because the AST - which was constructed by the elaborate function - is entirely blank (None), due to the return result being implicitly None.
tobias: the DataMerger class is coming along well. now that the unit test passes (doesn't break anything) feel free to enable it by default, and then use it to confirm, incrementally, from that point on, that what you commit will now at least "not break anything" :) btw do give each DataMergerRecord() a unique name. this so that when people inspect the graphviz, each DataMergerRecord will have a prefix {name}_data etc. by which they can be identified correctly with the python source code. next bit: unit test. after that: formal proof.
commit 23b682daa011ace5cfbfbe2b60ece7369cbe403c Author: Tobias Platen <tplaten@posteo.de> Date: Sat May 30 21:11:02 2020 +0200 unit test for DataMerger looks good, tobias. i added some use of "Settle()" which should be used (as we just learned thanks to Cesar highlighting it) after any combinatorial logic changes, rather than doing a full one-cycle "yield"
Before running formal verification, I first had to upgrade nmigen, because my code uses Records. When using an old version of nmigen, I got the following error:

> Object (rec <unnamed> data en) cannot be used as a key in value collections

The second issue is a bug in nmigen when it invokes sby with an empty working directory, which is the current directory in this case. I set it to "." with the following lines added to the code:

    + if(len(spec_dir)==0):
    +     spec_dir="."
proof_datamerger.py:

    for b in range(0,8):

just range(8). there is no need to specify a start range of 0.

    comb += dut.addr_array_i[j].eq(AnySeq(dut.array_size))
    comb += dut.data_i[j].eq(AnySeq(16+128))

AnySeq is for synchronous data where the value changes on every clock. use AnyConst instead, which will not change.
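for reference, the overall shape of the harness with that fix applied (a sketch only - the names follow the snippets above: addr_array_i, data_i, data_o with an "en" field - not the committed proof_datamerger.py):

    # Sketch of the harness shape with AnyConst instead of AnySeq (illustrative).
    from nmigen import Module, Cat
    from nmigen.asserts import AnyConst, Assert

    def build_proof(dut):
        m = Module()
        comb = m.d.comb
        m.submodules.dut = dut
        for j in range(dut.array_size):
            # AnyConst: the solver picks a value that stays fixed for the proof
            comb += dut.addr_array_i[j].eq(AnyConst(dut.array_size))
            comb += dut.data_i[j].eq(AnyConst(16 + 128))
        # one easy property: if no address-match row is set, nothing is merged
        with m.If(Cat(*dut.addr_array_i) == 0):
            comb += Assert(dut.data_o.en == 0)
        return m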
fixed
(In reply to Tobias Platen from comment #56)

> fixed

fantastic.

you'll see i added some comments in the DualPortInterface. i think what might need to be done is:

* to have the incoming PortInterface.oper_i.data_len be one unary bit only (0b1, 0b10, 0b100, 0b1000)
* however for the outgoing ones, due to the "misalignment", they can be set to 0b11, 0b111, 0b101, and so on. actually this is ok! :) it is a 4-bit field after all.

so basically:

* one LD/ST comes in
* it will be routed to EITHER left (bit 4 == 0) OR right (bit 4 == 1)
* the misalignment possibility results in a *SPLIT* LD/ST that **ALWAYS** goes out the **OTHER** port.
* if both ports are to be used, the data from both has to be recombined.

now, there are a few things that are different from LDSTSplitter:

1) LDSTSplitter rather hilariously treats the data as 16-bit, not 16-byte. this is easily fixable by using Repl on the mask and multiplying by 8 and so on

2) LDSTSplitter does *NOT* do the correct bit-4 routing. it is HARDCODED to ALWAYS assign non-misaligned data to port 0:

       with m.If(mask2 == mzero):
           comb += self.valid_o.eq(self.sld_valid_o[0])   <-- wrong

   fixing that is slightly complicated in that ld1 and ld2 need to dynamically swap over.

3) LDSTSplitter, although it creates the mask correctly (using LenExpand), does not "re-create" the *binary* version of that, which is the key information that needs to go back into PortInterface.oper_i.data_len along with the "altered" addresses.
worked out that this:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/minerva/wishbone.py;hb=HEAD

is exactly what we need to do the arbitration of the odd/even dual ports. actually there could be four 64-bit wishbone buses, each seeking arbitration onto the one 64-bit memory bus:

* LSB 64-bit half of the 128-bit ODD L0CacheBuffer Port interface
* MSB 64-bit half of the 128-bit ODD L0CacheBuffer Port interface
* LSB 64-bit half of the 128-bit EVEN L0CacheBuffer Port interface
* MSB 64-bit half of the 128-bit EVEN L0CacheBuffer Port interface

the detection as to whether any given one of these should be enabled is simply done by ORing the relevant sections of the 16-bit byte-enable column line (wen / ren). this comes directly from the "en" lines from DataMerger:

* half (8) of the DataMerger byte-enable lines will go to enabling one of the 64-bit Wishbone interfaces
* the other half (8) will go to the other one.
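the enable derivation is tiny - a sketch in plain python (bus_enables is an invented name):

    # Each 64-bit Wishbone request is active only if its half of the 16-bit
    # byte-enable column (from DataMerger) is non-zero.
    def bus_enables(en16):
        """en16: 16-bit byte-enable; returns (lsb64_enable, msb64_enable)"""
        lsb_active = (en16 & 0x00ff) != 0   # bytes 0-7  -> LSB 64-bit bus
        msb_active = (en16 & 0xff00) != 0   # bytes 8-15 -> MSB 64-bit bus
        return lsb_active, msb_active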
i've moved the responsibility for byte-reversal from L0CacheBuffer into LDSTCompUnit. this means that PortInterface can be simplified (no need to have the full CompLDSTOpSubset) and it begins to look much more like minerva LoadStoreInterface
(In reply to Luke Kenneth Casson Leighton from comment #59)

> this means that PortInterface can be simplified (no need to have the full CompLDSTOpSubset) and it begins to look much more like minerva LoadStoreInterface

[master bafa9f0] remove CompLDSTOpSubset, replace with just data_len.

done. quite straightforward (very quick) as there is not much information needing to be passed over.

whilst LoadStoreInterface has x_mask (byte-enable lines) and expects data to be wishbone-bus-aligned, i'm recommending that we keep PortInterface binary-encoded and process it on-demand through LenExpand to create the x_mask (byte-enable lines). the reason is that passing around a 16-bit expanded bit-mask means a lot more wires, and we have a large number of PortInterfaces to route.
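as a behavioural model of that on-demand expansion (a sketch of what LenExpand does, not the actual class; len_expand is an invented name):

    # Binary-encoded data_len (1/2/4/8 bytes) plus the address LSBs is expanded
    # into the 16-bit byte-enable mask only at the point where it is needed.
    # (Spill-over into the next line is handled by the twin-port split, not here.)
    def len_expand(data_len, addr_lsbs):
        """data_len in bytes (1/2/4/8), addr_lsbs = addr[0:4]; 16-bit mask"""
        return (((1 << data_len) - 1) << addr_lsbs) & 0xffff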