see http://bugs.libre-riscv.org/show_bug.cgi?id=132#c9
I had an idea about this last night. If it were acceptable that all the partitions be the same size, then something like this seems reasonable: c b c a c b c xxx|xxx|xxx|xxx|xxx|xxx|xxx|xxx partition bits: abc Guaranteeing that all the partitions are the same size and naturally aligned is just a matter of ensuring that if a partition bit is set, then all previous partition bits must also be set. If we need to have different width partitions, something like this might work, though it's a bit messier: a b c d e f g xxx|xxx|xxx|xxx|xxx|xxx|xxx|xxx partition bits: dbfaceg or partition bits: dbacfeg My idea was that it'd be sort of like a serialized binary tree. To ensure that the partitions are naturally aligned, a given bit can only be set if its parent bits are set, which seems harder to check/enforce than the method above
ok so the idea is, that the only options for partition sizes are: * 64 * 32,32 * 16,16,16,16 * 8,8,8,8,8,8,8,8 is that the idea? if so, this restricts us to not being able to run 32-bit arithmetic in one "lane" and 16-16 bit arithmetic in the other. the way that the 6600 engine and register file is to be arranged is that the register file is subdivided 32-HI, 32-LO times two. so, four ports, but only 32-bit-wide. if you want to do 64-bit arithmetic, you have to use *two* of those ports. also, each *byte* of the register file has its own write-enable line. this so that on vectorised instructions, if there are 32-bit instructions that happen to hit the 32-LO register port, the 32-*HI* port can be used for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN* to occur (or are deliberately arranged to occur) on the exact same cycle and happen to be the exact same operation. now, whether these conditions turn out to be reasonable or not is another matter, hence why, yeah, it should be fine to consider this, and thus perhaps greatly simplify the partitioning. would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8 adds? i don't honestly know.
(In reply to Luke Kenneth Casson Leighton from comment #2) > ok so the idea is, that the only options for partition sizes are: > > * 64 > * 32,32 > * 16,16,16,16 > * 8,8,8,8,8,8,8,8 > > is that the idea? Yes > > if so, this restricts us to not being able to run 32-bit arithmetic in one > "lane" and 16-16 bit arithmetic in the other. > > this so that on vectorised instructions, if there are 32-bit instructions > that happen to hit the 32-LO register port, the 32-*HI* port can be used > for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN* > to occur (or are deliberately arranged to occur) on the exact same cycle > and happen to be the exact same operation. > > now, whether these conditions turn out to be reasonable or not is another > matter, hence why, yeah, it should be fine to consider this, and thus > perhaps greatly simplify the partitioning. > > would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8 > adds? i don't honestly know. This would make scheduling a bit more complicated, but it might be beneficial to do this only for some modules. I don't think it'd make a huge difference for the adder or comparator to use an aligned partition, but it might simplify the shifter a good bit (because it eliminates a couple of the matrix entries).
(In reply to Michael Nolan from comment #3) > (In reply to Luke Kenneth Casson Leighton from comment #2) > > ok so the idea is, that the only options for partition sizes are: > > > > * 64 > > * 32,32 > > * 16,16,16,16 > > * 8,8,8,8,8,8,8,8 > > > > is that the idea? > > Yes > > > > > if so, this restricts us to not being able to run 32-bit arithmetic in one > > "lane" and 16-16 bit arithmetic in the other. > > > > > this so that on vectorised instructions, if there are 32-bit instructions > > that happen to hit the 32-LO register port, the 32-*HI* port can be used > > for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN* > > to occur (or are deliberately arranged to occur) on the exact same cycle > > and happen to be the exact same operation. > > > > now, whether these conditions turn out to be reasonable or not is another > > matter, hence why, yeah, it should be fine to consider this, and thus > > perhaps greatly simplify the partitioning. > > > > would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8 > > adds? i don't honestly know. > > This would make scheduling a bit more complicated, i'm anticipating it to be quite straightforward, by way of pushing the "predicate bits" directly into the regfile write-enable lines, and to breaking down operations into 32-bit "chunks". so, where a sequence of elements (say 16 bit) are to be ADDed, that will be "converted" into 2x 16-16 SIMD operations: one will go to HI-32 regfile, the other will go to LO-32 regfile. it's pretty straightforward. it'll be slightly wasteful where the vector length is not an exact multiple of 32-bits (3x8 for example) however as a first iteration i'm not that concerned. > but it might be beneficial to do this only for some modules. honestly it would complicate the decode phase, along these lines: "if operation == NOT_CAPABLE_OF_DYNAMIC_PARTITIONING { do something else }" whether that's ok compared to the complexity of the partitioned ALU ops? > I don't think it'd make a huge > difference for the adder or comparator to use an aligned partition, but it > might simplify the shifter a good bit (because it eliminates a couple of the > matrix entries). it does... however i think the Switch statement really has to go. if you run "proc" "opt" then "show top" on a 64-bit shifter, it's awful. the MUX chain is absolutely dreadful: each "switch" statement gets turned into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the results *chained* between each! by comparison, for the gt_combiner, the mux-chains aren't 64-bit long, they're only 7-long, because they're done on the *partition* gates, not per-permutation-of-all-values-of-partitions. you did manage to convert the "switch statement" from the original that i did, of eq_combiner, and i am confident that the same thing can be done here, based on the tables: https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/ the only thing being, each table (each column output, o0...o7) is computed independently, you can't share data *between* each column, and that's fine.
okaaay i decoded those tables a bit, into "if" statements: if True: o3 = a0b0[31:24] if ~p0: o3 |= a1b0[23:16] if p0: o3 |= a1b1[23:16] if ~p0&p1: o3 |= a2b1[15:8] if p0&p1: o3 |= a2b0[15:8] if p1: o3 |= a2b2[15:8] if ~p0&~p1&~p2: o3 |= a3b0 if p0&~p1&~p2: o3 |= a3b1 if p1&~p2: o3 |= a3b2 if p2: o3 |= a3b3 that's quite easy to do as a parallel tree of ORs (see the treereduce function i created which should be moved to nmutil, really) https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/regfile/regfile.py;h=b1d6f1c6717351f7b38be806bb25462e51f3bbf0;hb=6f35af64465700971bce1e4dae1f7e69c2ec25ad#l68 aNbM is basically matrix[N][M] is that beginning to look a little clearer? i'll do the other tables as well