dynamic shift needed very similar to mul https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/

i've started adding in some tables which make it clear what the output is in each partition. it turns out that we'll need *three* variants of partitioned shift: * full dynamic (a and b are both PartitionedSignal) * vector-scalar (a is PartitionedSignal but b is a "global" (ordinary Signal) * vector-const. this last one is much more straightforward than the others. next order of difficulty is the vector-scalar one, and the full dynamic one is a pig.

michael, would you like to have a go at this one next? i've put some notes on the wiki page as to how i think it *might* be done.

I've got good news and bad news. The good news is I have a dynamic shifter working in part_shift/part_shift_dynamic.py. The bad news is it works out to 8500 wire bits to do a 64 bit, 8 partition shifter (ouch!). I currently have a gigantic switch statement to select which row of the matrix is needed for each element (similar to the old part_cmp), so optimizing this would probably help greatly with the number of gates needed.

(In reply to Michael Nolan from comment #3) > I've got good news and bad news. The good news is I have a dynamic shifter > working in part_shift/part_shift_dynamic.py. The bad news is it works out to > 8500 wire bits to do a 64 bit, 8 partition shifter (ouch!). *gates, not wire bits

(In reply to Michael Nolan from comment #3) > I've got good news and bad news. The good news is I have a dynamic shifter > working in part_shift/part_shift_dynamic.py. superb. > The bad news is it works out to > 8500 gates bits to do a 64 bit, 8 partition shifter (ouch!). w00t! :) > I currently have a gigantic switch statement to select which row of the > matrix is needed for each element (similar to the old part_cmp), so > optimizing this would probably help greatly with the number of gates needed. the yosys graph is wonderful: like a dog munched your spaghetti bolognaise dinner then threw it up when you held the dog up at waist-level and did a nice pirouette :) what would reduce the number of gates significantly is getting the size of b down, in the matrices. see "question: b1" here: https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/ if you look just at the last... err... column (where you are creating the outputs that you know are going to go into the MSB-partition (MSP? Most Significant Partition?) in that case, you know for a fact that shifting by greater than the width of that partition (let's say it's 8 bits wide), you *know* that anything that shifts by greater than 8 is always, always, always going to result in that "contribution" from the shift-matrix being zero. therefore, what you can do is *truncate* the B-input to 3 bits! but... butbutbut, that's not quite the full story, because actually what you want is MIN(B[partitionbits], 8). NOT, repeat, NOT: B[partitionbits] & (0b111) # (8-1 in binary == 0b111 mask) even when you are dealing with the LSB-partition (LSP? Least Significant Partition), you definitely do not want "A << B", you want "A << B[log2(len(output))-1:0]" "A<<B" will try to generate 64-64 shifters. A<< B[trunc] will generate (correctly) 64-7 shifters, because anything greater than 0b1000000 is *equivalent* to 0b1000000. some shifters take 6 bits (not 7) and assume that 0b000000 actually means "0b1000000", others will add 1 to the B operand... ... let's not worry about that for now, though, and focus on getting the B inputs truncated somewhat?

wait... no... yes, it looks like you've already done that (to some extent), at least limited the range of the matrix partial-results. however what's missing is the truncation of B to match the length of the output: comb += matrix[i][j].eq(a << b) start = end should be something like: outwid = matrix[i][j].shape()[0] bwid = math.ceil(math.log2(outwid + 1)) comb += matrix[i][j].eq(a << b[:bwid]) start = end or something like that. there's almost certainly at least two off-by-one bugs in that :)

one important point to remember is that we need to be able to cheaply implement the shift instructions, on RISC-V, at least, the shift amount's high bits are specifically ignored, so shifting by n bits actually gives the result of shifting by n % bit_len where bit_len is the number of bits in the field/register being shifted.

(In reply to Luke Kenneth Casson Leighton from comment #5) > the yosys graph is wonderful: like a dog munched your spaghetti bolognaise > dinner then threw it up when you held the dog up at waist-level and did a > nice pirouette :) ROFL, that's far more creative than most of my insults usually are. I would have just said it looks like sh*t. (In reply to Luke Kenneth Casson Leighton from comment #6) > wait... no... yes, it looks like you've already done that (to some extent), > at least limited the range of the matrix partial-results. > > however what's missing is the truncation of B to match the length of the > output: > > comb += matrix[i][j].eq(a << b) > start = end > > should be something like: > > outwid = matrix[i][j].shape()[0] > bwid = math.ceil(math.log2(outwid + 1)) > comb += matrix[i][j].eq(a << b[:bwid]) > start = end > > or something like that. there's almost certainly at least two off-by-one > bugs in that :) I suspect yosys is doing some of this already, since the size of my results are limited, but I can give it a try.

(In reply to Luke Kenneth Casson Leighton from comment #6) > wait... no... yes, it looks like you've already done that (to some extent), > at least limited the range of the matrix partial-results. > > however what's missing is the truncation of B to match the length of the > output: > or something like that. there's almost certainly at least two off-by-one > bugs in that :) This didn't seem to make too much of a difference. Before: Number of cells: 8608 After: Number of cells: 8523 I also eliminated unused entries from the shift matrix which did have some effect: Number of cells: 8047 I suspect there's a better way of calculating what matrix entry is needed than my giant-ass switch statement may help, it seems to be selecting the row corresponding to the last set bit of the gates.

yeh i ran "proc" then "opt" followed by "show top" and woooo did it take a looong time, the mux-cascades are absolutely frickin enormous. one of them must be 64 bits in length, which we can't possibly go live with, the latency will be insane. > it seems to be selecting the row corresponding to the last set bit of > the gates. that sounds about right, intuitively.

ok i've managed to limit the bitwidth of partial result shift matrix size to 64,56,48,...8 because obviously we don't need to include anything beyond that which will go into 64 bit: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx only needs to be: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx then, based on that, i'm trying to restrict the amount by which the B input actually shifts by, because xxxx << 4-or-greater is clearly *always* going to be 0000 therefore why bother trying, it's just wasting gates. however, that bit's not working yet. see "tshift".

> however, that bit's not working yet. see "tshift". ha, correction, actually it is :)

Looks like either a0faa5a or 6527ad39 made the problem slightly worse: Number of wires: 8866 the previous commit, bdca76073 had ~1k less gates: Number of wires: 7893

(In reply to Michael Nolan from comment #13) > Looks like either a0faa5a or 6527ad39 made the problem slightly worse: > Number of wires: 8866 > > the previous commit, bdca76073 had ~1k less gates: > Number of wires: 7893 we need to get rid of the switch statement. we can't possibly have a Mux-cascade of 64 Muxes, the gate latency will kill any possibility of the chip running over about... 200 mhz. i've just done the other three tables in https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/ and the pattern is beginning to emerge.

(in the gt_combiner etc. the mux-cascades are only 7 or 8 long, and that is "acceptable" gate latency.)

so it looks like you go: * start with ~p0&~p1&~p2&~p3...&~pN, and start with M=0 * select aNbM and OR that into the output, then increment M by 1 * take the front of the AND-chain, remove the invert on the FIRST item (~p0 becomes p0) * select aNbM and OR that into the output, then increment M by 1 more * DROP the front of the AND-chain (p0 on the first iteration) * repeat until you get down to a list of only one item, this should be when M=max: drop the invert and select aNbM. am i making any sense? :)

> it does... however i think the Switch statement really has to go. if > you > run "proc" "opt" then "show top" on a 64-bit shifter, it's awful. > the MUX chain is absolutely dreadful: each "switch" statement gets > turned > into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the > results *chained* between each! oh. > * start with ~p0&~p1&~p2&~p3...&~pN, and start with M=0 > * select aNbM and OR that into the output, then increment M by 1 > * take the front of the AND-chain, remove the invert on the FIRST item > (~p0 becomes p0) > * select aNbM and OR that into the output, then increment M by 1 more > * DROP the front of the AND-chain (p0 on the first iteration) > * repeat until you get down to a list of only one item, this should > be when M=max: drop the invert and select aNbM. > > am i making any sense? :) sort of, I'll give it a closer look when I work on it tomorrow.

ahno, it's even simpler than that. it's: * start off with every gate-bit inverted * next loop: front of list is not-inverted, rest of list is inverted * repeat until only one item left. i'll write a quick bit of python to print out the table.

ah! def decoderange(col, k): xs = (col-k)*8 return "%d:%d" % (xs+7, xs) def decodelist(l): res = [] for (idx, sgn) in l: sgn = " " if sgn else "~" res.append("%sp%d" % (sgn, idx)) return '&'.join(res) N = 4 M = 4 for col in range(M): r = decoderange(col, 0) print "if True: o%d = a0b0[%s]" % (col, r) for i in range(1,col+1): l = [] for j in range(i): l.append([j, False]) k = 0 s = decodelist(l) r = decoderange(col, i) print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r) k += 1 while l: l[0][1] = True s = decodelist(l) r = decoderange(col, i) print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r) k += 1 del l[0] print

(In reply to Michael Nolan from comment #17) > > into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the > > results *chained* between each! > > oh. btw that's why i advised to run yosys show, however i realise now it should include running proc and opt.

(In reply to Michael Nolan from comment #17) > > it does... however i think the Switch statement really has to go. if > > you > > run "proc" "opt" then "show top" on a 64-bit shifter, it's awful. > > the MUX chain is absolutely dreadful: each "switch" statement gets > > turned > > into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the > > results *chained* between each! Note that there are specific yosys optimizations for mux trees that greatly simplify them: muxpack is probably the one you want to try http://www.clifford.at/yosys/cmd_muxpack.html

(In reply to Jacob Lifshay from comment #7) > one important point to remember is that we need to be able to cheaply > implement the shift instructions, on RISC-V, at least, the shift amount's > high bits are specifically ignored, so shifting by n bits actually gives the > result of shifting by n % bit_len where bit_len is the number of bits in the > field/register being shifted. this needs to go into all the unit tests, and it needs to be done dynamically. so, when a *partition length* is 8, the bitmask is 0b111 on the b operand. when a partition length is 32, the bitmask is 0b11111 on b. this needs to be in the formal proof as well, michael. btw we're spending a lot of time on this one, i'm upping the budget.

(In reply to Luke Kenneth Casson Leighton from comment #22) > this needs to go into all the unit tests, and it needs to be done > dynamically. > so, when a *partition length* is 8, the bitmask is 0b111 on the b operand. > when a partition length is 32, the bitmask is 0b1111 on b. assuming you mean 0b11111 for 32

(In reply to Jacob Lifshay from comment #23) > assuming you mean 0b11111 for 32 yep well spotted - edited

I did something similar (I think?) to your description/python code, but broken up into a couple parts I first created a table of n partial results like so: # calculate shift amounts - s[0] = b[i] s[i] = Mux(gates[i-1], b[i], s[i-1]) r[i] = a[i] << s[i] for i in range(n) Then I calculated the results by: o[0] = r[0] o[1] = r[1] | Mux(gates[0], 0, r[0]) o[2] = r[2] | Mux(gates[1], 0, Mux(gates[0], 0, r[0])) etc. This reduced the gate count from 8k gates for a 64x8 shifter, to 1.7k, and I think I can do a bit better.

(In reply to Michael Nolan from comment #25) > I did something similar (I think?) to your description/python code, but > broken up into a couple parts excellent. it's getting complicated (the yosys graphs are unreadable when at 100% zoom) so once we're happy with it i'll break it down into modules. > I first created a table of n partial results like so: > > # calculate shift amounts - > s[0] = b[i] > s[i] = Mux(gates[i-1], b[i], s[i-1]) > > r[i] = a[i] << s[i] for i in range(n) ah rats, you took out the "truncation" that i put on B. ah well - that can (has to) go back in, btw, however with the Switch statement removed that's well on the way > Then I calculated the results by: > > o[0] = r[0] > o[1] = r[1] | Mux(gates[0], 0, r[0]) > o[2] = r[2] | Mux(gates[1], 0, Mux(gates[0], 0, r[0])) > etc. > > This reduced the gate count from 8k gates for a 64x8 shifter, to 1.7k, and I > think I can do a bit better. eeexxcellent, muhahaha oh.. err... i thiiink i see what you did. you actually selected the correct "B" to perform the shift of the output, based on the partition-info? ee.... :)

(In reply to Luke Kenneth Casson Leighton from comment #26) > ah rats, you took out the "truncation" that i put on B. > ah well - that can (has to) go back in, btw, however with > the Switch statement removed that's well on the way Yes, wanted to get it working without worrying about truncation complicating things. Working on adding it back in now though > oh.. err... i thiiink i see what you did. you actually selected > the correct "B" to perform the shift of the output, based on the > partition-info? > > ee.... :) Yup!

(In reply to Michael Nolan from comment #27) > (In reply to Luke Kenneth Casson Leighton from comment #26) > > > ah rats, you took out the "truncation" that i put on B. > > ah well - that can (has to) go back in, btw, however with > > the Switch statement removed that's well on the way > > Yes, wanted to get it working without worrying about truncation complicating > things. Working on adding it back in now though exccellent. remember, it's *sigh* not as straightforward as truncating (bit-wise)... well, you saw what i had in the original, with the comparators. a *bit-wise* truncation also needs to be *shudder* based on the partition length. i think we can safely round-up 24-bit and other non-power-of-two partition sizes, on the assumption that anyone trying to use this code for that purpose is a bit... unhinged :)

> exccellent. remember, it's *sigh* not as straightforward as truncating > (bit-wise)... well, you saw what i had in the original, with the comparators. > a *bit-wise* truncation also needs to be *shudder* based on the partition > length. This has been added to the shifter as well as the proof in the latest commit

(In reply to Michael Nolan from comment #29) > > a *bit-wise* truncation also needs to be *shudder* based on the partition > > length. > > This has been added to the shifter as well as the proof in the latest commit okaaay, i'll work it out for test_partsig.py

(In reply to Luke Kenneth Casson Leighton from comment #30) > okaaay, i'll work it out for test_partsig.py got it. bit messy, but it works.

just doing some review / commenting / etc. this one's quite important: diff --git a/src/ieee754/part_shift/part_shift_dynamic.py b/src/ieee754/part_shift/part_shift_dynamic.py index 5dd72ed..95c07d6 100644 --- a/src/ieee754/part_shift/part_shift_dynamic.py +++ b/src/ieee754/part_shift/part_shift_dynamic.py @@ -63,12 +63,14 @@ class PartitionedDynamicShift(Elaboratable): for i in range(len(b_intervals)): mask = Signal(b_intervals[i].shape(), name="shift_mask%d" % i, reset_less=True) - bits = [] + bits = Signal(gates.width-i+1, name="bits%d" % i, reset_less=True) + bl = [] for j in range(i, gates.width): - if bits: - bits.append(~gates[j] & bits[-1]) + if bl: + bl.append(~gates[j] & bits[j-i-1]) else: - bits.append(~gates[j]) + bl.append(~gates[j]) + comb += bits.eq(Cat(*bl)) comb += mask.eq(Cat((1 << min_bits)-1, bits) & ((1 << max_bits)-1)) shifter_masks.append(mask) if you "accumulate" something in a chain (bits, in this case), then assign it later (in mask), what you find is that the entire *chain* is duplicated... *multiple times*. by assigning each bit of "bits" to a Signal, and using the *signal* - not the chain - you don't end up with a massive cascading code-repetition. i deployed this trick in another area of the code as well. this one's quite important: it looks "really simple" to just do bits = bits & foo or bits.append(input & bits[-1]) however it's a pattern that really needs to be avoided. yes, found it: element = Mux(gates[i-1], masked, element) elmux = Signal(b_intervals[i].shape(), name="elmux%d" % i, reset_less=True) comb += elmux.eq(element) element = elmux that was another one where there was an accidental "repeated chain", this time from the input to Mux(). i'm trying to identify where the max() thing can be put back in. i think it's where shiftbits = math.ceil(math.log2(width)) that needs to be inside the loop, and it needs to be shiftbits = math.ceil(math.log2(width-intervals[i].start)) something like that. then, the b partial... shifter... needs to be set not to this: comb += shifter.eq(element) it needs to be m.If(element > shiftbits) comb += shifter.eq(shiftbits) m.Else() comb += shifter.eq(element) i think that's it. the reason is, there's absolutely no point trying to shift the data beyond the limit of the partial. also, partial needs to be limited to width-intervals[i].start the last column is only 8 bits, for example, so why is the partial result always set to 64 bit?

another one. again, the Mux-chain result = Mux(gates, partial, result) was creating a massive chain that was being duplicated: output[0] = result output[1] = Mux(gates, partial, result) output[2] = Mux(gates, partial, Mux(gates, partial, result)) ... output[N] = Mux(Mux(Mux(Mux(Mux(Mux(......)))))))))) this does *NOT* get "optimised out" so you have to be extremely careful to make absolutely sure that the input on these kinds of chains is *not* an "expression", but *only* a signal to which the expression has been *assigned*. index 9e214d0..aebcef5 100644 --- a/src/ieee754/part_shift/part_shift_dynamic.py +++ b/src/ieee754/part_shift/part_shift_dynamic.py @@ -145,6 +145,7 @@ class PartitionedDynamicShift(Elaboratable): print("select: [%d:%d]" % (start, end)) res = Signal(width, name="res%d" % i, reset_less=True) comb += res.eq(result) + result = res s,e = intervals[0] out.append(res[s:e])

ok i *think* i worked it out, and also did the split into separate modules. it's not perfect by any means. i tried running the static shift, it's not having any of it, btw. i'll add it into test_partsig.py ready for when it works?

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_shift/part_shift_scalar.py;hb=HEAD ah! needs to have the truncation on the b input, doesn't it? same as in the dynamic one, depending on partition width?

(In reply to Luke Kenneth Casson Leighton from comment #35) > ah! needs to have the truncation on the b input, doesn't it? same as in the > dynamic one, depending on partition width? Yep, you're right. Working on that now

(In reply to Michael Nolan from comment #36) > (In reply to Luke Kenneth Casson Leighton from comment #35) > > ah! needs to have the truncation on the b input, doesn't it? same as in the > > dynamic one, depending on partition width? > > Yep, you're right. Working on that now haha that's funny, you ended up importing ShifterMask, i'd managed to forget it could be used, even though i was just thinking, "hmm, the partition mask code would end up being duplicated between both modules" :)

I'm guessing this will need to be pipelined at some point, would you like me to give that a try?

(In reply to Michael Nolan from comment #38) > I'm guessing this will need to be pipelined at some point, would you like me > to give that a try? well, before that, we need to see if it's actually needed, by checking the longest path (longest chain). you rarely hear of processors breaking down a 64 bit shift into pipelines. looking at RocketChip's ALU, that's gone to 1.5ghz in 45nm and they've not done pipelining. and, because of the breakdown into smaller sections we might actually have *reduced* latency. btw scalar formal proof is failing, here python3 ieee754/part_shift/formal/proof_shift_scalar.py SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd proof_shift_scalar_shift/src; yosys -ql ../model/design.log ../model/design.ys" SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved using constant. SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved using constant. ... ... SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ## 0:00:00 Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:50): 0 SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ## 0:00:00 Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:51): 0 SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ## 0:00:00 Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:49): 32768 SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ## 0:00:00 Assert failed in top: ieee754/part_shift/formal/proof_shift_scalar.py:67

(In reply to Luke Kenneth Casson Leighton from comment #39) > btw scalar formal proof is failing, here > python3 ieee754/part_shift/formal/proof_shift_scalar.py > SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd > proof_shift_scalar_shift/src; yosys -ql ../model/design.log > ../model/design.ys" > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver > conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved > using constant. > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver > conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved > using constant. Both proofs work on my machine on commit d635235. I'm using yosys 0.9-1706 (b9dfdbbf).

(In reply to Michael Nolan from comment #40) > (In reply to Luke Kenneth Casson Leighton from comment #39) > > > > > btw scalar formal proof is failing, here > > python3 ieee754/part_shift/formal/proof_shift_scalar.py > > SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd > > proof_shift_scalar_shift/src; yosys -ql ../model/design.log > > ../model/design.ys" > > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver > > conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved > > using constant. > > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver > > conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved > > using constant. > > Both proofs work on my machine on commit d635235. I'm using yosys 0.9-1706 > (b9dfdbbf). yehh that can happen and i am using 0.8. will upgrade see what happens

ahh goood. ok next thing, i get the unit test up and running, the one which recognises Const. then we can sign this one off. lkcl@fizzy:~/src/libreriscv/ieee754fpu/src$ python3 ieee754/part_shift/formal/proof_shift_scalar.py {8: (slice (sig gates) 0:1), 16: (slice (sig gates) 1:2)} .{8: (slice (sig gates) 0:1), 16: (slice (sig gates) 1:2)} . ---------------------------------------------------------------------- Ran 2 tests in 0.307s OK lkcl@fizzy:~/src/libreriscv/ieee754fpu/src$ yosys -V Yosys 0.9+1706 (git sha1 c7af1b22, clang 7.0.0-svn342187-1~exp1~20180919215158.32 -fPIC -Os)

(In reply to Luke Kenneth Casson Leighton from comment #42) > ahh goood. > > ok next thing, i get the unit test up and running, the one which recognises > Const. ok i got it to recognise Signal() i'm happy with it. next :)