See discussion in http://bugs.libre-riscv.org/show_bug.cgi?id=186#c10
This is complete in e42917b. The dynamic shift right turned out to be a little more complicated than simply bit reversing the input and output, as the shift amounts had to be reordered as well.
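Roughly, the bit-reverse trick for getting a right shift out of a left shifter works like the following pure-Python sketch; the names bit_reverse, shl, shr_via_shl and the 8-bit WIDTH are illustrative assumptions rather than the actual part_shift_dynamic code. In the partitioned case the list of per-partition shift amounts also has to have its order reversed, which is the extra complication mentioned here.

    # sketch: derive a right shift from a left shifter by bit-reversing
    # the input and the output (illustrative only, not project code)
    WIDTH = 8

    def bit_reverse(x, width=WIDTH):
        return int(format(x, '0%db' % width)[::-1], 2)

    def shl(x, amount, width=WIDTH):
        return (x << amount) & ((1 << width) - 1)

    def shr_via_shl(x, amount, width=WIDTH):
        return bit_reverse(shl(bit_reverse(x, width), amount, width), width)

    assert shr_via_shl(0b10110000, 4) == 0b10110000 >> 4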
Created attachment 26 [details]
screenshot of yosys

(In reply to Michael Nolan from comment #1)
> This is complete in e42917b.
> 
> The dynamic shift right turned out to be a little more complicated than
> simply bit reversing the input and output, as the shift amounts had to be
> reordered as well.

oh that's interesting. ohh yes, for example, the order of the list of shift
amounts has to be reversed... yeah tricky :)

there's a couple of FIXMEs i left for you. one is a reset_less=True that was
left out; the other is a chain of expressions using Mux() rather than
assigning to a Signal that then goes into the Mux() on the next loop:

    el[0] = Signal()
    el[1] = Mux(Signal(), el[0])
    el[2] = Mux(Signal(), Mux(el[1], el[0]))
    el[3] = Mux(Signal(), Mux(el[2], Mux(el[1], el[2])))

which wasn't what you wanted, but was what you got :) you wanted:

    el[0] = Signal()
    el[1] = Mux(Signal(), el[0])
    el[2] = Mux(Signal(), el[1])
    el[3] = Mux(Signal(), el[2])
    ...

you can see from the yosys screenshot where the mux + mux-mux + mux-mux-mux +
mux-mux-mux-mux + .... ends up.

i fixed a number of these a couple weeks back, which deprived you of the
opportunity to learn the importance of this one (apologies). so this time
i've just put comments in. i leave you to sort that out?

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_shift/part_shift_dynamic.py;h=5e2c4406a175f8828a4490458be17c862c22041d;hb=8694385463d9ebf0cf476e89b997a88f5dd9c91f#l208
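The intended pattern looks roughly like the following nmigen-style sketch; the class and signal names are illustrative assumptions, not the actual part_shift_dynamic code. Each stage's Mux() output is assigned to its own intermediate Signal (reset_less=True), and the next loop iteration refers to that Signal rather than to an ever-growing nested Mux expression, keeping the mux count O(N).

    # illustrative sketch of "assign the Mux to a Signal each loop"
    from nmigen import Elaboratable, Module, Signal, Mux

    class MuxChain(Elaboratable):
        def __init__(self, n=4, width=8):
            self.sel = Signal(n)
            self.inp = [Signal(width, name="inp%d" % i) for i in range(n)]
            self.out = Signal(width)

        def elaborate(self, platform):
            m = Module()
            prev = self.inp[0]
            for i in range(1, len(self.inp)):
                # one fresh Signal per stage: O(N) muxes, not O(N^2)
                step = Signal.like(self.out, name="step%d" % i,
                                   reset_less=True)
                m.d.comb += step.eq(Mux(self.sel[i], self.inp[i], prev))
                prev = step
            m.d.comb += self.out.eq(prev)
            return m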
Created attachment 27 [details]
screenshot of yosys (2)

ah, nice. see how the cascade is no longer an O(N^2) thing? just a simple
O(N) one instead.

if you always do a "view" on the ilang file, even during partial development,
and make sure to name things so that it's possible to track from the python
code to the graphviz, then "oink what the heck is that mess" becomes an
immediate red flag.

next one, ROR? :) or shall we leave that one to be a micro-op, put data twice
through the pipelines as a FSM? hmm, back to the other bugreport i think
(In reply to Luke Kenneth Casson Leighton from comment #3)
> Created attachment 27 [details]
> screenshot of yosys (2)
> 
> ah, nice. see how the cascade is no longer an O(N^2) thing?
> just a simple O(N) one instead.

Ah yes I see now. However it looks like yosys is able to optimize the first
version into the second one; they both end up as the same number of gates.

> 
> next one, ROR? :) or shall we leave that one to be a
> micro-op, put data twice through the pipelines as a
> FSM? hmm, back to the other bugreport i think

ROR shouldn't be as common as SHL and SHR in normal code right? I think
doing it in microcode would be fine, at least for the first iteration.
(In reply to Michael Nolan from comment #4)
> > ah, nice. see how the cascade is no longer an O(N^2) thing?
> > just a simple O(N) one instead.
> 
> Ah yes I see now. However it looks like yosys is able to optimize the first
> version into the second one; they both end up as the same number of gates.

yyeah, not completely happy relying on that happening.

> > next one, ROR? :) or shall we leave that one to be a
> > micro-op, put data twice through the pipelines as a
> > FSM? hmm, back to the other bugreport i think
> 
> ROR shouldn't be as common as SHL and SHR in normal code right?

it's more for crypto and other stuff. however there's that POWER ISA opcode.

> I think
> doing it in microcode would be fine, at least for the first iteration.

see how it goes.
(In reply to Michael Nolan from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > next one, ROR? :) or shall we leave that one to be a
> > micro-op, put data twice through the pipelines as a
> > FSM? hmm, back to the other bugreport i think
> 
> ROR shouldn't be as common as SHL and SHR in normal code right? I think
> doing it in microcode would be fine, at least for the first iteration.

If we do it as microcode, we should try to have it still be a constant-time
operation, since it's commonly assumed to be constant-time by crypto code.
(In reply to Jacob Lifshay from comment #6)
> If we do it as microcode, we should try to have it still be a constant-time
> operation, since it's commonly assumed to be constant-time by crypto code.

Rol ra, rb, rc getting translated to

    shl tmp1, rb, rc
    shr tmp2, rb, (32-rc)
    or  ra, tmp1, tmp2

should be constant time with respect to the data being manipulated. The
actual timing probably depends on what else is in the pipe, but that's true
of other instructions as well.
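As a sanity check, the following pure-Python sketch verifies that the shl/shr/or expansion above matches a direct 32-bit rotate-left; the masking and the "% 32" on the shift amount are assumptions added here so the rc == 0 case stays well-defined, and the function name rol32 is purely illustrative.

    def rol32(rb, rc):
        rc %= 32
        tmp1 = (rb << rc) & 0xffffffff   # shl tmp1, rb, rc
        tmp2 = rb >> ((32 - rc) % 32)    # shr tmp2, rb, (32-rc)
        return tmp1 | tmp2               # or  ra, tmp1, tmp2

    assert rol32(0x80000001, 1) == 0x00000003
    assert all(rol32(x, rc) == ((x << rc) | (x >> (32 - rc))) & 0xffffffff
               for x in (0, 1, 0xdeadbeef) for rc in range(1, 32))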
(In reply to Michael Nolan from comment #7)
> Rol ra, rb, rc getting translated to
> 
> shl tmp1, rb, rc
> shr tmp2, rb, (32-rc)
> or ra, tmp1, tmp2
> 
> should be constant time with respect to the data being manipulated.

so there are a couple ways to do the micro-coding, one of them being to
have "spare" inputs (extra Function Units) to the pipelines, which
the normal instruction issue engine does not have access to.

the output from the shl (tmp1) goes into an *extra* FU operand 1 on an OR
pipe.
the output tmp2 goes into the operand2 of the OR pipeline's extra FU.

finally the output from the OR goes into ra, in the *normal* result system.

it's basically hard-coded micro-code, rather than "programmable" micro-code.

> The
> actual timing probably depends on what else is in the pipe, but that's true
> of other instructions as well.

please, nobody mention spectre or other timing-related attacks. sigh.

basically in the above example, if you do enough OR operations, shr or shl
operations, you can tell statistically how many ROR operations there are.
it's... information leakage, and the only way to stop it is to have more
resources available than can be issued, in any combination.

i am not hugely keen on doing that kind of analysis at this stage!
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Michael Nolan from comment #7)
> 
> > Rol ra, rb, rc getting translated to
> > 
> > shl tmp1, rb, rc
> > shr tmp2, rb, (32-rc)
> > or ra, tmp1, tmp2
> > 
> > should be constant time with respect to the data being manipulated.
> 
> so there are a couple ways to do the micro-coding, one of them being to
> have "spare" inputs (extra Function Units) to the pipelines, which
> the normal instruction issue engine does not have access to.
> 
> the output from the shl (tmp1) goes into an *extra* FU operand 1 on an OR
> pipe.
> the output tmp2 goes into the operand2 of the OR pipeline's extra FU.
> 
> finally the output from the OR goes into ra, in the *normal* result system.
> 
> it's basically hard-coded micro-code, rather than "programmable" micro-code.

sounds good!

> > The
> > actual timing probably depends on what else is in the pipe, but that's true
> > of other instructions as well.
> 
> please, nobody mention spectre or other timing-related attacks. sigh.

I was referring to data-dependent timing, which is quite easy to get right in
this case (don't do things like translating to a variable number of 1-bit
rotations). With regards to spectre, we should still try to mitigate it in
cases where it's relatively easy and doesn't have a major performance loss
(not this specific case of scheduling bitwise-or instructions).