Bug 189 - Create partitioned right shift using the existing partitioned left shift
Summary: Create partitioned right shift using the existing partitioned left shift
Status: PAYMENTPENDING FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version: unspecified
Hardware: PC Windows
: --- enhancement
Assignee: Michael Nolan
URL:
Depends on:
Blocks: 186
  Show dependency treegraph
 
Reported: 2020-02-24 23:48 GMT by Michael Nolan
Modified: 2020-12-02 16:05 GMT (History)
4 users (show)

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 150
budget (EUR) for this task, excluding subtasks' budget: 150
parent task for budget allocation: 48
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
screenshot of yosys (54.80 KB, image/jpeg)
2020-02-26 18:04 GMT, Luke Kenneth Casson Leighton
Details
screenshot of yosys (2) (29.86 KB, image/jpeg)
2020-02-27 17:35 GMT, Luke Kenneth Casson Leighton
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Nolan 2020-02-24 23:48:07 GMT
See discussion in http://bugs.libre-riscv.org/show_bug.cgi?id=186#c10
Comment 1 Michael Nolan 2020-02-26 16:59:00 GMT
This is complete in e42917b.

The dynamic shift right turned out to be a little more complicated than simply bit reversing the input and output, as the shift amounts had to be reordered as well.
Comment 2 Luke Kenneth Casson Leighton 2020-02-26 18:04:55 GMT
Created attachment 26 [details]
screenshot of yosys

(In reply to Michael Nolan from comment #1)
> This is complete in e42917b.
> 
> The dynamic shift right turned out to be a little more complicated than
> simply bit reversing the input and output, as the shift amounts had to be
> reordered as well.

oh that's interesting.  ohh yes, for example, the order of the list of
shift amounts has to be reversed... yeah tricky :)

there's a couple of FIXMEs i left for you.  one is a reset_less=True
that was left out: the other is a chain of expressions using Mux()
rather than assigning to a Signal that then goes into the Mux() on
the next loop.

el[0] = Signal()
el[1] = Mux(Signal(), el[0])
el[2] = Mux(Signal(), Mux(el[1], el[0]))
el[3] = Mux(Signal(), Mux(el[2], Mux(el[1], el[2])))

which wasn't what you wanted, but was what you got :)

you wanted:

el[0] = Signal()
el[1] = Mux(Signal(), el[0])
el[2] = Mux(Signal(), el[1])
el[3] = Mux(Signal(), el[2])
...
...

you can see from the yosys screenshot where the mux + mux-mux + mux-mux-mux +
mux-mux-mux-mux + .... ends up

i fixed a number of these a couple weeks back, which deprived you of the opportunity to learn the importance of this one (apologies).  so this
time i've just put comments in.

i leave you to sort that out?

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_shift/part_shift_dynamic.py;h=5e2c4406a175f8828a4490458be17c862c22041d;hb=8694385463d9ebf0cf476e89b997a88f5dd9c91f#l208
Comment 3 Luke Kenneth Casson Leighton 2020-02-27 17:35:56 GMT
Created attachment 27 [details]
screenshot of yosys (2)

ah, nice.  see how the cascade is no longer an O(N^2) thing?
just a simple O(N) one instead.

if you always do a "view" on the ilang file, even during
partial development, and make sure to name things so that
it's possible to track from the python code to the graphviz,
then "oink what the heck is that mess" becomes an immediate
red flag.

next one, ROR? :)  or shall we leave that one to be a
micro-op, put data twice through the pipelines as a
FSM?  hmm, back to the other bugreport i think
Comment 4 Michael Nolan 2020-02-27 21:45:19 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)
> Created attachment 27 [details]
> screenshot of yosys (2)
> 
> ah, nice.  see how the cascade is no longer an O(N^2) thing?
> just a simple O(N) one instead.

Ah yes I see now. However it looks like yosys is able to optimize the first version into the second one, they both end up as the same number of gates.

> 
> next one, ROR? :)  or shall we leave that one to be a
> micro-op, put data twice through the pipelines as a
> FSM?  hmm, back to the other bugreport i think

ROR shouldn't be as common as SHL and SHR in normal code right? I think doing it in microcode would be fine, at least for the first iteration.
Comment 5 Luke Kenneth Casson Leighton 2020-02-27 21:47:59 GMT
(In reply to Michael Nolan from comment #4)

> > ah, nice.  see how the cascade is no longer an O(N^2) thing?
> > just a simple O(N) one instead.
> 
> Ah yes I see now. However it looks like yosys is able to optimize the first
> version into the second one, they both end up as the same number of gates.

yyeah, not completely happy relying on that happening.

> > next one, ROR? :)  or shall we leave that one to be a
> > micro-op, put data twice through the pipelines as a
> > FSM?  hmm, back to the other bugreport i think
> 
> ROR shouldn't be as common as SHL and SHR in normal code right? 

it's more for crypto and other stuff.  however there's that POWER ISA
opcode

> I think
> doing it in microcode would be fine, at least for the first iteration.

see how it goes.
Comment 6 Jacob Lifshay 2020-02-27 21:50:08 GMT
(In reply to Michael Nolan from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > next one, ROR? :)  or shall we leave that one to be a
> > micro-op, put data twice through the pipelines as a
> > FSM?  hmm, back to the other bugreport i think
> 
> ROR shouldn't be as common as SHL and SHR in normal code right? I think
> doing it in microcode would be fine, at least for the first iteration.

If we do it as microcode, we should try to have it still be a constant-time operation, since it's commonly assumed to be constant-time by crypto code.
Comment 7 Michael Nolan 2020-02-27 22:10:09 GMT
(In reply to Jacob Lifshay from comment #6)

> If we do it as microcode, we should try to have it still be a constant-time
> operation, since it's commonly assumed to be constant-time by crypto code.

Rol ra, rb, rc getting translated to

shl tmp1, rb, rc
shr tmp2, rb, (32-rc)
or  ra, tmp1, tmp2

should be constant time with respect to the data being manipulated. The actual timing probably depends on what else is in the pipe, but that's true of other instructions as well.
Comment 8 Luke Kenneth Casson Leighton 2020-02-27 22:27:22 GMT
(In reply to Michael Nolan from comment #7)

> Rol ra, rb, rc getting translated to
> 
> shl tmp1, rb, rc
> shr tmp2, rb, (32-rc)
> or  ra, tmp1, tmp2
> 
> should be constant time with respect to the data being manipulated. 

so there are a couple ways to do the micro-coding, one of them being to
have "spare" inputs (extra Function Units) to the pipelines, which
the normal instruction issue engine does not have access to.

the output from the shl (tmp1) goes into an *extra* FU operand 1 on an OR pipe.
the output tmp2 goes into the operand2 of the OR pipeline's extra FU.

finally the output from the OR goes into ra, in the *normal* result system.

it's basically hard-coded micro-code, rather than "programmable" micro-code.


> The
> actual timing probably depends on what else is in the pipe, but that's true
> of other instructions as well.

please, nobody mention spectre or other timing-related attacks.  sigh.

basically in the above example, if you do enough OR operations, shr or
shl operations, you can tell statistically how many ROR operations there are.

it's... information leakage, and the only way to stop it is to have more
resources available than can be issued.

in any combination.

i am not hugely keen on doing that kind of analysis at this stage!
Comment 9 Jacob Lifshay 2020-02-27 23:03:14 GMT
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Michael Nolan from comment #7)
> 
> > Rol ra, rb, rc getting translated to
> > 
> > shl tmp1, rb, rc
> > shr tmp2, rb, (32-rc)
> > or  ra, tmp1, tmp2
> > 
> > should be constant time with respect to the data being manipulated. 
> 
> so there are a couple ways to do the micro-coding, one of them being to
> have "spare" inputs (extra Function Units) to the pipelines, which
> the normal instruction issue engine does not have access to.
> 
> the output from the shl (tmp1) goes into an *extra* FU operand 1 on an OR
> pipe.
> the output tmp2 goes into the operand2 of the OR pipeline's extra FU.
> 
> finally the output from the OR goes into ra, in the *normal* result system.
> 
> it's basically hard-coded micro-code, rather than "programmable" micro-code.

sounds good!

> 
> 
> > The
> > actual timing probably depends on what else is in the pipe, but that's true
> > of other instructions as well.
> 
> please, nobody mention spectre or other timing-related attacks.  sigh.

I was referring to data-dependent timing, which is quite easy to get right in this case (don't do things like translating to a variable number of 1-bit rotations).

With regards to spectre, we should still try to mitigate it in cases where it's relatively easy and doesn't have a major performance loss (not this specific case of scheduling bitwise-or instructions).