Bug 120 - implement RISC-V FSGNJ instruction
Summary: implement RISC-V FSGNJ instruction
Status: PAYMENTPENDING FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Michael Nolan
URL:
Depends on:
Blocks: 48
  Show dependency treegraph
 
Reported: 2019-07-28 20:35 BST by Luke Kenneth Casson Leighton
Modified: 2020-02-09 20:48 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 100
budget (EUR) for this task, excluding subtasks' budget: 100
parent task for budget allocation: 48
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
screenshot of fsgnj (20.51 KB, image/jpeg)
2020-01-27 13:42 GMT, Luke Kenneth Casson Leighton
Details
yosys screenshot (27.72 KB, image/jpeg)
2020-01-27 15:55 GMT, Luke Kenneth Casson Leighton
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2019-07-28 20:35:35 BST
FSGNJ is needed, 16/32/64-bit, all three versions: MV, ABS, NEG.
Comment 1 Jacob Lifshay 2019-07-28 23:40:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> FSGNJ is needed, 16/32/64-bit, all three versions: MV, ABS, NEG.

note that fsgnj* can do more than just mv, abs, and neg -- we need to make sure that we implement more than just those pseudo-instructions

the implementation seems utterly trivial, just wires a not gate and an xor gate
Comment 2 Michael Nolan 2020-01-26 18:22:07 GMT
I'd like to take a stab at this, however I'm not sure where to put it. Should it be made a separate module in the ieee754 directory or should it go somewhere else (like part of an existing module?)

Also, I've been looking through the FPU modules and I've noticed there are two "styles" of modules:

 - Pipelined - like fpmul/pipeline.py
 - State machine - like fpmul/fmul.py

Since this module is only a couple of gates and can be combinatorial, I'd imagine it should probably have a similar interface to the state machine variety, but I'm not sure.
Comment 3 Jacob Lifshay 2020-01-26 18:40:26 GMT
(In reply to Michael Nolan from comment #2)
> Also, I've been looking through the FPU modules and I've noticed there are
> two "styles" of modules:
> 
>  - Pipelined - like fpmul/pipeline.py
>  - State machine - like fpmul/fmul.py
> 
> Since this module is only a couple of gates and can be combinatorial, I'd
> imagine it should probably have a similar interface to the state machine
> variety, but I'm not sure.

From what I recall, the state-machine modules were added to serve as a reference for conversion to pipelined form. The RISC-V sign-manipulation instructions are simple enough that they can be a single-stage pipeline.
Comment 4 Luke Kenneth Casson Leighton 2020-01-26 18:57:17 GMT
(In reply to Michael Nolan from comment #2)
> I'd like to take a stab at this, 

cool!  this is pretty easy, not rocket science.

> however I'm not sure where to put it.

start cookie-cut from e.g. the fclass one or the fcvt int to float one.

> Should it be made a separate module in the ieee754 directory 

yes.

> or should it go
> somewhere else (like part of an existing module?)

no, see hdl_workflow
 
> Also, I've been looking through the FPU modules and I've noticed there are
> two "styles" of modules:
> 
>  - Pipelined - like fpmul/pipeline.py
>  - State machine - like fpmul/fmul.py

yes. right. i was experimenting. a lot. the idea was, FSM or pipeline, it's not your decision, you just create combinatorial building blocks and the user decides how to join them together.

it worked, but got complicated and i left it for another day. leave FSM for now.

 
> Since this module is only a couple of gates and can be combinatorial, I'd
> imagine it should probably have a similar interface to the state machine
> variety, but I'm not sure.

you'll see on inspecting the (simpler) fcvts that it's mostly setup blurb.

gimme sec let me find some links on git.libre-riscv.org and illustrate
Comment 5 Luke Kenneth Casson Leighton 2020-01-26 18:59:24 GMT
(In reply to Jacob Lifshay from comment #3)

> From what I recall, the state-machine modules were added to serve as a
> reference for conversion to pipelined form. 

pretty much. i used them as stepping stones in a looong incremental conversion process, making sure there was always something to test at all times.  it was a pig :)
 
> The RISC-V sign-manipulation
> instructions are simple enough that they can be a single-stage pipeline.

yes.
Comment 6 Luke Kenneth Casson Leighton 2020-01-26 19:30:36 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fcvt/int2float.py;h=fcf3e34cf3025b02995b76251d363834640b7deb;hb=HEAD

right.  copy that, aaall of it, call it fcvt.py or something in a new subdirectory ieee754/fcvt, rename as appropriate (global search replace).  replace lines 32 to 119.

lines 18 and 19 set up a 2 bit op comment, 0x3 == 0b00 / 0b01 / 0b10 for mv, abs, neg.

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fcvt/pipeline.py;h=5e6cc19dee7f98c971d1ea7e2266339b4bf4d2ec;hb=HEAD

copy this except delete lines 117 onwards and search/replace FPCVTF2IntMuxInOut with  the new fcvt class name.

i am goung to have to think how the classes stack together (it's been a while) leave that with me

an __init__.py is necessary.

the test dir gets copied too, cooy just this file

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fcvt/test/test_fcvt_f2int_pipe.py;h=4e01d8c2bc6915a6465162eae48decdeb710645f;hb=HEAD

now you should notice there, opwidth has to be set to 2 because, duh, you have 3 ops, 0b00 0b01 and 0b10

and to test them each you have to set opcode=....

the callback function in the unit test obviously has to change.

if you do this:
$ python3
>>> from sfpy import Float32
>>> x = Float32(1.0)
>>> dir(x)

you will get a list of functions, you're looking for bits (for mv), a sgn function and an abs function.

those give you the function to call in the test converters see lines 10 to 68 all of which you replace with fsgn_mv_f32_to_f32 etc etc etc etc.
Comment 7 Michael Nolan 2020-01-26 22:20:45 GMT
Thanks for the help! I have something working more or less, but I'd like to formally verify it too.
Comment 8 Luke Kenneth Casson Leighton 2020-01-26 22:34:08 GMT
(In reply to Michael Nolan from comment #7)
> Thanks for the help! I have something working more or less,

cool that was quick. mind you i only allocated a small budget to it precisely because it was likely to be.

i'd say commit and push it however i need your agreement (on-list) to the charter, first.

if you can do that first, then i can add your ssh key, then push, then review, then approve, and then we can set you up an RfP.

apologies it's quite a few hoops, i am delighted that someone else is happy to work on the FPU.

> but I'd like to
> formally verify it too.

that's a little trickier as it's outside of everyone currently in the teams' experience. and also a separate bugreport. and a different Grant etc etc
Comment 9 Michael Nolan 2020-01-26 23:40:02 GMT
> i need your agreement (on-list) to the charter, first.

I (just now) replied to the HDL Workflow thread with my agreement.

> that's a little trickier as it's outside of everyone currently in the teams' experience.

I should probably have mentioned this in my intro post, but I have some experience with formal verification (the yosys flavor anyway). Provided the pipeline control signals don't screw things up too much, it should be pretty straightforward.
Comment 10 Luke Kenneth Casson Leighton 2020-01-27 00:09:38 GMT
(In reply to Michael Nolan from comment #9)
> > i need your agreement (on-list) to the charter, first.
> 
> I (just now) replied to the HDL Workflow thread with my agreement.

appreciated.

> > that's a little trickier as it's outside of everyone currently in the teams' experience.
> 
> I should probably have mentioned this in my intro post, but I have some
> experience with formal verification (the yosys flavor anyway). Provided the
> pipeline control signals don't screw things up too much, it should be pretty
> straightforward.

niiice. that'll come in real handy.

you'll be interested to know that one of the pipelines uses nmigen FIFO, and that that does actually have a formal proof in nmigen as a unit test.  it's worth taking a look at.

there is a lot to do here which is why formal proofs has its own entire budget.

let's keep this particular bugreport on topic though.
Comment 11 Michael Nolan 2020-01-27 00:32:34 GMT
> let's keep this particular bugreport on topic though.

Alrighty
Comment 12 Luke Kenneth Casson Leighton 2020-01-27 11:06:50 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=commitdiff;h=2352155642c82227e24e8782c6de61601938aa56

that's funny, i watched the progression of the commits, i wondered how long it would take to work out you could index by -1 to get at the sign bit for arbitrary float length :)

there are other classes in FPBase i think which calculate the mantissa and exponent offset and length etc. so that you don't have to hardcode those values.

also if we want to FP8, FP80 or FP128 in the future no code has to change except a one-line addition.

so, the last bit here: to check in the RISCV spec as to whether normalisation needs to occur, certainly for F.MV i know it doesn't however for other ops i am not sure.

one way to check is to track down the equivalent code for rocket-chip.
Comment 13 Luke Kenneth Casson Leighton 2020-01-27 11:20:10 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fsgnj/fsgnj.py;h=c43e180c0c55d6e12ee081a5b6e236aad6a1e554;hb=2352155642c82227e24e8782c6de61601938aa56

nice. have you done this before? :)

some comments on lines 44 46 48 for each of the opcodes would be nice.

and one around 42, "first do sign" or something then line 52, "then copy mantissa/exp unmodified"

oh, also, line 41, Signals are all resetless.  the reason is because they are combinatorial blocks, where the latches (pipelines) are reset.

actually because of the way the 6600 works they don't need it.  the 6600 DMs isolate garbage during init.
Comment 14 Luke Kenneth Casson Leighton 2020-01-27 11:39:37 GMT
https://github.com/chipsalliance/rocket-chip/blob/master/src/main/scala/tile/FPU.scala#L514

class FPtoFP.  what's going on there with widening-conversions, i have no idea, given that the spec specifically says that NaN payloads are preserved

relevant part of spec:

Floating-point to floating-point sign-injection instructions, FSGNJ.S, FSGNJN.S, and FSGNJX.S,
produce a result that takes all bits except the sign bit from rs1. For FSGNJ, the result'ss sign bit is
rs2'ss sign bit; for FSGNJN, the result'ss sign bit is the opposite of rs2'ss sign bit; and for FSGNJX,
the sign bit is the XOR of the sign bits of rs1 and rs2. Sign-injection instructions do not set
floating-point exception flags, nor do they canonicalize NaNs. Note, FSGNJ.S rx, ry, ry moves ry
to rx (assembler pseudoinstruction FMV.S rx, ry); FSGNJN.S rx, ry, ry moves the negation of ry
to rx (assembler pseudoinstruction FNEG.S rx, ry); and FSGNJX.S rx, ry, ry moves the absolute
value of ry to rx (assembler pseudoinstruction FABS.S rx, ry).
Comment 15 Luke Kenneth Casson Leighton 2020-01-27 12:29:08 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fsgnj/pipeline.py;h=63f3a8a089072db75affe5ccf5fea4faf90c42bc;hb=HEAD

cool!  you worked out how to remove the denormalisation phase already!
Comment 16 Luke Kenneth Casson Leighton 2020-01-27 13:13:56 GMT
*click*, apologies michael: i just realised something.  although Switch/Case
statements ultimately map down to Mux() or in rare cases PMux, for the FPU code
we can't use them.... or If() statements (basically anything with a "with" 
context).

the reason is down to the deployment of the PartitionedSignal class, which
will perform operations in parallel.  to support m.Case(), m.If() etc. we
would need to do a seeerious amount of coding, possibly even modifying
nmigen itself.

by keeping to Mux as a decision-maker plus basic arithmetic operators we
can write a parallel-PartitionSignal-aware-Mux() plus parallel-aware arith
and the job of converting all code to dynamic SIMD is a by-the-numbers
process pretty much replacing Signal with PartitionedSignal across all FP
classes.

apologies i forgot to mention that.
Comment 17 Michael Nolan 2020-01-27 13:38:08 GMT
> the reason is down to the deployment of the PartitionedSignal class, which
> will perform operations in parallel.  to support m.Case(), m.If() etc. we
> would need to do a seeerious amount of coding, possibly even modifying
> nmigen itself.

So what you're saying is: If(), Switch() and friends are fine for modules that are strictly scalar, but will not work if the module is converted to SIMD.

> apologies i forgot to mention that.

All good
Comment 18 Luke Kenneth Casson Leighton 2020-01-27 13:42:32 GMT
Created attachment 19 [details]
screenshot of fsgnj
Comment 19 Luke Kenneth Casson Leighton 2020-01-27 13:49:38 GMT
(In reply to Michael Nolan from comment #17)
> > the reason is down to the deployment of the PartitionedSignal class, which
> > will perform operations in parallel.  to support m.Case(), m.If() etc. we
> > would need to do a seeerious amount of coding, possibly even modifying
> > nmigen itself.
> 
> So what you're saying is: If(), Switch() and friends are fine for modules
> that are strictly scalar, but will not work if the module is converted to
> SIMD.

correct.  the partition mask will at some point be passed in through a
Mix-In class such that there should be very little in the way of changes.

> > apologies i forgot to mention that.
> 
> All good

btw if you want to see the actual gate-level graph, which i highly
recommend getting into the habit of doing, install graphviz and xdot
and use this:

    yosys
    > read_ilang test_fsgnj_something_something.il
    > show top

or type "show <tab> <tab>" to get a list of modules.  the one you want
is fsgnj$4

you can see it here:
http://bugs.libre-riscv.org/attachment.cgi?id=19

what you've written basically falls very neatly into the "acceptably
small size range such that it's easy to see what's going on and also
will be dead-easy to do the layout for using alliance/coriolis2"

the block "proc group_0" is where the switch/case statement is, which
needs replacing with some Mux()es.  those will basically be:

   Mux(op[1],
       Mux(op[0], 0b11 calculation, 0b10 calculation),
       Mux(op[0], 0b01 calculation, 0b00 calculation)
      )

something like that, where 0b11 will be "default" err i think.
Comment 20 Michael Nolan 2020-01-27 15:36:18 GMT
> correct.  the partition mask will at some point be passed in through a
> Mix-In class such that there should be very little in the way of changes.

That's pretty neat!

> the block "proc group_0" is where the switch/case statement is, which
> needs replacing with some Mux()es.  those will basically be:

>   Mux(op[1],
>       Mux(op[0], 0b11 calculation, 0b10 calculation),
>       Mux(op[0], 0b01 calculation, 0b00 calculation)
>      )

I added something like this on my latest commit. However I defined the behavior of opcode 0b11 to be the same as 0b10 (Saving 1 mux in the process). Since the RISCV spec doesn't define the behavior of opcode 0b11, this seems safe to do (right? Rocket does it)
Comment 21 Luke Kenneth Casson Leighton 2020-01-27 15:55:24 GMT
Created attachment 20 [details]
yosys screenshot
Comment 22 Luke Kenneth Casson Leighton 2020-01-27 16:08:01 GMT
(In reply to Michael Nolan from comment #20)
> > correct.  the partition mask will at some point be passed in through a
> > Mix-In class such that there should be very little in the way of changes.
> 
> That's pretty neat!

yeah :) testing's going to be a pig.  far more cases... *sigh* :)

> > the block "proc group_0" is where the switch/case statement is, which
> > needs replacing with some Mux()es.  those will basically be:
> 
> >   Mux(op[1],
> >       Mux(op[0], 0b11 calculation, 0b10 calculation),
> >       Mux(op[0], 0b01 calculation, 0b00 calculation)
> >      )
> 
> I added something like this on my latest commit. However I defined the
> behavior of opcode 0b11 to be the same as 0b10 (Saving 1 mux in the
> process). Since the RISCV spec doesn't define the behavior of opcode 0b11,
> this seems safe to do (right? Rocket does it)

yehyeh makes perfect sense to me.

okaaay, so this gets interesting.  take a look at the second
screenshot (after doing a git pull and re-run the unit tests
you can generate it yourself if you prefer).

http://bugs.libre-riscv.org/attachment.cgi?id=20

can you see that there's some spurious inexplicable maths
going on, there, compared to the first graphviz?

http://bugs.libre-riscv.org/attachment.cgi?id=19

that extra "stuff" is because inside the FPNumDecode, for
convenience i got it to subtract the "offset" that's part
of an exponent, and the "create()" function adds it back on.

so if you do yosys "show sc_decode_a" you'll find a whole stack
of extra stuff, none of which is actually necessary.

we know it's not necessary because the previous version - without
FPNumDecode - worked perfectly well and pass all unit tests.

this is why i emphasise in https://libre-riscv.org/HDL_workflow/
that it's essential to run the yosys "show" command as part of the
workflow.  you get to see the actual graph - the actual gates
created - and go "oink?? what's that doing in there??"

:)


second (minor) thing, you'd created a Signal "sign" and then had
immediately overwritten it with a python variable *also* named
"sign" which was the result of the Mux().

this seems to be a common source of confusion in nmigen, i've seen
this done before: a python variable is created and then overwritten.
remember that nmigen creates ASTs (abstract syntax trees) and so
anything added to m.d.comb (or m.d.sync) is completely unused.


third thing: again, which i spotted only after looking at the
graphviz: Mux(opcode[0], ~b.s, b.s) was extremely puzzling
in the graph however is quite logical in the code.

i thought about it and realised that the Mux could be replaced with
an XOR gate.  sign = opcode[0] ^ ~b.s

however that's really not very obvious so i provided a (terse)
explanatory comment.


i'd suggest getting rid of FPNumDecode, and returning to b[-1] etc.
can i leave that with you?
Comment 23 Michael Nolan 2020-01-27 17:02:23 GMT
> okaaay, so this gets interesting.  take a look at the second
> screenshot (after doing a git pull and re-run the unit tests
> you can generate it yourself if you prefer).
>
> http://bugs.libre-riscv.org/attachment.cgi?id=20
>
> can you see that there's some spurious inexplicable maths
> going on, there, compared to the first graphviz?
>
> http://bugs.libre-riscv.org/attachment.cgi?id=19
>
> that extra "stuff" is because inside the FPNumDecode, for
> convenience i got it to subtract the "offset" that's part
> of an exponent, and the "create()" function adds it back on.
>
> so if you do yosys "show sc_decode_a" you'll find a whole stack
> of extra stuff, none of which is actually necessary.
>
> we know it's not necessary because the previous version - without
> FPNumDecode - worked perfectly well and pass all unit tests.
>
> this is why i emphasise in https://libre-riscv.org/HDL_workflow/
> that it's essential to run the yosys "show" command as part of the
> workflow.  you get to see the actual graph - the actual gates
> created - and go "oink?? what's that doing in there??"

I'll admit, I did run the yosys show, but was only looking at the Mux
part, and didn't notice that extra add.

I'm a little surprised that part doesn't get optimized away (I played
around with flattening fsignj and running yosys' optimizations on
it). I
wonder if it's because FPNumDecode adds some extra bits to the
exponant
when it decodes it.


>
> second (minor) thing, you'd created a Signal "sign" and then had
> immediately overwritten it with a python variable *also* named
> "sign" which was the result of the Mux().
>
> this seems to be a common source of confusion in nmigen, i've seen
> this done before: a python variable is created and then overwritten.
> remember that nmigen creates ASTs (abstract syntax trees) and so
> anything added to m.d.comb (or m.d.sync) is completely unused.

This was why I was looking at the graph to see if the mux was actually
there. Though it and
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpmul/specialcases.py;h=066d2b9f0fc15dd153cb1c7b262901636b7fe344;hb=HEAD
(line 88) seem to work ok, is that because the result variable is
added to the nmigen ast in the "m.d.comb +="


> third thing: again, which i spotted only after looking at the
> graphviz: Mux(opcode[0], ~b.s, b.s) was extremely puzzling
> in the graph however is quite logical in the code.
>
> i thought about it and realised that the Mux could be replaced with
> an XOR gate.  sign = opcode[0] ^ ~b.s
>
> however that's really not very obvious so i provided a (terse)
> explanatory comment.

Note, this should probably read (and does in fsgnj.py) as `sign =
opcode[0] ^ b.s`

This doesn't seem any more obvious in the graph to me than the Mux and
invert so idk.

> i'd suggest getting rid of FPNumDecode, and returning to b[-1] etc.
> can i leave that with you?

Sure 


Edit: Whoops, looks like I replied to the mailing list post, not the bugzilla one
Comment 24 Michael Nolan 2020-01-27 17:08:54 GMT
> second (minor) thing, you'd created a Signal "sign" and then had
> immediately overwritten it with a python variable *also* named
> "sign" which was the result of the Mux().

Oh, I see what you mean now.

    sign = Signal()        # 1
    sign = Mux(op[0],...)  # 2
    sign = Mux(op[1],...)  # 3

I thought you were referring to lines 2 and 3 overwriting sign, not line 1 being overwritten.