3-operand FMAC (multiply-and-accumulate) needed: FP16/32/64, with unit tests. https://github.com/gilani/fpfma/blob/fpfma_clean/fpfma.v
should be part of fpmul

the simd integer multiplier needs modification to support it, or we can just
add an adder after
(In reply to Jacob Lifshay from comment #1)
> should be part of fpmul

Yes.

> the simd integer multiplier needs modification to support it, or we can just
> add an adder after

Hm hm, don't know. Might be simpler to do that as a first option, and optimise
later under separate funding / bugs.

I will see if something can be thrown together with the existing pipeline
building blocks, without having to do full normalisation and rounding followed
immediately by denormalisation.

The post-normalisation stages of both mul and add are designed to output more
bits to the rounding phase: it might be possible to just extend the add number
to the same (nonstandard) mantissa width and go from there.
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > should be part of fpmul
>
> Yes.
>
> > the simd integer multiplier needs modification to support it, or we can just
> > add an adder after

meant adding an integer adder right after the integer mul. it needs to be:
a 53*3-bit-wide adder for fp64
24*3 for fp32

needed to handle the case of a*b + c where a = 0.00xxxx, b = 0.00yyyy, c =
0.zzzz: the unrounded result is 0.zzzzpppppppp where a * b = 0.0000pppppppp

rounding and normalization need to take all of those bits into account
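A rough, illustrative sketch of that width argument in plain Python (the
constants and names here are made up for illustration, not taken from the
fpfma or hardfloat code):

SIG = 53                        # FP64 significand width
prod_width = 2 * SIG            # a*b occupies up to 2*SIG bits
# worst case for widening: c's significand sits entirely above the a*b
# product bits, so the aligned, unrounded sum spans roughly 3*SIG bits
c_sig = (1 << SIG) - 1          # a full c mantissa
prod = (1 << prod_width) - 1    # a full a*b product
aligned_sum = (c_sig << prod_width) | prod
print(aligned_sum.bit_length())  # 159 == 3*53: all of it feeds rounding

i.e. none of the low product bits can be discarded before rounding decides
the final result.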
(In reply to Jacob Lifshay from comment #3)
> meant adding an integer adder right after the integer mul. it needs to be:
> a 53*3-bit-wide adder for fp64
> 24*3 for fp32
>
> needed to handle the case of a*b + c where a = 0.00xxxx, b = 0.00yyyy, c =
> 0.zzzz: the unrounded result is 0.zzzzpppppppp where a * b = 0.0000pppppppp

zowee, a 53*3-wide add. is the gate latency on that ok? I don't know.

> rounding and normalization need to take all of those bits into account

I'm fairly certain there are optimisations involving the rounding modes, and
depending on whether a*b is larger than c or not (or, whether a-exp plus b-exp
is greater than c-exp).

I took a look at hardfloat-3 and it has something like that, although decoding
what is being done is a different matter: no code comments. Will try a
non-optimal version and see what happens.
https://github.com/ucb-bar/berkeley-hardfloat/blob/master/src/main/scala/MulAddRecFN.scala
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
>
> > meant adding an integer adder right after the integer mul. it needs to be:
> > a 53*3-bit-wide adder for fp64
> > 24*3 for fp32
> >
> > needed to handle the case of a*b + c where a = 0.00xxxx, b = 0.00yyyy, c =
> > 0.zzzz: the unrounded result is 0.zzzzpppppppp where a * b = 0.0000pppppppp
>
> zowee, a 53*3-wide add. is the gate latency on that ok? I don't know.

ignoring wire delay, n-bit addition can be done in O(log n) time and
O(n * log n) space using carry lookahead.

> > rounding and normalization need to take all of those bits into account
>
> I'm fairly certain there are optimisations involving the rounding modes, and
> depending on whether a*b is larger than c or not (or, whether a-exp plus
> b-exp is greater than c-exp).

yeah, but the hw needs to handle the worst case, which is as above.

don't forget that we need to handle a * b - c as well.
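A throwaway sketch of that parallel-prefix carry-lookahead idea in plain
Python (the function name and the 64-bit width are mine, not from any of the
code above): generate/propagate pairs merge over doubling distances, so the
carry chain is resolved in log2(n) levels.

def prefix_add(a, b, width=64):
    # per-bit generate and propagate
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    d = 1
    while d < width:
        # merge each (g, p) with the group d positions below: log2(width) levels
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(width)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(width)]
        d *= 2
    # carry into bit i is the group-generate of bits 0..i-1 (carry-in = 0)
    c = [0] + g[:width - 1]
    return sum(((((a >> i) ^ (b >> i) ^ c[i]) & 1) << i) for i in range(width))

assert prefix_add(0x123456789, 0x987654321) == 0x123456789 + 0x987654321

Laid out as hardware, those per-level combines become gates: O(n log n) area
at O(log n) depth, which is where the figures above come from.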
http://www.jhauser.us/arithmetic/HardFloat-1/doc/HardFloat-Verilog.html
http://www.jhauser.us/arithmetic/HardFloat.html

the source is a .zip archive, where mulAddRecFn.v can be found, and it's
proven and correct. i can do a conversion to nmigen, keeping the logic intact;
it's only 450 lines or so, and we'll not need to do "research".

also i just spotted that there's a really clean "rounding" function which will
be extremely useful.
# XXX check! {doSubMags ? ~sigC : sigC,
#             {(sigSumWidth - sigWidth + 2){doSubMags}}};
extComplSigC.eq(Cat((sigSumWidth - sigWidth + 2){doSubMags}},
                    Mux(doSubMags, ~sigC, sigC))),

anyone know what this translates to?
(In reply to Luke Kenneth Casson Leighton from comment #8)
> # XXX check! {doSubMags ? ~sigC : sigC,
> #             {(sigSumWidth - sigWidth + 2){doSubMags}}};
> extComplSigC.eq(Cat((sigSumWidth - sigWidth + 2){doSubMags}},
>                     Mux(doSubMags, ~sigC, sigC))),
>
> anyone know what this translates to?

I think that's a repeat (Verilog's replication operator).
(In reply to Jacob Lifshay from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > # XXX check! {doSubMags ? ~sigC : sigC,
> > #             {(sigSumWidth - sigWidth + 2){doSubMags}}};
> > extComplSigC.eq(Cat((sigSumWidth - sigWidth + 2){doSubMags}},
> >                     Mux(doSubMags, ~sigC, sigC))),
> >
> > anyone know what this translates to?
>
> I think that's a repeat (Verilog's replication operator).

yeh, i looked it up; fortunately i found a stackexchange resource that used
double-brackets like this, so i did this:

# XXX check! {doSubMags ? ~sigC : sigC,
#             {(sigSumWidth - sigWidth + 2){doSubMags}}};
sc = [doSubMags] * (sigSumWidth - sigWidth + 2) + \
     [Mux(doSubMags, ~sigC, sigC)]
extComplSigC.eq(Cat(*sc))

so it's 1 bit's worth of doSubMags, repeated ssw-sw+2 times, with sigC tacked
onto the end (the MSB end), inverted if doSubMags is true. possible values
(sketching sigC as a single bit):

0b0111111
0b1111111
0b1000000
0b0000000

something like that
(In reply to Luke Kenneth Casson Leighton from comment #10)
> yeh, i looked it up; fortunately i found a stackexchange resource that used
> double-brackets like this, so i did this:
>
> # XXX check! {doSubMags ? ~sigC : sigC,
> #             {(sigSumWidth - sigWidth + 2){doSubMags}}};
> sc = [doSubMags] * (sigSumWidth - sigWidth + 2) + \
>      [Mux(doSubMags, ~sigC, sigC)]
> extComplSigC.eq(Cat(*sc))

nmigen has a Repl operator that makes the result much cleaner. It translates
directly to Verilog's replication construct.
(In reply to Jacob Lifshay from comment #11)
> nmigen has a Repl operator that makes the result much cleaner. It translates
> directly to Verilog's replication construct.

rats! i keep forgetting about that one. there's several places now where i
used [x] * w. thanks for the reminder.
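For reference, a minimal sketch of what the Repl-based version of that
assignment might look like (the widths below are placeholders for
illustration, not hardfloat's actual sigSumWidth):

from nmigen import Module, Signal, Cat, Repl, Mux

sigWidth = 53                          # placeholder: FP64 significand
sigSumWidth = 3 * sigWidth + 3         # placeholder width only

m = Module()
doSubMags = Signal()
sigC = Signal(sigWidth + 1)
extComplSigC = Signal((sigSumWidth - sigWidth + 2) + (sigWidth + 1))

m.d.comb += extComplSigC.eq(Cat(
    Repl(doSubMags, sigSumWidth - sigWidth + 2),   # LSBs: doSubMags repeated
    Mux(doSubMags, ~sigC, sigC)))                  # MSB end: sigC, inverted on subtract

Cat is LSB-first, so this keeps the same bit ordering as the [x] * w list
version above.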
See https://docs.rs/simple-soft-float/0.1.0/simple_soft_float/struct.Float.html#method.fused_mul_add for the docs for the fused-mul-add implementation.
*** This bug has been marked as a duplicate of bug 877 ***