analyse how to use mapreduce mode to do dot-product
reduce on FMA might do it. mapreduce on 3-op instructions is insane for most of them, so they're normally excluded: therefore we might as well define the meaning of reduce-mode fma to *be* dot-product.
alexandre says that fma might actually make sense as a 3-op reduce, because of matrix multiply
jacob notes: fma reduction would be a polynomial reduction but it would be a Bad Idea (tm) to implement in hardware
fma can be used in reduce mode for dot-product

but matrix multiply amounts to multiple parallel dot-products, so you
wouldn't want non-reduce fma for that

what I can't quite picture as useful (which is definitely not authoritative)
is reduce on the multiply, rather than on the add.
fma reducing on the multiply can be used for the same case where one would use a reduce-multiply followed by a scalar add, so it is likely useful too
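a rough sketch of one possible reading of "reduce on the multiply" (this interpretation is an assumption, nothing decided): the accumulator chains through a multiply operand of fma(a, b, c) = a*b + c, so with the addend held at zero it degenerates to a plain reduce-multiply, and a final scalar add then gives exactly the "reduce multiply followed by a scalar add" case:

def reduce_fma_on_multiply(bs, c):
    # accumulator v chains through a multiply operand; addend c applied each step
    v = bs[0]
    for x in bs[1:]:
        v = v * x + c
    return v

# with c == 0 this is just a reduce-multiply; adding the scalar afterwards
# recovers "reduce multiply followed by a scalar add"
assert reduce_fma_on_multiply([2, 3, 4], 0) + 5 == 2*3*4 + 5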
reminder of what dot-product is:

result = 0
for i in range(len(a)):
    result += a[i] * b[i]

which makes sense for fma as long as RC starts out as zero.
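written out as an fma chain (plain python, just to illustrate the point about RC): the accumulator sits in the addend position and starts out as zero:

def fma(a, b, c):
    return a * b + c

def dot_product(a, b):
    result = 0                         # RC starts out as zero
    for x, y in zip(a, b):
        result = fma(x, y, result)     # accumulator chained through the addend
    return result

assert dot_product([1, 2, 3], [4, 5, 6]) == 1*4 + 2*5 + 3*6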
(In reply to Alexandre Oliva from comment #4)
> fma can be used in reduce mode for dot-product
>
> but matrix multiply amounts to multiple parallel dot-products, so you
> wouldn't want non-reduce fma for that

in my mind it would make sense simply to do an outer for-loop on a reduce-fma

> what I can't quite picture as useful (which is definitely not authoritative)
> is reduce on the multiply, rather than on the add.

can you write that out in pseudocode?
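to illustrate the "outer for-loop on a reduce-fma" idea (plain python only, not SVP64 semantics): each output element of the matrix multiply is one dot-product, i.e. one reduce-fma:

def dot_fma(a, b):
    acc = 0
    for x, y in zip(a, b):
        acc = x * y + acc              # one reduce-fma per output element
    return acc

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[dot_fma(A[i], [B[k][j] for k in range(inner)])
             for j in range(cols)]
            for i in range(rows)]

assert matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]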
(In reply to Luke Kenneth Casson Leighton from comment #3)
> jacob notes: fma reduction would be a polynomial reduction but it would be
> a Bad Idea (tm) to implement in hardware

*could* be a polynomial reduction:

v = a
v = x * v + b
v = x * v + c
v = x * v + d

produces:

v == d + x * c + x^2 * b + x^3 * a

having fma reduction be a dot product is also valid, easier to implement in
hardware, and more useful:

v = a
v = b * c + v
v = d * e + v
v = f * g + v

v == a + dot(<b, d, f>, <c, e, g>)
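a quick runnable check of both identities (plain python, illustrative only):

def fma(a, b, c):
    return a * b + c

# polynomial (Horner) reduction: accumulator chained through the multiply
a, b, c, d, x = 2, 3, 5, 7, 11
v = a
for coeff in (b, c, d):
    v = fma(x, v, coeff)
assert v == d + x*c + x**2*b + x**3*a

# dot-product reduction: accumulator chained through the addend
pairs = [(2, 3), (4, 5), (6, 7)]
v = 1                                  # the initial "a"
for p, q in pairs:
    v = fma(p, q, v)
assert v == 1 + sum(p*q for p, q in pairs)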
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > jacob notes: fma reduction would be a polynomial reduction but it would
> > be a Bad Idea (tm) to implement in hardware
>
> *could* be a polynomial reduction:
>     v = a
>     v = x * v + b
>     v = x * v + c
>     v = x * v + d
>
> produces:
>     v == d + x * c + x^2 * b + x^3 * a

once a, b and c are factored out, yes. the above is (with substitution):

d + (x * (c + (x * (b + (x * a)))))

which i think may be doable with some overlapping fmas (no reduce required).

the polynomial version: i love it. it's so cool that i think we should give
it a shot. interestingly it may be possible to detect from the src/dest
scalar/vector marking: this one is dest=v (needed in case of intermediaries)
src1=s src2=s src3=v, and also, note, RT == RB

> having fma reduction be a dot product is also valid, easier to implement in
> hardware,

well we are waay past the point where stuff is "easy" :) we are long into
FSMs and micro-coding.

> and more useful:
>     v = a
>     v = b * c + v
>     v = d * e + v
>     v = f * g + v
>
>     v == a + dot(<b, d, f>, <c, e, g>)

this one is dest=v (needed for intermediary results) src1=v src2=v src3=s,
and note, RT == RC
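restating the suggested decode rule above in code (the rule is only a suggestion at this point, and the function name is made up): the two reductions are distinguished purely by which fma source operands are marked scalar (s) or vector (v):

def classify_reduce_fma(src1, src2, src3):
    # scalar/vector markings of the three fma source operands
    if (src1, src2, src3) == ('s', 's', 'v'):
        return "polynomial (Horner) reduction: RT == RB"
    if (src1, src2, src3) == ('v', 'v', 's'):
        return "dot-product reduction: RT == RC"
    return "not a recognised fma reduction pattern"

print(classify_reduce_fma('s', 's', 'v'))   # polynomial
print(classify_reduce_fma('v', 'v', 's'))   # dot-product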
https://bugs.libre-soc.org/show_bug.cgi?id=817#c39

with the precedent being set for 3-in 2-out regs for FMA in integer, i see
no reason why the trend should not continue in FP. is that going to work?
(In reply to Luke Kenneth Casson Leighton from comment #10)
> https://bugs.libre-soc.org/show_bug.cgi?id=817#c39
>
> with the precedent being set for 3-in 2-out regs for FMA in integer, i
> see no reason why the trend should not continue in FP, is that going
> to work?

no, it would require 4-in 2-out:
https://bugs.libre-soc.org/show_bug.cgi?id=817#c40
ok i think i worked it out: konstantinos suggested Matrix Determinant, which
if added as a REMAP Schedule could produce the fmacs needed. the alternative
is a REMAP Indexed Schedule, it just depends how common this is.

https://libre-soc.org/irclog/%23libre-soc.2023-04-29.log.html#t2023-04-29T22:16:42

example: Indexed is just a couple of SHAPEs. for 2x2 it is two 2x8-bit packed
indices, and those can be loaded with ori. oh wait, this is dot-product :)
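for the 2x2 Indexed case, a rough illustration of what "two 2x8-bit packed indices" loadable with ori could look like (the index values here are made up, purely to show the packing):

def pack_8bit_indices(idx):
    # pack a list of 8-bit indices into one immediate, lowest index first
    imm = 0
    for i, x in enumerate(idx):
        imm |= (x & 0xff) << (8 * i)
    return imm

shape0 = pack_8bit_indices([0, 2])     # == 0x0200, fits an ori immediate
shape1 = pack_8bit_indices([1, 3])     # == 0x0301
print(hex(shape0), hex(shape1))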