Bug 555 - encode a dotproduct in a single instruction
Summary: encode a dotproduct in a single instruction
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Alexandre Oliva
URL:
Depends on:
Blocks: 213
Reported: 2020-12-23 19:22 GMT by Luke Kenneth Casson Leighton
Modified: 2023-04-30 13:14 BST
CC: 3 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Description Luke Kenneth Casson Leighton 2020-12-23 19:22:14 GMT
analyse how to use mapreduce mode to do dotproduct
Comment 1 Luke Kenneth Casson Leighton 2020-12-23 19:23:20 GMT
reduce on FMA might do it, except that mapreduce is insane on 3-op instructions: they're normally excluded.

therefore we might as well change the meaning of fma to *be* dotproduct.
Comment 2 Luke Kenneth Casson Leighton 2020-12-23 19:26:58 GMT
alexandre says that fma might actually make sense to have 3-op, because of matrix multiply
Comment 3 Luke Kenneth Casson Leighton 2020-12-23 19:27:58 GMT
jacob notes: fma reduction would be a polynomial reduction but it would be a Bad Idea (tm) to implement in hardware
Comment 4 Alexandre Oliva 2020-12-23 19:29:24 GMT
fma can be used in reduce mode for dot-product

but matrix multiply amounts to multiple parallel dot-products, so you wouldn't want non-reduce fma for that

what I can't quite picture as useful (which definitely is not authoritative) is
reduce on the multiply, rather than on the add.
Comment 5 Alexandre Oliva 2020-12-23 19:38:36 GMT
fma reducing on the multiply can be used for the same case that one would use a reduce multiply followed by a scalar add, so it is likely useful too
Comment 6 Luke Kenneth Casson Leighton 2020-12-23 19:58:58 GMT
reminder of what dotproduct is:

    result = 0
    for i in range(len(a)):
        result += a[i] * b[i]

which makes sense for fma as long as RC starts out as zero.
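As a minimal runnable sketch of that reference semantics (plain Python standing in for the reduce-mode fma; the example vectors are made up for illustration):

```python
# Reference model of a reduce-mode FMA dot product.
# The accumulator (RC in the comment above) starts out as zero.
def dotproduct(a, b):
    result = 0
    for x, y in zip(a, b):
        result = x * y + result  # one fused multiply-add per element
    return result

print(dotproduct([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```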
Comment 7 Luke Kenneth Casson Leighton 2020-12-23 20:00:40 GMT
(In reply to Alexandre Oliva from comment #4)
> fma can be used in reduce mode for dot-product
> 
> but matrix multiply amounts to multiple parallel dot-products, so you
> wouldn't want non-reduce fma for that

in my mind it would make sense simply to do an outer for-loop on a reduce-fma

> what I can't quite picture as useful (which definitely is not authoritative)
> is reduce on the multiply, rather than on the add.

can you write that out in pseudocode?
Comment 8 Jacob Lifshay 2020-12-23 20:11:25 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)
> jacob notes: fma reduction would be a polynomial reduction but it would be a
> Bad Idea (tm) to implement in hardware

*could* be a polynomial reduction:
v = a
v = x * v + b
v = x * v + c
v = x * v + d

produces:
v == d + x * c + x^2 * b + x^3 * a


having fma reduction be a dot product is also valid, easier to implement in hardware, and more useful:
v = a
v = b * c + v
v = d * e + v
v = f * g + v

v == a + dot(<b, d, f>, <c, e, g>)
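Both identities from comment 8 can be checked directly in Python (the numeric values of a..g and x below are arbitrary test inputs, not from the original):

```python
# Check of comment 8's two fma-reduction interpretations.
a, b, c, d, e, f, g, x = 2, 3, 5, 7, 11, 13, 17, 4

# polynomial (Horner) reduction: v = x * v + next_coefficient
v = a
for coeff in (b, c, d):
    v = x * v + coeff
assert v == d + x*c + x**2*b + x**3*a

# dot-product reduction: v = pair_product + v
v = a
for p, q in ((b, c), (d, e), (f, g)):
    v = p * q + v
assert v == a + (b*c + d*e + f*g)
print("both identities hold")
```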
Comment 9 Luke Kenneth Casson Leighton 2020-12-23 22:25:46 GMT
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > jacob notes: fma reduction would be a polynomial reduction but it would be a
> > Bad Idea (tm) to implement in hardware
> 
> *could* be a polynomial reduction:
> v = a
> v = x * v + b
> v = x * v + c
> v = x * v + d
> 
> produces:
> v == d + x * c + x^2 * b + x^3 * a

once a, b and c are factored out, yes.
the above is, with substitution,

   d + (x * (c + (x * (b + (x * a)))))

which i think may be doable with some overlapping fmas (no reduce required)
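That nested form and Jacob's iterative reduction compute the same value, which a quick check confirms (coefficients are example numbers, not from the original):

```python
# The nested (overlapping-fma) form equals the iterative Horner loop.
a, b, c, d, x = 2, 3, 5, 7, 4

nested = d + (x * (c + (x * (b + (x * a)))))

v = a
for coeff in (b, c, d):
    v = x * v + coeff  # one fma per step

assert nested == v
print(nested)
```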



the polynomial version: i love it.  it's so cool that i think we should give it a shot.  interestingly, it may be possible to detect it from the src/dest scalar/vector marking.

this one is

dest=v (needed in case of intermediaries)
src1=s
src2=s
src3=v

and also, note, RT == RB
 
> 
> having fma reduction be a dot product is also valid, easier to implement in
> hardware, 

well we are waay past the point where stuff is "easy" :)  we are long into FSMs and micro-coding.

> and more useful:
> v = a
> v = b * c + v
> v = d * e + v
> v = f * g + v
> 
> v == a + dot(<b, d, f>, <c, e, g>)

this one is

dest=v (needed for intermediary results)
src1=v
src2=v
src3=s

and note, RT == RC
Comment 10 Luke Kenneth Casson Leighton 2022-05-13 20:30:04 BST
https://bugs.libre-soc.org/show_bug.cgi?id=817#c39

with the precedent being set for 3-in 2-out regs for FMA in integer, i
see no reason why the trend should not continue in FP, is that going
to work?
Comment 11 Jacob Lifshay 2022-05-13 21:03:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> https://bugs.libre-soc.org/show_bug.cgi?id=817#c39
> 
> with the precedent being set for 3-in 2-out regs for FMA in integer, i
> see no reason why the trend should not continue in FP, is that going
> to work?

no, it would require 4-in 2-out:
https://bugs.libre-soc.org/show_bug.cgi?id=817#c40
Comment 12 Luke Kenneth Casson Leighton 2023-04-30 13:14:34 BST
ok i think i worked it out: konstantinos suggested Matrix Determinant,
which, if added as a REMAP Schedule, could produce the fmacs needed.
the alternative is a REMAP Indexed Schedule; it just depends on how common
this is.

https://libre-soc.org/irclog/%23libre-soc.2023-04-29.log.html#t2023-04-29T22:16:42

example: Indexed is just a couple of SHAPEs, for 2x2 it is two 2x8-bit
packed indices, those can be loaded with ori.  oh wait this is
dotproduct :)