analyse how to use mapreduce mode to do dot-product
reduce on FMA might do it. mapreduce on 3-op instructions is insane for most of them, so they're normally excluded: therefore we might as well define the meaning of reduce-mode fma to *be* dot-product.
alexandre says that fma might actually make sense as a 3-op reduce, because of matrix multiply
jacob notes: fma reduction would be a polynomial reduction but it would be a Bad Idea (tm) to implement in hardware
fma can be used in reduce mode for dot-product

but matrix multiply amounts to multiple parallel dot-products, so you
wouldn't want non-reduce fma for that

what I can't quite picture as useful (which is definitely not authoritative)
is reduce on the multiply, rather than on the add.
fma reducing on the multiply can be used for the same case where one would use a reduce-multiply followed by a scalar add, so it is likely useful too
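a rough sketch of one possible reading of "reduce on the multiply" (this interpretation is an assumption, nothing decided): the accumulator chains through a multiply operand of fma(a, b, c) = a*b + c, so with the addend held at zero it degenerates to a plain reduce-multiply, and a final scalar add then gives exactly the "reduce multiply followed by a scalar add" case:

def reduce_fma_on_multiply(bs, c):
    # accumulator v chains through a multiply operand; addend c applied each step
    v = bs[0]
    for x in bs[1:]:
        v = v * x + c
    return v

# with c == 0 this is just a reduce-multiply; adding the scalar afterwards
# recovers "reduce multiply followed by a scalar add"
assert reduce_fma_on_multiply([2, 3, 4], 0) + 5 == 2*3*4 + 5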
reminder of what dot-product is:

result = 0
for i in range(len(a)):
    result += a[i] * b[i]

which makes sense for fma as long as RC starts out as zero.
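written out as an fma chain (plain python, just to illustrate the point about RC): the accumulator sits in the addend position and starts out as zero:

def fma(a, b, c):
    return a * b + c

def dot_product(a, b):
    result = 0                         # RC starts out as zero
    for x, y in zip(a, b):
        result = fma(x, y, result)     # accumulator chained through the addend
    return result

assert dot_product([1, 2, 3], [4, 5, 6]) == 1*4 + 2*5 + 3*6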
(In reply to Alexandre Oliva from comment #4)
> fma can be used in reduce mode for dot-product
>
> but matrix multiply amounts to multiple parallel dot-products, so you
> wouldn't want non-reduce fma for that

in my mind it would make sense simply to do an outer for-loop on a reduce-fma

> what I can't quite picture as useful (which is definitely not authoritative)
> is reduce on the multiply, rather than on the add.

can you write that out in pseudocode?
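to illustrate the "outer for-loop on a reduce-fma" idea (plain python only, not SVP64 semantics): each output element of the matrix multiply is one dot-product, i.e. one reduce-fma:

def dot_fma(a, b):
    acc = 0
    for x, y in zip(a, b):
        acc = x * y + acc              # one reduce-fma per output element
    return acc

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[dot_fma(A[i], [B[k][j] for k in range(inner)])
             for j in range(cols)]
            for i in range(rows)]

assert matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]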
(In reply to Luke Kenneth Casson Leighton from comment #3)
> jacob notes: fma reduction would be a polynomial reduction but it would be
> a Bad Idea (tm) to implement in hardware

*could* be a polynomial reduction:

v = a
v = x * v + b
v = x * v + c
v = x * v + d

produces:

v == d + x * c + x^2 * b + x^3 * a

having fma reduction be a dot product is also valid, easier to implement in
hardware, and more useful:

v = a
v = b * c + v
v = d * e + v
v = f * g + v

v == a + dot(<b, d, f>, <c, e, g>)
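a quick runnable check of both identities (plain python, illustrative only):

def fma(a, b, c):
    return a * b + c

# polynomial (Horner) reduction: accumulator chained through the multiply
a, b, c, d, x = 2, 3, 5, 7, 11
v = a
for coeff in (b, c, d):
    v = fma(x, v, coeff)
assert v == d + x*c + x**2*b + x**3*a

# dot-product reduction: accumulator chained through the addend
pairs = [(2, 3), (4, 5), (6, 7)]
v = 1                                  # the initial "a"
for p, q in pairs:
    v = fma(p, q, v)
assert v == 1 + sum(p*q for p, q in pairs)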
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > jacob notes: fma reduction would be a polynomial reduction but it would
> > be a Bad Idea (tm) to implement in hardware
>
> *could* be a polynomial reduction:
>     v = a
>     v = x * v + b
>     v = x * v + c
>     v = x * v + d
>
> produces:
>     v == d + x * c + x^2 * b + x^3 * a

once a, b and c are factored out, yes. the above is (with substitution):

d + (x * (c + (x * (b + (x * a)))))

which i think may be doable with some overlapping fmas (no reduce required).

the polynomial version: i love it. it's so cool that i think we should give
it a shot. interestingly it may be possible to detect from the src/dest
scalar/vector marking: this one is dest=v (needed in case of intermediaries)
src1=s src2=s src3=v, and also, note, RT == RB

> having fma reduction be a dot product is also valid, easier to implement in
> hardware,

well we are waay past the point where stuff is "easy" :) we are long into
FSMs and micro-coding.

> and more useful:
>     v = a
>     v = b * c + v
>     v = d * e + v
>     v = f * g + v
>
>     v == a + dot(<b, d, f>, <c, e, g>)

this one is dest=v (needed for intermediary results) src1=v src2=v src3=s,
and note, RT == RC
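restating the suggested decode rule above in code (the rule is only a suggestion at this point, and the function name is made up): the two reductions are distinguished purely by which fma source operands are marked scalar (s) or vector (v):

def classify_reduce_fma(src1, src2, src3):
    # scalar/vector markings of the three fma source operands
    if (src1, src2, src3) == ('s', 's', 'v'):
        return "polynomial (Horner) reduction: RT == RB"
    if (src1, src2, src3) == ('v', 'v', 's'):
        return "dot-product reduction: RT == RC"
    return "not a recognised fma reduction pattern"

print(classify_reduce_fma('s', 's', 'v'))   # polynomial
print(classify_reduce_fma('v', 'v', 's'))   # dot-product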
https://bugs.libre-soc.org/show_bug.cgi?id=817#c39

with the precedent being set for 3-in 2-out regs for FMA in integer, i see
no reason why the trend should not continue in FP. is that going to work?
(In reply to Luke Kenneth Casson Leighton from comment #10)
> https://bugs.libre-soc.org/show_bug.cgi?id=817#c39
>
> with the precedent being set for 3-in 2-out regs for FMA in integer, i
> see no reason why the trend should not continue in FP, is that going
> to work?

no, it would require 4-in 2-out:
https://bugs.libre-soc.org/show_bug.cgi?id=817#c40
ok i think i worked it out: konstantinos suggested Matrix Determinant, which
if added as a REMAP Schedule could produce the fmacs needed. the alternative
is a REMAP Indexed Schedule, it just depends how common this is.

https://libre-soc.org/irclog/%23libre-soc.2023-04-29.log.html#t2023-04-29T22:16:42

example: Indexed is just a couple of SHAPEs. for 2x2 it is two 2x8-bit packed
indices, and those can be loaded with ori. oh wait, this is dot-product :)
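for the 2x2 Indexed case, a rough illustration of what "two 2x8-bit packed indices" loadable with ori could look like (the index values here are made up, purely to show the packing):

def pack_8bit_indices(idx):
    # pack a list of 8-bit indices into one immediate, lowest index first
    imm = 0
    for i, x in enumerate(idx):
        imm |= (x & 0xff) << (8 * i)
    return imm

shape0 = pack_8bit_indices([0, 2])     # == 0x0200, fits an ori immediate
shape1 = pack_8bit_indices([1, 3])     # == 0x0301
print(hex(shape0), hex(shape1))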