Bug 143 - REMAP CSR for Matrix Multiplies
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/remap
Depends on:
Blocks: 213
Reported: 2019-10-07 13:01 BST by Luke Kenneth Casson Leighton
Modified: 2022-07-17 12:16 BST
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
Description Luke Kenneth Casson Leighton 2019-10-07 13:01:01 BST
Prioritising the REMAP CSR, which can reshape VL loops, looks like it will be important for Matrix multiplies.  Swizzle looks to be too complex for too little end result for MMul.
Comment 1 Luke Kenneth Casson Leighton 2019-10-07 13:01:11 BST
Apologies, I hadn't realised quite how important swizzling really is.


I have been looking at the PLX 3D paper and it contains an algorithm for 4x4 matrix times 4x1 vector.

That algorithm is:

fmul f2, f1.xxxx, f10
fmac f2, f1.yyyy, f11, f2
fmac f2, f1.zzzz, f12, f2
fmac f2, f1.wwww, f13, f2
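
For anyone checking the arithmetic, here is a rough Python model of those four instructions. It is only a sketch: it assumes f10..f13 each hold one 4-element column of the matrix and that f1 holds the vector, which the listing above does not state explicitly. The names swizzle_broadcast and matvec_swizzle are made up for illustration.

```python
def swizzle_broadcast(vec, lane):
    """Model an .xxxx/.yyyy/.zzzz/.wwww swizzle: copy one lane into all four."""
    return [vec[lane]] * 4


def matvec_swizzle(columns, v):
    """Model the fmul + 3x fmac sequence.

    columns: four 4-element lists standing in for f10..f13 (assumed layout).
    v:       4-element list standing in for f1.
    Returns the 4-element result that would end up in f2.
    """
    # fmul f2, f1.xxxx, f10
    f2 = [a * b for a, b in zip(swizzle_broadcast(v, 0), columns[0])]
    # fmac f2, f1.yyyy/.zzzz/.wwww, f11/f12/f13, f2
    for lane in (1, 2, 3):
        bcast = swizzle_broadcast(v, lane)
        f2 = [a * b + acc for a, b, acc in zip(bcast, columns[lane], f2)]
    return f2


if __name__ == "__main__":
    cols = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(matvec_swizzle(cols, [1, 1, 1, 1]))
```

Each step needs a different broadcast of f1, which is why the swizzle setting has to change on every instruction.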

VBLOCK swizzle table format can cope with this in a single block by setting a swizzler onto four registers that are *redirected* to f1, each with a different swizzle setting.

Macro op fusion would result in *doubling* the number of instructions.

Neither is ideal.

For this particular case however I am inclined to review the decision to put the REMAP CSR on the back burner.


The REMAP CSRs were intended for Matrices; however, I forgot about them after thinking that Vector Mul was not as high a priority.

Swizzle looks to be extremely awkward and costly, making the REMAP CSRs attractive by comparison.

With the right REMAP, setting

* SHAPE1 to operate on a 4-element continuous loop, attached to f2
* SHAPE2 to wait 4 elements before incrementing by 1, attached to f1

the Matrix Multiply is LITERALLY reduced to 2 instructions, one of which is to clear out f2 to 4 zeros, the other is an FMAC with a VL of 16 (no SUBVLs).
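
A rough Python model of what that single FMAC would do under the two shapes, as a sanity check of the index pattern. This is a sketch under assumptions: the 16 matrix elements are taken to be contiguous (f10..f13 back to back, one column per register), SHAPE1 is modelled as i % 4 and SHAPE2 as i // 4, and the function names are made up for illustration. It is not the actual REMAP CSR encoding.

```python
VL = 16  # vector length for the single FMAC, no SUBVL


def shape1(i):
    """Assumed SHAPE1 behaviour: 4-element continuous (wrapping) loop, for f2."""
    return i % 4


def shape2(i):
    """Assumed SHAPE2 behaviour: wait 4 elements, then increment by 1, for f1."""
    return i // 4


def matvec_remap(matrix_flat, v):
    """Model 'clear f2, then one FMAC with VL=16' using remapped indices.

    matrix_flat: 16 elements, assumed to be f10..f13 laid out contiguously.
    v:           4-element list standing in for f1.
    """
    f2 = [0.0] * 4  # instruction 1: clear f2 to four zeros
    for i in range(VL):  # instruction 2: FMAC over the VL loop
        f2[shape1(i)] += v[shape2(i)] * matrix_flat[i]
    return f2


if __name__ == "__main__":
    print(matvec_remap(list(range(1, 17)), [1, 1, 1, 1]))
```

The matrix operand is walked linearly with no remap at all; only the accumulator and the vector operand need a shape.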

VL could be set with an SVP-64 instruction, so there is no need to set up a VBLOCK.

The alternative is to add REMAP to VBLOCK.