Prioritising the REMAP CSR, which can reshape VL loops, looks like it will be important for matrix multiplies. Swizzle looks too complex for too little end result for MMul.
Apologies, I hadn't realised quite how important swizzling really is.
I have been looking at the PLX 3D paper and it contains an algorithm for 4x4 matrix times 4x1 vector.
That algorithm is:
fmul f2, f1.xxxx, f10
fmac f2, f1.yyyy, f11, f2
fmac f2, f1.zzzz, f12, f2
fmac f2, f1.wwww, f13, f2
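To make the data flow of those four instructions concrete, here is a minimal Python sketch. It assumes each f-register holds a 4-element vector, that f1 is the input vector, and that f10 to f13 hold the four columns of the matrix (this follows from the arithmetic, since the result is x*f10 + y*f11 + z*f12 + w*f13). The helper names `swizzle` and `mat4_times_vec4_swizzle` are made up for illustration and are not part of any spec.

```python
# Hypothetical model: each register is a list of 4 numbers (one vec4).
LANES = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern):
    """Rearrange vec per pattern; 'xxxx' broadcasts element 0 to all lanes."""
    return [vec[LANES[c]] for c in pattern]


def mat4_times_vec4_swizzle(columns, v):
    """columns[k] stands in for register f10+k (column k of the matrix)."""
    # fmul f2, f1.xxxx, f10
    acc = [a * b for a, b in zip(swizzle(v, "xxxx"), columns[0])]
    # fmac f2, f1.yyyy / .zzzz / .wwww, f11 / f12 / f13, f2
    for pattern, col in zip(("yyyy", "zzzz", "wwww"), columns[1:]):
        broadcast = swizzle(v, pattern)
        acc = [a * b + c for a, b, c in zip(broadcast, col, acc)]
    return acc


# Matrix rows are [1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16],
# stored here as its four columns.
columns = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
result = mat4_times_vec4_swizzle(columns, [1, 0, 2, 0])
print(result)  # [7, 19, 31, 43]
```

Each swizzle is a broadcast of one element of f1 across all four lanes, which is why four differently-swizzled views of the same register are needed.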
The VBLOCK swizzle table format can cope with this in a single block by setting a swizzler on four registers that are *redirected* to f1, each with a different swizzle setting.
Macro-op fusion would result in *doubling* the number of instructions.
Neither is ideal.
For this particular case, however, I am inclined to review the decision to put the REMAP CSR on the back burner.
The REMAP CSRs were intended for matrices; however, I forgot about them after deciding that Vector Mul was not as high a priority.
Swizzle looks to be extremely awkward and costly, making the REMAP CSRs attractive by comparison.
With the right REMAP, setting
* SHAPE1 to operate on a 4-element continuous loop (element offsets 0,1,2,3,0,1,2,3,...), attached to f2
* SHAPE2 to wait 4 elements before incrementing by 1 (element offsets 0,0,0,0,1,1,1,1,...), attached to f1
the Matrix Multiply is LITERALLY reduced to 2 instructions: one clears f2 to 4 zeros, the other is an FMAC with a VL of 16 (no SUBVLs).
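A rough Python model of that two-instruction sequence is sketched below. It is only an illustration under stated assumptions: the two SHAPEs are modelled as plain index functions (`i % 4` and `i // 4`), the matrix is assumed to sit in 16 consecutive elements in column-major order starting at f10, and the names `shape_wrap4`, `shape_hold4` and `mat4_times_vec4_remap` are hypothetical, not taken from the REMAP spec.

```python
def shape_wrap4(i):
    """SHAPE1 model (attached to f2): 0,1,2,3,0,1,2,3,..."""
    return i % 4


def shape_hold4(i):
    """SHAPE2 model (attached to f1): 0,0,0,0,1,1,1,1,..."""
    return i // 4


def mat4_times_vec4_remap(matrix_flat, v):
    """matrix_flat: 16 elements, column-major (assumed layout from f10 up)."""
    acc = [0.0] * 4                      # instruction 1: clear f2 to 4 zeros
    for i in range(16):                  # instruction 2: one FMAC with VL=16
        acc[shape_wrap4(i)] += v[shape_hold4(i)] * matrix_flat[i]
    return acc


# Matrix rows are [1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16].
rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
matrix_flat = [rows[r][c] for c in range(4) for r in range(4)]
v = [1, 0, 2, 0]

remapped = mat4_times_vec4_remap(matrix_flat, v)
# Plain row-times-vector reference for comparison.
reference = [sum(rows[r][c] * v[c] for c in range(4)) for r in range(4)]
print(remapped, reference)  # [7.0, 19.0, 31.0, 43.0] [7, 19, 31, 43]
```

The matrix operand needs no remapping in this model because its 16 elements are read in straight element order; only the accumulator and the input vector have their indices reshaped.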
VL could be set with an SVP-64 instruction, so there is no need to set up a VBLOCK.
The alternative is to add REMAP to VBLOCK.