this one is... annoying/tedious/necessary.
elwidth overrides where srcwid != destwid are such a performance killer, due to lane crossing, that it is better to perform an "in-advance conversion" (making the bitwidth the same across src and dest) than it is to do lane-crossing in every operation.
in addition, an OpenPOWER Scalar FP32 value is spread across the bits of an FP64 register, so that it looks as if it were actually an FP64.
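the bit-spreading can be seen from plain IEEE754 encodings. a minimal sketch in python, using only the standard struct module (the helper names are made up for illustration, not taken from any spec): round a value to FP32, re-encode it as FP64, and note that the low 29 mantissa bits come out as zero.

```python
import struct

def fp32_bits(x):
    """IEEE754 binary32 encoding of x (x is rounded to FP32 by the pack)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp32_spread_to_fp64_bits(x):
    """hypothetical helper: FP64 encoding of the FP32-rounded value of x,
    i.e. a single-precision value laid out in double format."""
    as_fp32 = struct.unpack(">f", struct.pack(">f", x))[0]
    return struct.unpack(">Q", struct.pack(">d", as_fp32))[0]

compact = fp32_bits(1.1)
spread = fp32_spread_to_fp64_bits(1.1)

# FP32 has 23 mantissa bits, FP64 has 52: the extra 29 low bits are zero
low29 = spread & ((1 << 29) - 1)
print(hex(compact), hex(spread), low29)
```

for normal (non-denormal) FP32 values the 23 mantissa bits sit directly above those 29 zero bits, which is why the value "looks like" an FP64 while carrying only FP32 precision.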
both RVV and VSX perform fcvt conversion such that packed FP32 is easy and routine.
fcvt capability is therefore required somehow. the most sensible method is adding an explicit opcode, although there are other methods.
one interesting option for fcvt is to also combine it with fclass when Rc=1, storing the analysis bits in CR1.
format of instructions around p548.
this is where the special-casing should be performed, jacob: not as part of the SV loop, but by explicitly encoding and defining this particular operation such that:
* input formats (src) are DEFINED as being OpenPOWER bit-spread at the src elwidth (defaults to FP64)
* output formats (dest) are DEFINED as being "compacted and in the sensible sane way".
both of these would reduce to a null-operation (fmv) when srcwid == destwid.
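as a rough illustration of the idea, here is a minimal python sketch of such an element-wise conversion. everything in it is an assumption: the names (`fcvt_elwidth`, `convert_vector`), the use of plain IEEE754 binary16/32/64 encodings at each elwidth, and the lack of any rounding-mode or exception-flag handling. it only shows the shape: decode at the src elwidth, re-encode at the dest elwidth, and fall back to a plain move when the widths match.

```python
import struct

# struct codes for IEEE754 binary16/32/64, keyed by elwidth in bits:
# (float format code, same-sized unsigned integer code)
FP_FMT = {16: ("e", "H"), 32: ("f", "I"), 64: ("d", "Q")}

def fcvt_elwidth(src_bits, srcwid, destwid):
    """hypothetical: convert one element from srcwid to destwid encoding."""
    if srcwid == destwid:
        return src_bits  # nothing to convert: behaves as a plain move
    sfmt, sint = FP_FMT[srcwid]
    dfmt, dint = FP_FMT[destwid]
    # reinterpret the source bits as a float of the source width
    value = struct.unpack(">" + sfmt, struct.pack(">" + sint, src_bits))[0]
    # re-encode at the destination width (struct does the rounding;
    # it raises OverflowError if the value does not fit when narrowing)
    return struct.unpack(">" + dint, struct.pack(">" + dfmt, value))[0]

def convert_vector(elements, srcwid, destwid):
    """apply the conversion once, up-front, to every element."""
    return [fcvt_elwidth(e, srcwid, destwid) for e in elements]

# 1.5 as FP64 bits, narrowed to FP32 and then to FP16
as32 = fcvt_elwidth(0x3FF8000000000000, 64, 32)
as16 = fcvt_elwidth(as32, 32, 16)
```

doing this once, in advance, leaves every follow-up operation with srcwid == destwid, which is the point of avoiding the lane-crossing case.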
it *might* be possible, with some careful analysis, to allow for fmv itself to perform this conversion process.
fmr: see p150 of v3.0B, section 4.6.5.
my feeling is that it would be reasonable to have these perform fcvt between the src elwidth and dest elwidth, so that follow-up FP operations which are also elwidth-overridden "understand" (operate on) the exact same FP format.
which brings us to an interesting point: what the heck does running single-precision FP ops on elwidth=32 even mean??
single-precision FP ops on elwidth=default mean "do the op @ FP32 but distribute the bits across FP64".
my feeling is that this behaviour should be preserved at lower elwidths,
i.e. at elwidth=32: "do the op @ FP16 but distribute the bits across FP32".
i.e. "single precision" ops are redefined to mean "do the op at half the precision of the elwidth".
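a minimal python sketch of that proposed rule, purely as an illustration: `round_to` and `fadds_elwidth` are made-up names, the model works on python floats rather than register bits, and it is not bit-exact (the sum is first rounded to double by python, then rounded again to the narrower format). elwidth=16 is deliberately not handled, since what "half of FP16" would mean is not defined here.

```python
import struct

# struct codes for IEEE754 binary16/32/64, keyed by width in bits
FP_FMT = {16: "e", 32: "f", 64: "d"}

def round_to(value, width):
    """round a python float to IEEE754 binary<width>, returned as a float."""
    fmt = ">" + FP_FMT[width]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def fadds_elwidth(a, b, elwidth=64):
    """hypothetical model of a 'single precision' add under elwidth override:
    compute at half the elwidth, hold the result in the elwidth-wide format."""
    half = elwidth // 2               # 64 -> FP32, 32 -> FP16
    result = round_to(a + b, half)    # NOTE: double rounding, not bit-exact
    return round_to(result, elwidth)  # widening back is exact

# elwidth=default: the op is done @ FP32, so a 2**-12 increment survives
r64 = fadds_elwidth(1.0, 2.0 ** -12, elwidth=64)
# elwidth=32: the op is done @ FP16 (10 mantissa bits), so it is lost
r32 = fadds_elwidth(1.0, 2.0 ** -12, elwidth=32)
print(r64, r32)
```

the second result is exactly representable in FP16, so it could be stored in a 16-bit dest element without any further rounding.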
fascinatingly, if this is followed and the dest elwidth is *also* FP16, then there is a way to get faster computation even when the src elwidth is FP32-formatted.