> The reason is that GPU code has scalar and vector code mixed together
> everywhere, so SV not having separate scalar instructions could increase
> instruction count by >10% due to the additional setvl instructions. It
> could also greatly increase pipeline stalls on dynamic setvl, since the
> VL-loop stage has to wait for the setvl to execute to know what VL to use
> (for setvli a stall isn't needed since the decoder can just use the value
> from the immediate field, no wait required). This comes from the basic compiler
> optimization of using scalar instructions to do an operation once, then
> splat if needed, instead of using vector instructions when all vector
> elements are known to be identical. Basically all modern GPUs have that
> optimization, also described as detecting the difference between uniform
> and varying SIMT variables before the whole-function-vectorization pass. If
> that optimization isn't done, power consumption increases substantially
> and the code takes longer to run due to the many additional
> instructions jamming up the issue width.
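
To make the uniform/varying distinction concrete, here is a minimal sketch of that classification over straight-line code. Everything in it is an assumption made for illustration: the (dest, operands) tuple format, the function name classify_uniformity and the "lane_id" input are hypothetical, and it ignores divergent control flow, which a real compiler pass would also have to handle.

```python
def classify_uniformity(instrs, varying_inputs):
    """Label each destination as 'uniform' or 'varying'.

    instrs: list of (dest, [operands]) in program order (hypothetical format).
    varying_inputs: names that differ per SIMT lane, e.g. a lane id.
    """
    varying = set(varying_inputs)
    kinds = {}
    for dest, operands in instrs:
        if any(op in varying for op in operands):
            # Depends on a per-lane value: needs a vector instruction.
            varying.add(dest)
            kinds[dest] = "varying"
        else:
            # Identical in every lane: compute once with a scalar
            # instruction and splat only if a vector use needs it.
            kinds[dest] = "uniform"
    return kinds


program = [
    ("a", ["k1", "k2"]),      # only uniform inputs
    ("b", ["a", "lane_id"]),  # touches the lane id
    ("c", ["b", "a"]),        # depends on a varying value
]
kinds = classify_uniformity(program, {"lane_id"})
```

Only the destinations marked "uniform" are candidates for scalar instructions; the rest stay as vector operations.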
> Also, in my mind SV should always have had fully separate scalar
> arguments/instructions, since otherwise we get a half-done attempt at
> having scalar code that makes the compiler *more* complex. The compiler
> already has to handle separate codepaths for scalar and vector
> instructions; without the ISA-level concept of scalar SVP64
> instructions, translating scalar instructions gains many more special
> cases, since they may need to be converted to effectively vector
> instructions with VL guaranteed to be non-zero. That often requires
> saving VL (if it was modified by fail-on-first), overwriting VL, running
> the scalar op, then overwriting VL again to restore VL for the surrounding
> vector ops.
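
The extra work described above can be sketched as a toy lowering function. This is only an illustration under stated assumptions: SAVE_VL, SET_VL_1 and RESTORE_VL are placeholder names, not real mnemonics, and lower_scalar_op is a hypothetical helper rather than anything in an existing compiler.

```python
def lower_scalar_op(op, isa_has_scalar_bit, vl_changed_by_ffirst):
    """Return the sequence emitted for one scalar op inside vector code.

    The strings are placeholders for illustration, not real instructions.
    """
    if isa_has_scalar_bit:
        # With an ISA-level scalar concept the op is emitted as-is.
        return [op]
    seq = []
    if vl_changed_by_ffirst:
        # VL is only known at runtime, so it must be saved first.
        seq.append("SAVE_VL")
    # Force a non-zero VL, run the op, then put VL back for the
    # surrounding vector ops.
    seq += ["SET_VL_1", op, "RESTORE_VL"]
    return seq


with_bit = lower_scalar_op("add", True, True)
without_bit = lower_scalar_op("add", False, True)
```

In this model one scalar op costs one instruction with the scalar concept and three or four without it, which is where the extra instruction count and the compiler special cases come from.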
> An alternative option that achieves the same end goal without needing to
> move the decoder is to use the scalar/vector-bit for the first/dest reg
> (which is always in the same spot -- instruction forms without a dest reg
> can have their SVP64 register fields moved one reg field over to make
> space) as a whole-instruction scalar/vector-bit. The operations that this
> removes (those with scalar dest but vector arguments -- which are not
> common instructions) can be effectively substituted with scalar mv.x.
> Since the bit is always in the same spot and all instructions have that
> bit, decoding it from the SVP64 prefix then becomes utterly trivial.
> This also simplifies the logic for the SV loop FSM since it no longer needs
> to implement the write-once-then-finish logic which I expect to be quite
Move the register decoder (required in the fetch pipeline anyway to correctly add the instruction to the scheduling matrices) to before the VL-loop stage, allowing us to use it to get the vector/scalar bits of all SVP64 register fields and OR them together to form a whole-instruction vector/scalar bit.
> Decoding should only take a few more gates (around 10, fewer than 100) since
> you just have a few separate gates to OR all vector/scalar SVP64 bits
> together for each SVP64 prefix kind (there's only around 5 kinds) and use a
> mux based on the decoded number of registers (which I expect the decoder to
> need anyway for dependency matrix stuff) to select which OR gate's output
> to use. This produces a vector_and_not_scalar signal that should be easy to
> add to the VL-loop stage.
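
A behavioural sketch of the OR-then-mux decode described above follows. It is a software model, not HDL, and it makes simplifying assumptions: each "prefix kind" is stood in for by a register count, the per-register vector/scalar bits arrive as a list in field order, and the function name is hypothetical.

```python
def vector_and_not_scalar(vec_bits, num_regs):
    """Model of the whole-instruction vector/scalar bit.

    vec_bits: per-register-field flags, True = vector, False = scalar.
    num_regs: decoded number of register fields the instruction uses.
    """
    # One OR reduction per possible register count, standing in for
    # "one OR gate per SVP64 prefix kind".
    or_outputs = [any(vec_bits[:n]) for n in range(len(vec_bits) + 1)]
    # The mux: the decoded register count selects which OR output to use.
    return or_outputs[num_regs]


# Third field is marked vector, but the instruction only uses two fields.
two_reg_result = vector_and_not_scalar([False, False, True], 2)
three_reg_result = vector_and_not_scalar([False, False, True], 3)
```

Fields beyond the decoded register count are ignored, so stale bits in unused prefix fields cannot turn a scalar instruction into a vector one in this model.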
None of the alternatives increases the number of instructions, since all that happens is that we reinterpret some combinations of vector/scalar register arguments as making the instruction bypass the VL loop, so that it executes once no matter what value VL currently has. These instructions will ignore vstart and not modify it, since vstart is only used/modified by the VL-loop. This doesn't need a new opcode mnemonic. The SUBVL loop still runs, which allows SUBVL=1 for scalar operations and SUBVL=2/3/4 for SIMT-uniform vec2/3/4 operations.
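
The proposed semantics can be modelled in a few lines. This is a sketch of the proposal only, not of any existing SV implementation; the function name and the (element, subelement) tuples are hypothetical, and resetting vstart to 0 when the VL loop completes is an assumption of this model.

```python
def issue(vl, vstart, is_vector, subvl=1):
    """Return ([(element, subelement), ...] executed, vstart afterwards)."""
    if not is_vector:
        # Whole-instruction scalar: bypass the VL loop and execute once
        # whatever VL is. vstart is neither read nor written. The SUBVL
        # loop still runs.
        return [(0, s) for s in range(subvl)], vstart
    # Normal vector case: the VL loop runs from vstart up to VL.
    executed = [(i, s) for i in range(vstart, vl) for s in range(subvl)]
    return executed, 0  # assumption: vstart is cleared on completion


scalar_at_vl0 = issue(vl=0, vstart=0, is_vector=False)
vector_at_vl0 = issue(vl=0, vstart=0, is_vector=True)
uniform_vec3 = issue(vl=8, vstart=2, is_vector=False, subvl=3)
```

Note that the scalar case still executes with VL=0, which is exactly the behaviour that removes the save/overwrite/restore sequence around scalar ops.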
conceptually this one is a "no". anything which relies on checking VL
and changing behaviour is unworkable from a Dependency Hazard perspective,
as well as breaking the rule of abstraction and total independence between
Prefix and Suffix.
i've just experimented with this for FFT and bit-reversed instructions,
breaking this fundamentally critical design rule, and it has been absolute